Question

I am looking for a regular expression in R to extract the field names from an .sdf chemical data file. The field names are delimited by < > and follow a "> " at the start of a line. For example, given

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it should return

fields=c("FIELD1","FIELD2","FIELD3")

(A field could occur multiple times, so I would need only the unique() ones.) Any thoughts?

cheers, Tom


Solution

Try this. strapplyc from the gsubfn package extracts the portion of each match captured by the parenthesized part of the regular expression, and simplify = unique then removes the duplicates:

library(gsubfn)
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"

REVISED minor simplification.
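
Note that "<([^>]*)>" matches angle-bracketed text anywhere in the string, including inside a field's value. If that matters, a base R sketch along the following lines restricts matching to lines that start with ">", as described in the question. The input sdf_text and the names header_pattern, headers and fields are hypothetical, chosen only for illustration, and the sketch assumes the file has already been read into a single string with "\n" line endings.

```r
# Hypothetical input: FIELD1 occurs twice, and one value contains "<note>"
sdf_text <- ">  <FIELD1>\nsee <note> here\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD1>\nagain"

# Split into lines so the pattern can be anchored at the start of each line
lines <- strsplit(sdf_text, "\n", fixed = TRUE)[[1]]

# A header line: ">", optional whitespace, then the name inside < >
header_pattern <- "^>[[:space:]]*<([^>]+)>.*$"

# Keep only the header lines; value lines such as "see <note> here" are dropped
headers <- grep(header_pattern, lines, value = TRUE)

# Replace each header line by its captured name, then remove duplicates
fields <- unique(sub(header_pattern, "\\1", headers))
fields
# [1] "FIELD1" "FIELD2"
```

Working line by line avoids the need for multi-line regex flags, at the cost of one extra strsplit call.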

Other suggestions

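You can use gregexpr and regmatches to extract the substrings, and unique to remove duplicates. The lookbehind (?<=<) and lookahead (?=>) require the angle brackets without including them in the match, which is why perl = TRUE is needed.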

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
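
One caveat: \\w+ only matches letters, digits and underscores, so a field name containing any other character would be skipped. A possible variant, shown here on a hypothetical input (string2 and the field name MELTING.POINT are made up for illustration), swaps \\w+ for [^<>]+ so that any character other than the brackets is accepted:

```r
# Hypothetical input: one field name contains a dot, and FIELD1 occurs twice
string2 <- ">  <FIELD1>\ntext\n\n>  <MELTING.POINT>\n42\n\n>  <FIELD1>\nmore text"

# With \\w+ the dot breaks the match, so MELTING.POINT is not found
strict <- unique(regmatches(string2, gregexpr("(?<=<)\\w+(?=>)", string2, perl = TRUE))[[1]])
strict
# [1] "FIELD1"

# [^<>]+ accepts any run of characters that are not angle brackets
fields2 <- unique(regmatches(string2, gregexpr("(?<=<)[^<>]+(?=>)", string2, perl = TRUE))[[1]])
fields2
# [1] "FIELD1"        "MELTING.POINT"
```

Like the original pattern, this variant does not check that the match sits on a line starting with ">".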
Licensed under: CC-BY-SA with attribution