Question

I am looking for a regular expression in R to extract the field names from an .sdf chemical data file. Each field name is enclosed in < > and follows a "> " at the start of a line. For example, given

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it would have to return

fields=c("FIELD1","FIELD2","FIELD3")

(Field names can occur multiple times, so I only need the unique() ones.) Any thoughts?

cheers, Tom


Solution

Try this. It extracts the portion of the string matched by the part of the regular expression surrounded by parentheses and then simplifies it using unique:

library(gsubfn)
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"

Revised with a minor simplification.
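The pattern above matches any <...> pair, including ones that might appear inside field text. If that is a concern, the same capture-group idea can be sketched in base R while anchoring on the "> " at the start of a line, as the question describes. This is only a sketch: the sample string (with a stray "<not a field>" in the data and a repeated FIELD1) is made up for illustration.

```r
# Made-up sample: "<not a field>" sits inside field text and must be ignored.
string <- ">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2 <not a field>\n\n>  <FIELD1>\nfield text3"

# (?m) lets ^ match at the start of every line, so only header lines
# of the form ">  <NAME>" are picked up.
headers <- regmatches(string, gregexpr("(?m)^>\\s*<[^>]+>", string, perl = TRUE))[[1]]

# Keep only the captured name between the angle brackets, then deduplicate.
fields <- unique(sub("^>\\s*<([^>]+)>$", "\\1", headers, perl = TRUE))
fields
```

For this sample, fields should contain only FIELD1 and FIELD2.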

Other tips

You can use gregexpr and regmatches to extract the substrings and unique to remove duplicates.

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
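Note that \\w+ only matches letters, digits, and underscores. If your field names can contain other characters (spaces or dots, say), a character class that excludes the angle brackets may be a safer choice. A small comparison, assuming a made-up field name "Mol. Weight":

```r
# Made-up input: the second field name contains a dot and a space.
string <- ">  <FIELD1>\ntext1\n\n>  <Mol. Weight>\ntext2\n\n>  <FIELD1>\ntext3"

# \\w+ cannot span the dot or the space, so "Mol. Weight" is skipped.
word_only <- unique(regmatches(string,
  gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])

# [^<>]+ accepts any run of characters other than angle brackets.
any_name <- unique(regmatches(string,
  gregexpr("(?<=<)[^<>]+(?=>)", string, perl = TRUE))[[1]])
```

Here word_only should hold just FIELD1, while any_name also picks up "Mol. Weight".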
License: CC BY-SA with attribution
Not affiliated with StackOverflow