Question

I am looking for a regular expression in R to extract the field names from an .sdf chemical data file. Each field name is enclosed in < > and follows a ">" plus whitespace at the start of a line. E.g. in the case of

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it would have to return

fields=c("FIELD1","FIELD2","FIELD3")

(Field names can occur multiple times, so I need only the unique() ones.) Any thoughts?

cheers, Tom


Solution

Try this. It extracts the portion of the string matched by the part of the regular expression surrounded by parentheses and then simplifies it using unique:

library(gsubfn)
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"

REVISED minor simplification.
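
The pattern above matches any <...> in the string, including ones that appear inside field values. If that matters, one possible base R variant anchors the match to a ">" at the start of a line, as described in the question. This is only a sketch: the extended sample string (the repeated FIELD1 header and the data line containing brackets) is made up for illustration, and it assumes that only spaces or tabs sit between the leading ">" and the "<".

```r
# Sample input: the string from the question, plus a repeated FIELD1 header
# and a data line containing angle brackets (both added for illustration).
string <- paste0(
  ">  <FIELD1>\nfield text1\n\n",
  ">  <FIELD2>\nvalue with <brackets> inside\n\n",
  ">  <FIELD1>\nfield text1 again\n\n",
  ">  <FIELD3>field text3"
)

# (?m) lets ^ match at the start of every line, so only header lines qualify.
header_pattern <- "(?m)^>[ \\t]*<[^>]+>"
headers <- regmatches(string, gregexpr(header_pattern, string, perl = TRUE))[[1]]

# Strip the leading ">", the whitespace and the angle brackets, then de-duplicate.
fields <- unique(sub("^>[ \\t]*<([^>]+)>$", "\\1", headers, perl = TRUE))
print(fields)
# [1] "FIELD1" "FIELD2" "FIELD3"
```

The "<brackets>" inside the FIELD2 value is skipped because its line does not start with ">".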

Other tips

You can use gregexpr and regmatches to extract the substrings and unique to remove duplicates.

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
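
Note that \w+ only matches letters, digits and underscores, so a field name containing a space or punctuation would be skipped. A hedged variation of the same lookaround approach uses [^<>]+ instead; the field name "Molecular Weight" below is a hypothetical example, not taken from the question.

```r
# Hypothetical input: one field name contains a space, and FIELD1 is repeated.
string <- ">  <FIELD1>\nfield text1\n\n>  <Molecular Weight>\n46.07\n\n>  <FIELD1>\nagain"

# [^<>]+ accepts any run of characters other than angle brackets,
# so names with spaces or punctuation are captured as well.
matches <- regmatches(string, gregexpr("(?<=<)[^<>]+(?=>)", string, perl = TRUE))[[1]]
fields <- unique(matches)
print(fields)
# [1] "FIELD1"           "Molecular Weight"
```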
License: CC-BY-SA with attribution
Not affiliated with StackOverflow