Question

I am looking for a regular expression in R to extract the field names from an .sdf chemical data file. Each field name is enclosed in < > and follows a "> " at the start of a line. For example, given

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it would have to return

fields=c("FIELD1","FIELD2","FIELD3")

(They can occur multiple times, so I only need the unique() ones.) Any thoughts?

cheers, Tom


Solution

Try this. It extracts the portion of the string matched by the part of the regular expression surrounded by parentheses and then simplifies it using unique:

library(gsubfn)
# return only the captured group (the text between < and >), then drop duplicates
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"
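Note that the pattern above matches any <...> in the string, including one that happens to appear inside the field text. If that matters, a base R sketch that only looks at lines starting with ">" could look like the following. It assumes header lines begin with ">" followed by optional whitespace and then the <NAME>; the variable names lines, header and fields are just for illustration.

```r
string <- ">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

# split the record into individual lines
lines <- strsplit(string, "\n", fixed = TRUE)[[1]]

# keep only lines that start with ">" followed by a <NAME> tag
# (assumption: optional whitespace between ">" and "<")
header <- grepl("^>[[:space:]]*<[^>]+>", lines)

# pull out the text between < and >, then drop duplicates
fields <- unique(sub("^>[[:space:]]*<([^>]+)>.*$", "\\1", lines[header]))
fields
# [1] "FIELD1" "FIELD2" "FIELD3"
```

This needs no extra package, at the cost of being a bit longer than the strapplyc one-liner.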

REVISED minor simplification.

OTHER TIPS

You can use gregexpr and regmatches to extract the substrings and unique to remove duplicates.

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
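One caveat with \\w+: it only matches letters, digits and underscores, so a field name containing a space or a hyphen would be skipped. A possible variant, sketched here with made-up field names (the string below is hypothetical, not from the question), swaps \\w+ for [^>]+:

```r
# hypothetical input with a space and a hyphen in the field names
string2 <- ">  <FIELD 1>\ntext\n\n>  <FIELD-2>\ntext\n\n>  <FIELD 1>\nmore"

# \\w+ cannot cross the space or the hyphen, so nothing matches
strict <- unique(regmatches(string2, gregexpr("(?<=<)\\w+(?=>)", string2, perl = TRUE))[[1]])
strict
# character(0)

# [^>]+ accepts any character up to the closing ">"
loose <- unique(regmatches(string2, gregexpr("(?<=<)[^>]+(?=>)", string2, perl = TRUE))[[1]])
loose
# [1] "FIELD 1" "FIELD-2"
```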
Licensed under: CC-BY-SA with attribution