Regular expression to extract unique fields from .sdf file in R

https://stackoverflow.com/questions/21959946

15-10-2022

Question

I am looking for a regular expression in R to extract the field names from an .sdf chemical data file. Each field name is enclosed in < > and follows a "> " at the start of a line. For example, given

string=">  <FIELD1>\nfield text1\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD3>field text3"

it would have to return

fields=c("FIELD1","FIELD2","FIELD3")

(Field names can occur multiple times, so I only need the unique() ones.) Any thoughts?

cheers, Tom

Solution

Try strapplyc from the gsubfn package. It extracts the portion of each match captured by the parenthesized part of the regular expression and then simplifies the result using unique:

library(gsubfn)
strapplyc(string, "<([^>]*)>", simplify = unique)

giving:

[1] "FIELD1" "FIELD2" "FIELD3"
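
The pattern above matches any <...> in the string, including one that happens to appear inside a field's value. A minimal base R sketch that only accepts names on lines starting with ">", as described in the question, might look like the following. It assumes that only spaces or tabs sit between the ">" and the "<"; the sample string and the names headers and header_fields are made up for illustration.

```r
# Hypothetical sample: a repeated field and a "<...>" inside a value.
string <- ">  <FIELD1>\nvalue with <brackets>\n\n>  <FIELD2>\nfield text2\n\n>  <FIELD1>\nmore text"

# (?m) lets ^ match at the start of every line, not just the start of the string.
headers <- regmatches(string, gregexpr("(?m)^>[[:blank:]]*<[^>\\n]+>", string, perl = TRUE))[[1]]

# Keep only the text between the angle brackets, then drop duplicates.
header_fields <- unique(sub("^>[[:blank:]]*<([^>]+)>$", "\\1", headers, perl = TRUE))
header_fields
# [1] "FIELD1" "FIELD2"
```

The "<brackets>" in the value is ignored because its line does not start with ">".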


OTHER TIPS

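You can use gregexpr and regmatches to extract the substrings, and unique to remove duplicates. The lookbehind (?<=<) and lookahead (?=>) require the angle brackets without including them in the match, which is why perl = TRUE is needed.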

unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1" "FIELD2" "FIELD3"
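
One thing to watch with \\w+ is that it only matches letters, digits and underscores, so a field name containing a dot or a space would be skipped. A hedged variant, assuming such names can occur in your file, swaps in [^<>]+. The field name MELTING.POINT and the variable all_fields below are made up for illustration.

```r
# Hypothetical sample: the second field name contains a dot.
string <- ">  <FIELD1>\nfield text1\n\n>  <MELTING.POINT>\n179"

# \\w+ cannot match the dot, so the dotted name is skipped entirely.
unique(regmatches(string, gregexpr("(?<=<)\\w+(?=>)", string, perl = TRUE))[[1]])
# [1] "FIELD1"

# [^<>]+ accepts any run of characters other than angle brackets.
all_fields <- unique(regmatches(string, gregexpr("(?<=<)[^<>]+(?=>)", string, perl = TRUE))[[1]])
all_fields
# [1] "FIELD1"        "MELTING.POINT"
```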

Licensed under: CC-BY-SA with attribution
