Question

One for the regex enthusiasts. I have a vector of strings in the format:

<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" STYLE="font-size: 10px" size="10" COLOR="#FF0000" LETTERSPACING="0" KERNING="0">Desired output string containing any symbols</FONT></P></TEXTFORMAT>

I'm aware of the perils of parsing this sort of markup with regex. It would nevertheless be useful to know how to efficiently extract a sub-string from within a larger match, i.e. the text between the > and < of the FONT tag. The best I can do is:

require(stringr)
strng = str_extract(strng, "<FONT.*FONT>") # select font statement
strng = str_extract(strng, ">.*<")         # select inside tags
strng = str_extract(strng, "[^/</>]+")     # remove angle quote symbols

What would be the simplest formula to achieve this in R?

Solution

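Use str_match rather than str_extract (or str_match_all if a string can contain several matches). Wrap the part that you want to extract in parentheses; str_match returns the full match in the first column of its result matrix and each captured group in a following column.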

str_match(strng, "<FONT[^<>]*>([^<>]*)</FONT>")
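
As a minimal sketch of how the capture group can be pulled out of that result, assuming a small made-up input vector (the second element is a hypothetical non-matching string added for illustration), the second column of the matrix holds the text inside the FONT tag:

```r
library(stringr)

# Sample input: the string from the question plus a hypothetical
# element that does not contain a FONT tag at all.
strng <- c(
  '<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Verdana" STYLE="font-size: 10px" size="10" COLOR="#FF0000" LETTERSPACING="0" KERNING="0">Desired output string containing any symbols</FONT></P></TEXTFORMAT>',
  "no markup here"
)

# Column 1 is the whole match, column 2 is the first capture group.
m <- str_match(strng, "<FONT[^<>]*>([^<>]*)</FONT>")
inner <- m[, 2]

inner
# [1] "Desired output string containing any symbols"
# [2] NA
```

Elements that do not match come back as NA, so the result stays aligned with the input vector. Note that this pattern excludes < and > from the captured text, so it would not match if the content itself contained either character.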

Or parse the document and extract the contents that way.

library(XML)
doc <- htmlParse(strng)
fonts <- xpathSApply(doc, "//font")
sapply(fonts, function(x) as(xmlChildren(x)$text, "character"))

As agstudy mentioned, xpathSApply takes a function argument that makes things easier.

xpathSApply(doc, "//font", xmlValue)

Other suggestions

You can also do it with gsub, but I suspect there are many variations in your input vector that could cause this to break:

gsub( "^.*(?<=>)(.*)(?=</FONT>).*$" , "\\1" , x , perl = TRUE )
#[1] "Desired output string containing any symbols"

Explanation

  • ^.* - match any characters from the start of the string
  • (?<=>) - positive lookbehind, a zero-width assertion: the subsequent match only succeeds if it is preceded by this, i.e. a >
  • (.*) - then match any characters (this is now a numbered capture group)...
  • (?=</FONT>) - ...until you match "</FONT>"
  • .*$ - then match any characters to the end of the string

In the replacement, everything that was matched is replaced by the numbered capture group \\1. There is only one capture group, and it holds everything between > and </FONT>.
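
To make the fragility concrete, here is a small sketch using hypothetical inputs that are not from the original question. Because the leading ^.* is greedy, the capture starts after the last > that still has </FONT> somewhere after it, so a > inside the text truncates the result; and when nothing matches, gsub hands back the input unchanged rather than NA.

```r
pattern <- "^.*(?<=>)(.*)(?=</FONT>).*$"

# Hypothetical test strings, invented for illustration
x_ok   <- '<P><FONT SIZE="10">plain text</FONT></P>'
x_gt   <- '<P><FONT SIZE="10">a > b</FONT></P>'
x_none <- "no markup here"

res_ok   <- gsub(pattern, "\\1", x_ok,   perl = TRUE)
res_gt   <- gsub(pattern, "\\1", x_gt,   perl = TRUE)
res_none <- gsub(pattern, "\\1", x_none, perl = TRUE)

res_ok    # "plain text"
res_gt    # " b"  (everything up to and including the inner > is lost)
res_none  # "no markup here"  (no match, so the input is returned as-is)
```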

Use at your peril.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow