Question

I have many filenames that look like:

txt <- "MA0051_IRF2.xml"

I want to extract IRF2, the part between "_" and ".". How do I do this in R?


Solution

To achieve this, you need a regexp that

  • matches an (optional) arbitrary string in front of the _ : .*
  • matches a literal _ : [_]
  • matches everything up to (but not including) the next . and stores it in capturing group no. 1 : ([^.]+)
  • matches a literal . : [.]
  • matches an (optional) arbitrary string after the . : .*
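
To see what the capturing group holds before doing any replacement, the pattern can be inspected with base R's regexec() and regmatches(). This is only a small sketch for checking the pieces listed above; the variable names pattern and m are chosen here for illustration.

```r
# Pattern assembled from the pieces listed above
pattern <- ".*[_]([^.]+)[.].*"
txt <- "MA0051_IRF2.xml"

# regexec() records the position of the whole match and of each capturing group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(txt, regexec(pattern, txt))[[1]]

m[1]  # the whole match (here the entire filename)
m[2]  # the contents of capturing group no. 1
```

Because .* is greedy, the part in front of the _ extends to the last underscore that is still followed by a dot.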

In your call to gsub, you then

  • use the regular expression we built in the previous step
  • replace the whole string with the contents of the first capturing group: \\1 (inside an R string literal the backslash itself must be escaped, hence the double backslash)

Example:

gsub(".*[_]([^.]+)[.].*", "\\1", "MA0051_IRF2.xml")
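
Since the question mentions many filenames, here is a sketch of the same call applied to a whole character vector. The filenames other than the first are made up for illustration, and setting non-matching entries to NA is an assumption about what is wanted, not part of the original answer.

```r
# Hypothetical filenames; only the first one comes from the question
files <- c("MA0051_IRF2.xml", "MA0052_MEF2A.xml", "README")

pattern <- ".*[_]([^.]+)[.].*"

# sub() is vectorised, so one call handles every filename
ids <- sub(pattern, "\\1", files)

# Strings that do not match the pattern are returned unchanged,
# so mark them explicitly instead of silently keeping the input
matched <- grepl(pattern, files)
ids[!matched] <- NA_character_

ids
```

The pattern consumes the whole string, so sub() and gsub() give the same result here.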

OTHER TIPS

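Another possibility, using the stringr package (current stringr versions support lookarounds directly, so the old perl() wrapper is not needed):

library(stringr)
str_extract(txt, "(?<=_)(.+)(?=\\.)")
## "IRF2"

A shorter gsub pattern also works:

gsub(".*_(.*)\\..*", "\\1", txt)
## "IRF2"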

Here's a possible solution that doesn't require regex knowledge:

txt <- "MA0051_IRF2.xml"

library(qdap)
genXtract(txt, "_", ".")

## _  :  . 
##  "IRF2" 
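
If installing qdap is not an option, a similar result can be sketched with base R only. This is an alternative technique, not what genXtract does internally: it strips the file extension with tools::file_path_sans_ext() and then splits on the literal underscore. It assumes the wanted part is the last underscore-separated piece of the name; the names stem, parts and id are illustrative.

```r
txt <- "MA0051_IRF2.xml"

# Drop the extension: "MA0051_IRF2.xml" becomes "MA0051_IRF2"
stem <- tools::file_path_sans_ext(txt)

# Split on a literal underscore (fixed = TRUE means no regex is involved)
parts <- strsplit(stem, "_", fixed = TRUE)[[1]]

# Keep the last piece
id <- parts[length(parts)]
id
```

For names containing several dots this may differ from the regex answers, since only the final extension is removed.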
Licensed under: CC-BY-SA with attribution