Splitting filename text by underscores using R

Question 1

Here is an option using regexec and regmatches to pull out the patterns:

matches <- regmatches(df$a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", df$a))
# The first element of each match is the full matched string, so drop it with -1
df[c("match.1", "match.2")] <- t(sapply(matches, `[`, -1))

Produces:

                 a match.1 match.2
1 HELP_PLEASE_4_ME       4      ME
2  SOS_WOW_3_Y34OU       3   Y34OU

This will fail if any row lacks the expected structure, which I think is what you want: you are alerted that your data is not what you assumed. strsplit-based approaches need additional checks to give the same guarantee.

And the data:

df <- data.frame(a=c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"), stringsAsFactors=FALSE)
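
If you would rather flag malformed rows than have the assignment fail, one possible approach is to check the length of each match first. This is a minimal sketch, not part of the answer above: the third row ("BAD_ROW"), the names pattern, ok and parts, and the NA fill are all assumptions added for illustration. It relies on regmatches returning character(0) for a string that regexec could not match.

```r
# Hypothetical data: the third row deliberately lacks the expected structure
df <- data.frame(a = c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU", "BAD_ROW"),
                 stringsAsFactors = FALSE)

pattern <- "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"
matches <- regmatches(df$a, regexec(pattern, df$a))

# A successful match has 3 elements: the full match plus two capture groups
ok <- lengths(matches) == 3L

# Keep the two capture groups, or NA for rows that did not match
parts <- t(vapply(matches, function(m) {
  if (length(m) == 3L) m[-1] else rep(NA_character_, 2)
}, character(2)))

df[c("match.1", "match.2")] <- parts
df$a[!ok]  # rows that need attention
```

Whether to fill with NA or stop with an error depends on how strict you want to be; stopifnot(all(ok)) right after computing ok would keep the fail-fast behaviour while giving a clearer message.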

Question 2

The obligatory stringr version of @BrodieG's quite spiffy answer:

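library(stringr)

# each list element is a 1 x 3 matrix: the full match, then the two capture groups
df[c("match.1", "match.2")] <- 
  t(sapply(str_match_all(df$a, "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"), "[", 2:3))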

Put here for context only. You should accept BrodieG's answer.

Question 3

Since you already know that you want the text that comes after the second and third underscores, you could use strsplit and take the third and fourth results.

> x <- "HELP_PLEASE_4_ME"
> spl <- unlist(strsplit(x, "_"))[3:4]
> data.frame(string = x, under2 = spl[1], under3 = spl[2])
##             string under2 under3
## 1 HELP_PLEASE_4_ME      4     ME

Then for longer vectors, you could do something like the last two lines here.

## set up some data
> word1 <- c("HELLO", "GOODBYE", "HI", "BYE")
> word2 <- c("ONE", "TWO", "THREE", "FOUR")
> nums <- 20:23
> word3 <- c("ME", "YOU", "THEM", "US")
> XX <- paste0(word1, "_", word2, "_", nums, "_", word3)
> XX
## [1] "HELLO_ONE_20_ME"    "GOODBYE_TWO_21_YOU" 
## [3] "HI_THREE_22_THEM"   "BYE_FOUR_23_US"    
## ------------------------------------------------
## process it
> spl <- do.call(rbind, strsplit(XX, "_"))[, 3:4]
> data.frame(cbind(XX, spl))
##                   XX V2   V3
## 1    HELLO_ONE_20_ME 20   ME
## 2 GOODBYE_TWO_21_YOU 21  YOU
## 3   HI_THREE_22_THEM 22 THEM
## 4     BYE_FOUR_23_US 23   US
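
One caveat with the do.call(rbind, ...) step: it assumes every string splits into the same number of pieces. If one does not, rbind recycles the shorter vector to fill the row, so a malformed string can end up with plausible-looking but wrong values. A minimal sketch of a guard follows; the malformed fourth element and the names pieces, ok and res are assumptions added for illustration.

```r
# Hypothetical input: the last element is missing two fields
XX <- c("HELLO_ONE_20_ME", "GOODBYE_TWO_21_YOU", "HI_THREE_22_THEM", "BYE_FOUR")

pieces <- strsplit(XX, "_", fixed = TRUE)

# Expect exactly four pieces per string
ok <- lengths(pieces) == 4L

# Bind only the well-formed rows and keep the third and fourth pieces
spl <- do.call(rbind, pieces[ok])[, 3:4, drop = FALSE]
res <- data.frame(XX = XX[ok], num = spl[, 1], word = spl[, 2],
                  stringsAsFactors = FALSE)
res
XX[!ok]  # strings that did not split as expected
```

The drop = FALSE keeps spl a matrix even when only one string passes the check, so the column indexing that follows still works.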