Pergunta

In R I'd like to take a collection of file names in the format below and return the number to the right of the second underscore (this will always be a number) and the text string to the right of the third underscore (this will be combinations of letters and numbers).

I have file names in this format:

HELP_PLEASE_4_ME

I want to extract the number 4 and the text ME

I'd then like to create a new field within my data frame where these two types of data can be stored. Any suggestions?

Foi útil?

Solução

Here is an option using regexec and regmatches to pull out the patterns:

matches <- regmatches(df$a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", df$a))
df[c("match.1", "match.2")] <- t(sapply(matches, `[`, -1)) # first result for each match is full regular expression so need to drop that.

Produces:

                 a match.1 match.2
1 HELP_PLEASE_4_ME       4      ME
2  SOS_WOW_3_Y34OU       3   Y34OU

This will break if any rows don't have the expected structure, but I think that is what you want to happen (i.e. be alerted that your data is not what you think it is). strsplit based approaches will require additional checking to ensure that your data is what you think it is.

And the data:

df <- data.frame(a=c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"), stringsAsFactors=F)

Outras dicas

The obligatory stringr version of @BrodieG's quite spiffy answer:

df[c("match.1", "match.2")] <- 
  t(sapply(str_match_all(df$a, "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"), "[", 2:3))

Put here for context only. You should accept BrodieG's answer.

Since you already know that you want the text that comes after the second and third underscore, you could use strsplit and take the third and fourth result.

> x <- "HELP_PLEASE_4_ME"
> spl <- unlist(strsplit(x, "_"))[3:4]
> data.frame(string = x, under2 = spl[1], under3 = spl[2])
##             string under2 under3
## 1 HELP_PLEASE_4_ME      4     ME

Then for longer vectors, you could do something like the last two lines here.

## set up some data
> word1 <- c("HELLO", "GOODBYE", "HI", "BYE")
> word2 <- c("ONE", "TWO", "THREE", "FOUR")
> nums <- 20:23
> word3 <- c("ME", "YOU", "THEM", "US")
> XX <-paste0(word1, "_", word2, "_", nums, "_", word3)
> XX
## [1] "HELLO_ONE_20_ME"    "GOODBYE_TWO_21_YOU" 
## [3] "HI_THREE_22_THEM"   "BYE_FOUR_23_US"    
## ------------------------------------------------
## process it
> spl <- do.call(rbind, strsplit(XX, "_"))[, 3:4]
> data.frame(cbind(XX, spl))
##                   XX V2   V3
## 1    HELLO_ONE_20_ME 20   ME
## 2 GOODBYE_TWO_21_YOU 21  YOU
## 3   HI_THREE_22_THEM 22 THEM
## 4     BYE_FOUR_23_US 23   US
Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top