Question

I have a vector containing some names. I want to extract the title from every element, that is, everything between the ", " (a comma followed by a space) and the next ".".

> head(combi$Name)
[1] "Braund, Mr. Owen Harris"
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"
[4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
[5] "Allen, Mr. William Henry"
[6] "Moran, Mr. James"

I suppose gsub might be useful, but I am having difficulty finding the right regular expression for this.

Solution

1) sub. This can be done with sub and a capture group:

> sub(".*, ([^.]*)\\..*", "\\1", Name)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
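
The call above uses a vector named Name that the answer never defines. Here is a self-contained sketch, assuming Name is simply the six names printed in the question (head(combi$Name)), with the pattern annotated piece by piece:

```r
# Assumption: Name holds the six names shown in the question.
Name <- c("Braund, Mr. Owen Harris",
          "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
          "Heikkinen, Miss. Laina",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
          "Allen, Mr. William Henry",
          "Moran, Mr. James")

# Pattern, piece by piece:
#   .*,       everything up to and including the comma and the space after it
#   ([^.]*)   capture group 1: a run of characters that are not a period
#   \\..*     a literal period, then the rest of the string
# The replacement "\\1" keeps only what group 1 captured.
titles <- sub(".*, ([^.]*)\\..*", "\\1", Name)
titles
```

Because the pattern matches the whole string in one go, a single substitution per element is enough.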

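1a) gsub variation. This approach also works. It needs gsub rather than sub, because two separate pieces (the part before the title and the part after it) have to be deleted:

> gsub(".*, |\\..*", "", Name)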
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

2) strapplyc. Alternatively, using strapplyc in the gsubfn package, it can be done with a simpler regular expression:

> library(gsubfn)
>
> strapplyc(Name, ", ([^.]*)\\.", simplify = TRUE)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

2a) strapplyc variation. This one seems to have the simplest regular expression of them all: it extracts every word and keeps the second one.

> library(gsubfn)
>
> sapply(strapplyc(Name, "\\w+"), "[", 2)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  
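
One thing to keep in mind with (2a): taking the second word only works when the surname is a single word. The sketch below uses a made-up name with a two-word surname, and base R's regmatches/gregexpr as a stand-in for strapplyc so that no package is needed:

```r
# Hypothetical input: a two-word surname before the comma.
x <- "Vander Planke, Mrs. Julius"

# Base R stand-in for strapplyc(x, "\\w+"): all runs of word characters.
words <- regmatches(x, gregexpr("\\w+", x))
second_word <- sapply(words, "[", 2)

# Anchoring on ", " and "." as in (1) does not depend on word position.
by_delims <- sub(".*, ([^.]*)\\..*", "\\1", x)

second_word  # "Planke"
by_delims    # "Mrs"
```

If the real data may contain such surnames, one of the delimiter-based patterns is probably the safer choice.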

3) strsplit. A third way is to use strsplit, splitting each name on ", " or "." and keeping the second piece:

> sapply(strsplit(Name, ", |\\."), "[", 2)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

Edit: added additional solutions. In (1), changed gsub to sub (although gsub works there too).

OTHER TIPS

Not that anything is lacking from G. Grothendieck's answer; I just want to add a solution using sub and non-greedy repetition:

vec <- c("Moran, Mr. James",
         "Rothschild, Mrs. Martin (Elizabeth L. Barrett)")

sub(".*, (.+?)\\..*", "\\1", vec)
# [1] "Mr"  "Mrs"
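
To see why the non-greedy quantifier matters here, the following sketch compares it with the greedy form on the same two names. The greedy pattern is not from the answer; it is shown only for contrast. perl = TRUE is added so that both calls use the same (PCRE) engine, whereas the call above omits it:

```r
vec <- c("Moran, Mr. James",
         "Rothschild, Mrs. Martin (Elizabeth L. Barrett)")

# Greedy (.+) extends to the LAST period in the string.
greedy <- sub(".*, (.+)\\..*", "\\1", vec, perl = TRUE)

# Lazy (.+?) stops at the FIRST period after ", ".
lazy <- sub(".*, (.+?)\\..*", "\\1", vec, perl = TRUE)

greedy  # "Mr", "Mrs. Martin (Elizabeth L"
lazy    # "Mr", "Mrs"
```

The second name contains two periods, which is what makes the two versions disagree.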

Another alternative with regexpr, regmatches, and lookbehind/lookahead:

regmatches(vec, regexpr("(?<=, ).+?(?=\\.)", vec, perl = TRUE))
# [1] "Mr"  "Mrs"
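
A caveat when choosing between the two approaches: regmatches returns only the elements that matched, while sub returns non-matching strings unchanged. Below is a minimal sketch of keeping one result per input, with a made-up second element that has no title (vec2, hit and the other variable names are illustrative only):

```r
# Hypothetical input: the second element has no ", title." part.
vec2 <- c("Moran, Mr. James", "no title here")

m <- regexpr("(?<=, ).+?(?=\\.)", vec2, perl = TRUE)

# regmatches() returns only the matches, so this has length 1, not 2.
matched_only <- regmatches(vec2, m)

# Keep one result per input: NA where nothing matched (regexpr gives -1).
titles <- rep(NA_character_, length(vec2))
hit <- as.vector(m) != -1
titles[hit] <- matched_only

# sub() behaves differently: a non-matching string comes back unchanged.
via_sub <- sub(".*, (.+?)\\..*", "\\1", vec2)

titles   # "Mr", NA
via_sub  # "Mr", "no title here"
```

This matters when the result is assigned back to a column of the same data frame, since the lengths must agree.
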
Licensed under: CC-BY-SA with attribution