With data.table, extract the text between certain characters into a new column

https://stackoverflow.com/questions/23319398

10-07-2023

Question

I have a feeling this might be a simple question, but I've searched through SO for a bit now and, although I found many interesting related Q&As, I'm still stumped.

Here's what I need to learn (in all honesty, I'm playing with the Kaggle Titanic dataset, but I want to use data.table)...

Let's say you have the following data.table:

dt <- data.table(name=c("Johnston, Mr. Bob", "Stone, Mrs. Mary", "Hasberg, Mr. Jason"))

I want my output to be JUST the titles "Mr.", "Mrs.", and "Mr." -- heck we can leave out the period as well.

I've been playing around (all night) and discovered that using regular expressions might hold the answer, but I've only been able to get that to work on a single string, not with the whole data.table.

For example,

substr(dt$name[1], gregexpr(",.", dt$name[1]), gregexpr("[.]", dt$name[1]))

Returns:

[1] ", Mr."

Which is cool, and I can do some further processing to get rid of the ", " and ".", but the optimist(/optimizer) in me feels that that's ugly, gross, and inefficient.

Besides, even if I wanted to settle on that, (it pains me to admit) I don't know how to apply that in the j argument of data.table....

So, how do I add a column to dt called "Title", that contains:

[1] "Mr"
[2] "Mrs"
[3] "Mr"

I firmly believe that if I'm able to use regular expressions to select and extract data within a data.table that I will probably use this 100x a day. So thank you in advance for helping me figure out this pivotal technique.

PS. I'm an Excel refugee; in Excel I would just do this:

=mid(data, find(", ", data), find(".", data))

Solution

Umm.. I may have figured it out:

dt[, Title:=sub(".*?, (.*?)[.].*", "\\1", name)]

But I'm going to leave this here in case anyone else needs help, or perhaps there's an even better way of doing this!
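
For reference, here is a self-contained version of the line above. It is a minimal sketch, assuming the data.table package is installed, and it reuses the three sample names from the question; the comments spell out what each part of the pattern does.

```r
library(data.table)

dt <- data.table(name = c("Johnston, Mr. Bob", "Stone, Mrs. Mary", "Hasberg, Mr. Jason"))

# ".*?, "  skips everything up to and including the first ", "
# "(.*?)"  lazily captures the title (capture group 1)
# "[.].*"  matches the first literal period and the rest of the string
# "\\1"    replaces the whole string with the captured title
dt[, Title := sub(".*?, (.*?)[.].*", "\\1", name)]

print(dt$Title)
```

One thing to keep in mind: when the pattern does not match at all, sub() returns the input unchanged, so a name without ", " followed by a period would end up in Title as the full name.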

OTHER TIPS

You can use the stringr package

library(stringr)
str_extract(dt$name, "M.+\\.")

[1] "Mr."  "Mrs." "Mr."

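Different variations on the regular expression will let you extract other titles, like Dr., Master, or Reverend, which may also be of interest to you. Note that "M.+\\." starts matching at the first capital M in the string, so a surname beginning with M would be included in the match.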
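
As one possible variation, the sketch below anchors on the ", " before the title instead of on the letter M, so it is not tied to titles that start with M. The names are hypothetical, made up for illustration rather than taken from the dataset, and the lookbehind relies on the ICU regex engine that stringr uses.

```r
library(stringr)

# Hypothetical names, made up for illustration
x <- c("Moran, Mr. James", "Brown, Dr. Alice", "Smith, Master. Tom")

# (?<=, )  requires ", " immediately before the match (lookbehind, not captured)
# [^.]+    takes every character up to, but not including, the next period
titles <- str_extract(x, "(?<=, )[^.]+")

print(titles)
```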

To get all characters between "," and "." (inclusive) you can use

str_extract(dt$name, ",.+\\.")

and then remove the leading ", " (two characters) and the trailing "." from the result with str_sub (also from the stringr package).
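
A minimal sketch of that two-step approach, using the sample names from the question. Since the extracted string begins with a comma and a space, the sketch drops the first two characters and the last one; str_sub accepts negative positions, which count from the end of the string.

```r
library(stringr)

name <- c("Johnston, Mr. Bob", "Stone, Mrs. Mary", "Hasberg, Mr. Jason")

# Step 1: keep everything from the comma to the period, inclusive
with_delims <- str_extract(name, ",.+\\.")

# Step 2: start at character 3 (after ", ") and stop one before the end (before ".")
title <- str_sub(with_delims, 3, -2)

print(title)
```

Because ".+" is greedy, a name containing more than one period would be matched up to the last one; ",.+?\\." stops at the first period instead.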

But as I think about it more, I might use grepl to create indicator variables for all the different titles that are in the Titanic dataset. For example

dr_ind <- grepl("Dr|Doctor", dt$name)
titled_ind <- grepl("Count|Countess|Baron", dt$name)

etc.
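
Since the question is about data.table, the same indicators can also be added as columns in one step. This is a sketch under a few assumptions: the data.table package is installed, the third name is hypothetical and added only so that one indicator is TRUE, and the column names simply mirror the variables above.

```r
library(data.table)

# First two names are from the question; the third is a made-up example
dt <- data.table(name = c("Johnston, Mr. Bob", "Stone, Mrs. Mary", "Brown, Dr. Alice"))

# `:=` in its functional form adds several columns by reference at once
dt[, `:=`(
  dr_ind     = grepl("Dr|Doctor", name),
  titled_ind = grepl("Count|Countess|Baron", name)
)]

print(dt)
```

These patterns match anywhere in the string, so a surname such as "Drew" would also set dr_ind; a stricter pattern such as ", Dr\\." may be safer on real data.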

Licensed under: CC-BY-SA with attribution
