Question

Dear Stack Overflow users,

String modification comes up often, and I have read many posts about it, but I have not found a solution to my particular problem. I hope this post will also be useful to other R users who face a similar challenge, and I would appreciate help from anyone familiar with string manipulation in R.

I have been trying to modify a string like the following.

x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"

There are four individuals in this string. Family names are in capital letters. Three of the four family names run directly into the next person's first name (e.g., HELLNERJohan). I want to separate each family name from the following first name by inserting a space (e.g., HELLNER Johan).

I think I need a rule like "Find each run of uppercase letters that is followed by lowercase letters, and insert a space between the second-to-last and last uppercase letters of that run."

The following post is probably relevant, but I have not yet managed to write working code based on it.

Splitting String based on letters case

Thank you very much for your generous support.

Solution

This works by finding and capturing two consecutive sub-patterns: the first is a single uppercase letter (the end of a family name), and the second is an uppercase letter followed by a lowercase letter (taken to indicate the start of a first name). Wherever the two groups occur together, gsub() replaces them with themselves, with a space inserted between them (the "\\1 \\2" in the call below).

x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
# "Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON"
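The same idea can also be written without capture groups. The following is only a sketch of an alternative, not a correction of the call above: it matches the zero-width position between the two sub-patterns using lookaround assertions, so the replacement is just the space. The variable name spaced and the two-element vector in the second call are made up for illustration.

```r
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"

# Match the empty position that is preceded by an uppercase letter and
# followed by an uppercase-then-lowercase pair. Nothing is consumed,
# so the replacement is only the space itself.
# perl = TRUE is required: lookarounds need the PCRE engine.
spaced <- gsub("(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])", " ", x,
               perl = TRUE)
spaced
# "Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON"

# gsub() is vectorised, so a character vector of such strings works too.
# (This input is an illustrative vector built from pieces of x.)
gsub("(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])", " ",
     c("Marcus HELLNERJohan OLSSON", "Daniel RICHARDSSON"), perl = TRUE)
```

Both versions rest on the same assumption: a family name is entirely uppercase and every first name starts with one uppercase letter followed by a lowercase letter. Names that break this convention would need a different pattern.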

OTHER TIPS

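If you would rather get a character vector with one element per person, you can split the string with a regular expression that uses zero-width lookbehind and lookahead assertions. The split happens at the position between the last letter of a family name and the first letter of the next first name, so no characters are removed. perl = TRUE is needed because R's default regex engine does not support lookarounds.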

strsplit(x, split = "(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])", 
  perl = TRUE)[[1]]
# [1] "Marcus HELLNER"     "Johan OLSSON"       "Anders SOEDERGREN" 
# [4] "Daniel RICHARDSSON"
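
If you then need first and family names in separate columns, one possible follow-up is to split each element on the space. This is a rough sketch that assumes every name has exactly one first name and one family name separated by a single space; the names full_names, parts, and name_table are made up for illustration.

```r
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"

# One element per person, using the same split as above.
full_names <- strsplit(x, split = "(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])",
                       perl = TRUE)[[1]]

# Assumption: each element looks like "First FAMILY" with a single space.
parts <- strsplit(full_names, " ", fixed = TRUE)

# Pick the first and second piece of every name.
name_table <- data.frame(
  first  = vapply(parts, `[`, character(1), 1),
  family = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
name_table
```

Names with more than two parts (middle names, multi-word family names) would not fit this layout and would need extra handling.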
Licensed under: CC-BY-SA with attribution