Question

I would like to split each of a series of strings on the third white space from the right. The number of white spaces varies among strings, but each string has at least three white spaces. Here are two example strings.

strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

I would like:

[1] 'abca', 'eagh   ijkl mnop' 
[2] 'dd1 ss j,', 'll bb aa'

The closest I have been able to come is:

strsplit(strings, split = "(?<=\\S)(?=\\s(.*)\\s(.*)\\s(.*)$)", perl = TRUE)

which returns:

[[1]]
[1] "abca"         " eagh"        "   ijkl mnop"

[[2]]
[1] "dd1"       " ss"       " j,"       " ll bb aa"

I keep thinking the answer should be something like:

strsplit(strings, split = "(?<=\\S\\s(.*)\\s(.*)\\s(.*)$)(?=\\s(.*)\\s(.*)\\s(.*)$)", perl = TRUE)

However, that returns an error. Thank you for any advice. I would prefer a solution in base R, ideally one that uses regular expressions.


Solution

Try the expression:

(?=(?>\\s\\S*){3}$)\\s

Edit: Use this expression if you want consecutive whitespace characters to be treated as 'one' whitespace:

(?=(?>\\s+\\S*){3}$)\\s
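
To make the difference between the two expressions concrete, here is a small sketch that asks regexpr where each pattern first matches in the first example string. The variable names are only illustrative assumptions.

```r
s <- "abca eagh   ijkl mnop"

# Variant 1: every single whitespace character counts separately
single_ws <- "(?=(?>\\s\\S*){3}$)\\s"
# Variant 2: a run of whitespace counts as one separator
run_ws <- "(?=(?>\\s+\\S*){3}$)\\s"

# Character position (1-based) of the first match of each pattern
pos_single <- as.integer(regexpr(single_ws, s, perl = TRUE))
pos_run <- as.integer(regexpr(run_ws, s, perl = TRUE))

pos_single  # 11: the second space inside the run of three spaces
pos_run     # 5:  the space right after "abca"
```

Only the second variant picks the split point the question asks for, because the three spaces between "eagh" and "ijkl" are counted as one separator.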

It's worth noting that your expression most likely caused an error because most regex engines, PCRE (which R uses when perl = TRUE) among them, do not permit lookbehinds of unbounded width. In your pattern, the * quantifiers inside the lookbehind break that rule.
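
The lookbehind restriction can be checked with a short sketch. The patterns and variable names here are made up for illustration, and the exact error message depends on the PCRE version R was built against, so the code only records whether an error occurred.

```r
x <- "ab cd"

# Fixed-width lookbehind: exactly one whitespace character must precede the match
fixed_ok <- regmatches(x, regexpr("(?<=\\s)\\S+", x, perl = TRUE))
fixed_ok  # "cd"

# Lookbehind containing an unbounded quantifier (.*): expected to be rejected
# when the pattern is compiled, so the error is caught instead of stopping the script
variable_result <- tryCatch(
  suppressWarnings(grepl("(?<=\\s.*)\\S+", x, perl = TRUE)),
  error = function(e) "error"
)
variable_result
```

A lookahead has no such restriction, which is why the suggested expressions put the variable-length part inside (?= ... ).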

Got it! Sorry, I wasn't 100% sure how the strsplit function worked. Try this:

strsplit(strings, split = "(?=(?>\\s+\\S*){3}$)\\s", perl = TRUE)

Here is an example output:

> strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')
> strsplit(strings, split = "(?=(?>\\s+\\S*){3}$)\\s", perl = TRUE)
[[1]]
[1] "abca"             "eagh   ijkl mnop"

[[2]]
[1] "dd1 ss j," "ll bb aa" 
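
If you would rather get the two halves as columns than as a list, the same pattern can be combined with regexpr and substr. This is only a sketch: split_third_ws is a hypothetical helper name, and it assumes, as the question states, that every string contains at least three whitespace runs.

```r
strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

# Hypothetical helper: cut each string at the whitespace character that
# starts the third whitespace run from the right.
# Assumes every element of x has at least three whitespace runs.
split_third_ws <- function(x) {
  pat <- "(?=(?>\\s+\\S*){3}$)\\s"
  m <- regexpr(pat, x, perl = TRUE)
  pos <- as.integer(m)             # start of the matched whitespace character
  len <- attr(m, "match.length")   # 1 here, since only \\s is consumed
  data.frame(
    left = substr(x, 1, pos - 1),
    right = substr(x, pos + len, nchar(x)),
    stringsAsFactors = FALSE
  )
}

parts <- split_third_ws(strings)
parts
```

The lookahead does not consume anything, so only the single \\s after it is removed from the result.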

OTHER TIPS

How about using the following regex: (\S*\s*\S*\s*\S*\s*)(.*)? See http://regex101.com/r/lI7aA9
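
The capture-group idea can also be tried in base R with regexec and regmatches. Note that the sketch below is an adapted variant rather than the regex above: it anchors the last three tokens at the end of the string, and it assumes the strings have no trailing whitespace. The names pat, m and halves are illustrative.

```r
strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

# Group 1: everything before the separator; group 2: the last three tokens
pat <- "^(.*\\S)\\s+(\\S+\\s+\\S+\\s+\\S+)$"

# regexec returns the full match plus one entry per capture group
m <- regmatches(strings, regexec(pat, strings, perl = TRUE))

# Drop the full match and keep the two captured halves
halves <- lapply(m, function(g) g[2:3])
halves
```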

Licensed under: CC-BY-SA with attribution