Why does strsplit use positive lookahead and lookbehind assertion matches differently?

Question 1

I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:

The algorithm applied to each input string is
repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.

The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:

FF <- "(?=funky)"
testString <- "take me to funky town"

gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE

strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"

What happens is that the lonely lookahead (?=funky) matches at position 12. So the first split includes the string up to position 11 (left of the match), and it is removed from the string, together with the match, which -however- has zero length.

Now the remaining string is funky town, and the lookahead matches at position 1. However there's nothing to remove, because there's nothing at the left of the match, and the match itself has zero length. So the algorithm is stuck in an infinite loop. Apparently R resolves this by splitting a single character, which incidentally is the documented behaviour when strspliting with an empty regex (when argument split=""). After this the remaining string is unky town, which is returned as the last split since there's no match.

Lookbehinds are no problem, because each match is split and removed from the remaining string, so the algorithm is never stuck.

Admittedly this behaviour looks weird at first glance. Behaving otherwise however would violate the assumption of zero length for lookaheads. Given that the strsplit algorithm is documented, I belive this does not meet the definition of a bug.

Question 2

Based on Theodore Lytras' careful explication of substr()'s behavior, a reasonably clean workaround is to prefix the to-be-matched lookahead assertion with a positive lookbehind assertion that matches any single character:

testString <- "take me to funky town"
FF2 <- "(?<=.)(?=funky)"
strsplit(testString, FF2, perl=TRUE)
# [[1]]
# [1] "take me to " "funky town"

Question 3

Looks like a bug to me. This doesn't appear to just be related to spaces, specifically, but rather any lonely lookahead (positive or negative):

FF <- "(?=funky)"
testString <- "take me to funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f"           "unky town"  

FF <- "(?=funky)"
testString <- "funky take me to funky funky town"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "f"                "unky take me to " "f"                "unky "           
# [5] "f"                "unky town"       


FF <- "(?!y)"
testString <- "xxxyxxxxxxx"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "xxx"       "y"       "xxxxxxx"

Seems to work fine if given something to capture along with the zero-width assertion, such as:

FF <- " (?=XX )"
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    "XX text"

FF <- "(?= XX ) "
testString <- "text XX text"
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "text"    "XX text"

Perhaps something like that might function as a workaround.