Question

I would like to split each element of a character vector text into sentences. There is more than one splitting pattern ("and/ERT", "/$"). There are also exceptions to these patterns (":/$.", "and/ERT then", "./$. Smiley").

My attempt: match the places where a split should occur, insert an unusual marker ("^&*") at each of them, then strsplit on that marker.

Problem: I don't know how to handle the exceptions properly. In those explicit cases the marker ("^&*") should be removed and the original text restored before running strsplit.

Code:

text <- c("This are faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"This are the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!",
"Like above the same faulty propositions one and/ERT two ,/$, which I want to split ./$. There are cases where I explicitly want and/ERT some where I don't want to split ./$. For example :/$. when there is an and/ERT then I don't want to split ./$. This is also one case where I dont't want to split ./$. Smiley !/$. Thank you ./$!")

patternSplit <- c("and/ERT", "/\\$") # The class of split cases is much larger than in this example, so it is not possible to address them all explicitly.
patternSplit <- paste("(", paste(patternSplit, collapse = "|"), ")", sep = "")

exceptionsSplit <- c("\\:/\\$\\.", "and/ERT then", "\\./\\$\\. Smiley")
exceptionsSplit <- paste("(", paste(exceptionsSplit, collapse = "|"), ")", sep = "")

# Without exceptions this works. Unfortunately it splits a token such as "./$." into "." and "/$.". It would be convenient to avoid this; see the "ideal" split below.
textsplitted <- strsplit(gsub(patternSplit, "^&*\\1", text), "^&*", fixed = TRUE)

# Ideal split:
> textsplitted
[[1]]
 [1] "This are faulty propositions one and/ERT" 
 [2] "two ,/$," 
 [3] "which I want to split ./$."
 [4] "There are cases where I explicitly want and/ERT" 
 [5] "some where I don't want to split ./$." 
 [6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
 [7] "This is also one case where I dont't want to split ./$. Smiley !/$." 
 [8] "Thank you ./$!"

[[2]]
 [1] "This are the same faulty propositions one and/ERT"
 [2] "two ,/$,"
#...      

# This attempt doesn't work!
text <- gsub(patternSplit, "^&*\\1", text)
# Placeholder: the replacement should be the matched text with the "^&*" marker removed
text <- gsub(exceptionsSplit, '[original text without "^&*"]', text)
textsplitted <- strsplit(text, "^&*", fixed = TRUE)

Solution

I think you can use the expression below to get the splits you want. Because strsplit consumes the characters it splits on, you have to split on the space that follows each split pattern (unless an exception applies), which is what the desired output in your question shows:

strsplit( text[[1]] , "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"  , perl = TRUE )
#[[1]]
#[1] "This are faulty propositions one and/ERT"                                 
#[2] "two ,/$,"                                                                 
#[3] "which I want to split ./$."                                               
#[4] "There are cases where I explicitly want and/ERT"                          
#[5] "some where I don't want to split ./$."                                    
#[6] "For example :/$. when there is an and/ERT then I don't want to split ./$."
#[7] "This is also one case where I dont't want to split ./$. Smiley !/$."      
#[8] "Thank you ./$!" 
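
The call above processes only the first element of text. Since strsplit is vectorised over its first argument, the same pattern can be applied to a whole vector. The following is a minimal sketch under stated assumptions: demo is a shortened, made-up vector in the same tagged format, and the names splitRegex and demoSplit are introduced here purely for illustration.

```r
# Hypothetical two-element vector in the same tagged format as `text`
demo <- c("one and/ERT two ,/$, three ./$. Smiley !/$. Thank you ./$!",
          "For example :/$. an and/ERT then no split ./$. End ./$.")

# Same pattern as in the call above, stored in a variable (name is an assumption)
splitRegex <- "(?<=and/ERT)\\s(?!then)|(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"

# strsplit returns a list with one character vector per input string
demoSplit <- strsplit(demo, splitRegex, perl = TRUE)

# Number of pieces per input string
lengths(demoSplit)
# [1] 4 2
```

With the original data, strsplit(text, splitRegex, perl = TRUE) should give one list element per string in the same way.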

Explanation

  • (?<=and/ERT)\\s - split on a space, \\s, that IS preceded, (?<=...), by "and/ERT"
  • (?!then) - BUT only if that space is NOT followed, (?!...), by "then"
  • | - OR operator to chain the next expression
  • (?<=/\\$[[:punct:]]) - positive look-behind assertion for "/$" followed by any punctuation character
  • (?<!:/\\$[[:punct:]])\\s(?!Smiley) - match a space that is NOT preceded, (?<!...), by ":/$" plus a punctuation character (though, per the previous point, it IS preceded by "/$" plus a punctuation character) and is NOT followed, (?!...), by "Smiley"
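
The two alternatives described above can also be checked separately on small inputs. This is only an illustrative sketch: the test strings and the names rule1, rule2, hits1, hits2, fullRegex and pieces are assumptions made up for this example, not part of the original question.

```r
# Rule 1: a space preceded by "and/ERT" and not followed by "then"
rule1 <- "(?<=and/ERT)\\s(?!then)"
hits1 <- grepl(rule1, c("a and/ERT b", "a and/ERT then b"), perl = TRUE)
hits1
# [1]  TRUE FALSE

# Rule 2: a space preceded by "/$" plus punctuation, but not by ":/$" plus
# punctuation, and not followed by "Smiley"
rule2 <- "(?<=/\\$[[:punct:]])(?<!:/\\$[[:punct:]])\\s(?!Smiley)"
hits2 <- grepl(rule2, c("x ./$. y", "x :/$. y", "x ./$. Smiley"), perl = TRUE)
hits2
# [1]  TRUE FALSE FALSE

# Joining both rules with "|" reproduces the full split pattern
fullRegex <- paste(rule1, rule2, sep = "|")
pieces <- strsplit("a and/ERT b ./$. c", fullRegex, perl = TRUE)[[1]]
pieces
# [1] "a and/ERT" "b ./$."    "c"
```

Keeping the rules in separate strings like this may make it easier to add further split patterns or exceptions, at the cost of one more paste step.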
Licensed under: CC-BY-SA with attribution