Question

Please help me with my small project.

I have a big list of text elements. Each element should be split into a small list of sentences. Each small list should be saved as one element in a new column of the initial big list, at the same position ('row') as the original text element.

The splitting criteria are "/$", "und/KON", and "oder/KON". Each matched expression should be kept at the head of the following small-list element.

I've tried regular expressions like "/$|und/KON|oder/KON" and many combinations of escaping "$", "|", and "/". I also tried changing the parameters perl = TRUE and fixed = TRUE/FALSE. Every time I try, nothing happens. It seems that the | is not interpreted properly. What do you recommend to solve the problem?

library(stringr) # don't know if it's required

# Input list to be split at each of
#      "/$", "und/KON", "oder/KON"
#      but should keep the expression at the start of the next list element
#      
#      Would be nice but not necessary: The small-list to be named after the ID in the first column

r <- list(ID=c(01, 02, 03),
          elements=c("This should become my first small-list :/$. the first element ,/$, the second element ,/$, and the third element ./$.",
                     "This should become my second small-list :/$. Element eins und/KON Element zwei oder/KON Element drei ./$.",
                     "This should become my third small-list :/$. Element Alpha und/KON Element Beta oder/KON Element Gamma ./$."))

# Would look something like this (my attempt, which does not split as intended)
r$small_lists <- sapply(r$elements, function(x) as.list(strsplit(x, "/$|und/KON|oder/KON", fixed=TRUE)))
> r$small_lists

$01
[1] "This should become my first small-list "
[2] ":/$. the first element "
[3] ",/$, the second element "
[4] ",/$, and the third element "
[5] "./$."

$02 
[1] "This should become my second small-list "
[2] ":/$. Element eins "
[3] "und/KON Element zwei "
[4] "oder/KON Element drei"
[5] "./$."

$03
[1] "This should become my third small-list "
[2] ":/$. Element Alpha "
[3] "und/KON Element Beta "
[4] "oder/KON Element Gamma "
[5] "./$."

> class(r)
[1] "list"
> class(r$small_lists)
[1] "list"

Solution

Judging by your desired output, you actually have more patterns to split on than you list: the splits happen before ":/$", ",/$", and "./$", not just before "/$". Note that my patterns differ from yours, and that the regex special characters in them ($ and .) have been escaped with \\.
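To see why the original attempt appears to do nothing, here is a minimal sketch on a shortened string (the variable x is made up for illustration). With fixed = TRUE the whole pattern, including the |, is taken literally, and without escaping, $ acts as an end-of-string anchor rather than a dollar sign.

```r
# Shortened, hypothetical input for illustration only
x <- "eins und/KON zwei oder/KON drei ./$."

# fixed = TRUE: the entire pattern is one literal string, so nothing matches
a <- strsplit(x, "/$|und/KON|oder/KON", fixed = TRUE)[[1]]
a
# [1] "eins und/KON zwei oder/KON drei ./$."

# Regex, but "$" unescaped: "/$" means "a slash at the end of the string"
b <- strsplit(x, "/$|und/KON|oder/KON")[[1]]
b
# [1] "eins "      " zwei "     " drei ./$."

# Regex with "\\$": the literal "/$" now matches, but the delimiters are dropped
d <- strsplit(x, "/\\$|und/KON|oder/KON")[[1]]
d
# [1] "eins "   " zwei "  " drei ." "."
```

Even the escaped version discards the matched text, which is why the approach below marks the split points first instead of splitting on the patterns directly.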

To keep things manageable, I would create a separate vector of the patterns that you want to split on, paste them together into one master pattern, search for that pattern and prefix each match with a marker string that you know doesn't occur in your text, and then split on the marker.

Here are the "patterns" that I've identified:

Pattern <- c(":/\\$", ",/\\$", "\\./\\$",
             "und/KON", "oder/KON")

We can paste these patterns together to get the master pattern. The collapse argument of the inner paste is the pipe symbol, which is the regex "or" for matching alternative patterns. The whole pattern is wrapped in parentheses, ( and ), so that it forms a capture group we can reference later.

Pattern <- paste("(", paste(Pattern, collapse = "|"), ")", sep = "")
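As a quick check, this small sketch rebuilds the master pattern from scratch and prints it, so you can see the regular expression that the regex engine actually receives (cat shows the string with single backslashes). The name master is only used here for illustration.

```r
# Rebuild the master pattern exactly as described above
parts  <- c(":/\\$", ",/\\$", "\\./\\$", "und/KON", "oder/KON")
master <- paste("(", paste(parts, collapse = "|"), ")", sep = "")

# Print the raw regular expression
cat(master, "\n")
# (:/\$|,/\$|\./\$|und/KON|oder/KON)

# Sanity check: it matches text containing a delimiter, and nothing else
grepl(master, c("Element eins und/KON Element zwei", "no delimiter here"))
# [1]  TRUE FALSE
```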

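We can now use gsub to insert a marker in front of every match. In the replacement string, \\1 refers to the text captured by the parentheses, so each match is replaced by the marker followed by the match itself. We need this step because you want to retain the matched expression, whereas strsplit on its own discards whatever it splits on.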

## Insert some text pattern you know doesn't occur in your text
## Here, I've prepended the matched patterns with "^&*"
## You now have something on which you can split
strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)
# [[1]]
# [1] "This should become my first small-list "
# [2] ":/$. the first element "                
# [3] ",/$, the second element "               
# [4] ",/$, and the third element "            
# [5] "./$."                                   
# 
# [[2]]
# [1] "This should become my second small-list "
# [2] ":/$. Element eins "                      
# [3] "und/KON Element zwei "                   
# [4] "oder/KON Element drei "                  
# [5] "./$."                                    
# 
# [[3]]
# [1] "This should become my third small-list "
# [2] ":/$. Element Alpha "                    
# [3] "und/KON Element Beta "                  
# [4] "oder/KON Element Gamma "                
# [5] "./$." 
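To make the two steps easier to follow, here is a sketch that runs them separately on a single shortened string (x and marked are names I made up for this illustration), showing the intermediate text with the markers inserted before it is split.

```r
# Shortened, hypothetical input for illustration only
x <- "Element eins und/KON Element zwei ./$."
Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"

# Step 1: put the marker "^&*" in front of every match (\\1 re-inserts the match)
marked <- gsub(Pattern, "^&*\\1", x)
marked
# [1] "Element eins ^&*und/KON Element zwei ^&*./$."

# Step 2: split on the literal marker; fixed = TRUE stops "^" and "*" acting as regex
pieces <- strsplit(marked, "^&*", fixed = TRUE)[[1]]
pieces
# [1] "Element eins "         "und/KON Element zwei " "./$."
```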

Continuing from above, to get a named list in which the first piece of each element is used as the name and the remaining pieces as the content:

out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)
setNames(lapply(out, `[`, -1), lapply(out, `[`, 1))
# $`This should become my first small-list `
# [1] ":/$. the first element "    
# [2] ",/$, the second element "   
# [3] ",/$, and the third element "
# [4] "./$."                       
# 
# $`This should become my second small-list `
# [1] ":/$. Element eins "    
# [2] "und/KON Element zwei " 
# [3] "oder/KON Element drei "
# [4] "./$."                  
# 
# $`This should become my third small-list `
# [1] ":/$. Element Alpha "    
# [2] "und/KON Element Beta "  
# [3] "oder/KON Element Gamma "
# [4] "./$." 
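The question also asked for the small lists to be named after the ID column and stored back in r. One possible sketch, assuming the IDs should be zero-padded to two digits as in the desired output (the sprintf format is my assumption, not something stated in the question):

```r
r <- list(ID = c(1, 2, 3),
          elements = c("This should become my first small-list :/$. the first element ,/$, the second element ,/$, and the third element ./$.",
                       "This should become my second small-list :/$. Element eins und/KON Element zwei oder/KON Element drei ./$.",
                       "This should become my third small-list :/$. Element Alpha und/KON Element Beta oder/KON Element Gamma ./$."))

Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"
out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)

# Assumption: name each small list by its ID, padded to two digits ("01", "02", ...)
r$small_lists <- setNames(out, sprintf("%02d", as.integer(r$ID)))

names(r$small_lists)
# [1] "01" "02" "03"
r$small_lists[["02"]]
# [1] "This should become my second small-list "
# [2] ":/$. Element eins "
# [3] "und/KON Element zwei "
# [4] "oder/KON Element drei "
# [5] "./$."
```

This keeps all five pieces per element, including the leading text; drop the first piece with lapply(out, `[`, -1) before naming if you only want the parts that start with a delimiter.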
Licensed under: CC-BY-SA with attribution