Question

I have a character vector

string <- "First line\nSecond line\nthird line\n\nFourth line\nFifth line"

which was created from the poem

1 First line
2 Second line
3 Third line

4 Fourth line
5 Fifth line

I want to extract the part of the string that runs from the 3rd to the 5th line (verse) of the poem. Blank lines do not count as lines. Every line except the first is preceded by \n or \n\n. I don't know the content of the lines, and I don't know how many blank lines (\n\n) occur between the 3rd and the 5th line. The result I want is

substring <- "third line\n\nFourth line\nFifth line"

which can then be rendered as

3 Third line

4 Fourth line
5 Fifth line

Solution

Use strsplit to split the string into groups at each run of two or more \n (that is, at the blank lines). Then remove everything up to and including the last \n in the first group, which leaves only its last line, and paste that together with the second group:

groups <- strsplit(string, "\n\n+")[[1]]
paste(sub(".*\n", "", groups[1]), groups[2], sep = "\n\n")

giving:

[1] "third line\n\nFourth line\nFifth line"
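
To see why this works, it may help to look at the intermediate values. The following is only an illustrative sketch on the string from the question; the name last_line is introduced here for readability and is not part of the answer above.

```r
string <- "First line\nSecond line\nthird line\n\nFourth line\nFifth line"

# Split on runs of two or more newlines, i.e. at the blank line
groups <- strsplit(string, "\n\n+")[[1]]
groups
# [1] "First line\nSecond line\nthird line" "Fourth line\nFifth line"

# The greedy ".*\n" consumes everything up to the last newline of the
# first group, so only its final line survives
last_line <- sub(".*\n", "", groups[1])
last_line
# [1] "third line"
```

Note that this relies on the blank line falling directly after the 3rd line and on the 5th line being the end of the second group, as in the example string.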

Note that the above always puts two \n between the last line of the first group and the first line of the second group, even if there were more originally. If it's important to preserve the number of \n, extract the separators into seps and choose the first one that has more than one character. Use that in the final paste:

seps <- strsplit(string, "[^\n]+")[[1]]
sep <- seps[nchar(seps) > 1][1] # 1st multiple \n separator

groups <- strsplit(string, "\n\n+")[[1]]
paste(sub(".*\n", "", groups[1]), groups[2], sep = sep)
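
The difference between the two variants only shows up when the gap is longer than one blank line. A small sketch, assuming a made-up input string3 with three consecutive \n (the names string3, fixed and kept are hypothetical and used only for this illustration):

```r
# Hypothetical input: two blank lines (three "\n") after the third line
string3 <- "First line\nSecond line\nthird line\n\n\nFourth line\nFifth line"

groups <- strsplit(string3, "\n\n+")[[1]]

# Variant 1: always joins with exactly two newlines
fixed <- paste(sub(".*\n", "", groups[1]), groups[2], sep = "\n\n")

# Variant 2: reuses the first multi-newline separator found in the input
seps <- strsplit(string3, "[^\n]+")[[1]]
sep <- seps[nchar(seps) > 1][1]
kept <- paste(sub(".*\n", "", groups[1]), groups[2], sep = sep)

fixed
# [1] "third line\n\nFourth line\nFifth line"
kept
# [1] "third line\n\n\nFourth line\nFifth line"
```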

REVISED Added note and improved slightly.

OTHER TIPS

OK, I added some more tests and starred the lines I think should be included:

1:-----  
    First line
    Second line
    third line (*)
    <blank>
    Fourth line (*)
    Fifth line (*)
2:-----
    <blank>
    <blank>
    aaaa
    bbbbb
    ccccc (*)
    dddddd (*)
    eeeeee (*)
    ffffff
    <blank>
3:-----
    11111
    <blank>
    222222
    <blank>
    333333 (*)
    <blank>
    4444444 (*)
    <blank>
    555555 (*)

If that's the case, then I think this should find them all:

tests<-c(
    "First line\nSecond line\nthird line\n\nFourth line\nFifth line",
    "\n\naaaa\nbbbbb\nccccc\ndddddd\neeeeee\nffffff\n",
    "11111\n\n222222\n\n333333\n\n4444444\n\n555555"
)
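# skip leading blank lines and two non-blank lines, capture the next three
# lines with the newlines between them, and drop whatever follows
gsub("^\\n*[^\\n]+\\n+[^\\n]+\\n+([^\\n]+\\n+[^\\n]+\\n+[^\\n]+)[\\s\\S]*", "\\1", tests, perl = TRUE)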
#[1] "third line\n\nFourth line\nFifth line"
#[2] "ccccc\ndddddd\neeeeee"     
#[3] "333333\n\n4444444\n\n555555" 
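
The pattern above hard-codes "skip two lines, keep three". If other ranges are needed, the same pattern could be assembled programmatically. This is only a sketch under that assumption; line_range_pattern is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: build a PCRE pattern that skips (from - 1) non-blank
# lines and captures lines from..to together with the newlines between them.
line_range_pattern <- function(from, to) {
  line <- "[^\\n]+"
  skip <- strrep(paste0(line, "\\n+"), from - 1)
  keep <- paste(rep(line, to - from + 1), collapse = "\\n+")
  paste0("^\\n*", skip, "(", keep, ")[\\s\\S]*")
}

tests <- c(
    "First line\nSecond line\nthird line\n\nFourth line\nFifth line",
    "\n\naaaa\nbbbbb\nccccc\ndddddd\neeeeee\nffffff\n",
    "11111\n\n222222\n\n333333\n\n4444444\n\n555555"
)

# from = 3, to = 5 reproduces the pattern used above
result <- sub(line_range_pattern(3, 5), "\\1", tests, perl = TRUE)
```

If an input has fewer than `to` non-blank lines, the pattern does not match and sub returns that input unchanged, so the caller may want to check for that case.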

You can use gsub to remove everything up to the end of the second line, which leaves the third line through the end of the string.

> gsub('^.*Second line\n', '', string)
[1] "third line\n\nFourth line\nFifth line"

Or use strsplit in the same manner.

> strsplit(string, '^.*Second line\n')[[1]][2]
[1] "third line\n\nFourth line\nFifth line"
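
Both one-liners above depend on "." also matching \n, which is how R's default regex engine behaves. With perl = TRUE the dot stops at a newline, so the same pattern would need the inline (?s) flag. A short sketch of the difference (the variable names are made up for this illustration):

```r
string <- "First line\nSecond line\nthird line\n\nFourth line\nFifth line"

# Default engine: "." also matches "\n", so ".*" can span several lines
default_res <- sub("^.*Second line\n", "", string)

# PCRE: "." does not match "\n", so the same pattern finds no match
# and the input comes back unchanged
pcre_plain <- sub("^.*Second line\n", "", string, perl = TRUE)

# PCRE with dot-all mode switched on via the inline flag (?s)
pcre_dotall <- sub("(?s)^.*Second line\n", "", string, perl = TRUE)
```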

Alternatively, readLines may do the trick:

> x <- readLines(textConnection(string))
> gg <- grep('third|fifth', x, ignore.case = TRUE)
> x[gg[1]:gg[2]]
[1] "third line"  ""            "Fourth line" "Fifth line"  
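
Since the question says the content of the lines is not known, the same line-based idea can be driven by position instead of by grep. A minimal sketch, assuming blank lines are exactly empty strings; extract_lines is a hypothetical helper name, and strsplit is used here in place of readLines only to avoid opening a connection.

```r
# Hypothetical helper: keep non-blank lines `from`..`to` plus any blank
# lines that fall between them.
extract_lines <- function(string, from, to) {
  x <- strsplit(string, "\n", fixed = TRUE)[[1]]
  idx <- which(nzchar(x))   # positions of the non-blank lines
  paste(x[idx[from]:idx[to]], collapse = "\n")
}

tests <- c(
    "First line\nSecond line\nthird line\n\nFourth line\nFifth line",
    "\n\naaaa\nbbbbb\nccccc\ndddddd\neeeeee\nffffff\n",
    "11111\n\n222222\n\n333333\n\n4444444\n\n555555"
)

result <- vapply(tests, extract_lines, character(1), from = 3, to = 5,
                 USE.NAMES = FALSE)
```

This version fails with an error when the input has fewer than `to` non-blank lines, because idx[to] is then NA; a length check on idx would be needed for untrusted input.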
Licensed under: CC-BY-SA with attribution