In R, erase sections of a string using data frame data

https://stackoverflow.com/questions/20916546

24-09-2022

Question

I have a data frame that looks like:

'data.frame':   81 obs. of  2 variables:
$ start: int  232 10697 10965 12279 15647 16897 17033 17612 17719 17983 ...
$ end  : int  243 10702 10970 12284 15652 16902 17038 17617 17724 17988 ...

I have a string that has content I want to erase at those start/end offset pairs. So, whatever is between byte offsets 232 and 243, I want to "erase" and collapse the space. I figured out that I want to process the string backwards, so that when I modify it near the end, the offsets closer to the beginning are still valid. The code I have so far is:

for (i in nrow(cutpoints):1) {
   row = cutpoints[i,]
   substr(sc, row$start, row$end) <- " "
}

But when I print out sc afterwards, it's only removed the first character of every substring that I wanted removed. Does anyone have any idea as to what I'm doing wrong? Furthermore, can this be vectorized?

UPDATE - I tried using stringr's str_sub:

> hw <- "Hadley Wickham"
> cuts <- data.frame(start=c(1,8), end=c(6,14))
> str_sub(hw, rev(cuts$start), rev(cuts$end)) <- " "
> hw
[1] "Hadley  "  "  Wickham"

So, clearly I don't understand what I'm doing with string processing in R.

Answer
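
As background on why the loop in the question only blanks one character per range: base R's `substr<-` overwrites characters in place and never changes the length of the string, so a one-character replacement touches only the first character of the range. A minimal sketch of that behaviour, using a made-up string `sc` rather than the asker's data:

```r
# Toy string standing in for the asker's data (an assumption for illustration).
sc <- "abcdefghij"
len_before <- nchar(sc)

# Ask to overwrite positions 3..6 ("cdef") with a single space.
substr(sc, 3, 6) <- " "

sc
# [1] "ab defghij"   only position 3 changed; "def" is still there

nchar(sc) == len_before
# [1] TRUE           the string was not shortened
```

Because the string cannot shrink this way, erasing text means building a new string from the parts you want to keep.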

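It is easier to work with ranges to keep than with ranges to cut. You can derive the keep ranges by flipping the starts and ends and shifting by one: each kept range begins one character after a cut ends and stops one character before the next cut starts, with 1 and nchar(hw) closing off the two ends: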

hw <- "Hadley WickhamPLUSENDING"
cuts <- data.frame(start=c(1,8), end=c(6,14))
keeps <- data.frame(start=c(1, cuts$end+1), end=c(cuts$start-1, nchar(hw)))
keeps
#   start end
# 1     1   0
# 2     7   7
# 3    15  24

A range that starts after it ends (such as row 1 above, from 1 to 0) simply yields no characters from substr, so such rows need no special handling.
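
A quick check of that claim, as a small sketch using the first two rows of `keeps` from above:

```r
hw <- "Hadley WickhamPLUSENDING"

# Row 1 of keeps: start = 1, end = 0, so the range is empty.
empty_piece <- substr(hw, 1, 0)
empty_piece
# [1] ""

# Row 2 of keeps: start = 7, end = 7, the single space between the two cuts.
space_piece <- substr(hw, 7, 7)
space_piece
# [1] " "
```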

You can use apply to extract the piece between each start/end pair, one row of keeps at a time, and then paste the pieces back together:

pieces <- apply(keeps, 1, function(x) substr(hw, x[1], x[2]))
pieces
# [1] ""           " "          "PLUSENDING"
paste(pieces, collapse="")
# [1] " PLUSENDING"
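
If you would rather avoid the row-wise apply, base R's `substring()` accepts vectors of start and end positions directly. The sketch below wraps the same keep-range idea in a helper; `erase_ranges` is a hypothetical name chosen here, not an existing function, and it assumes the cut ranges are 1-based, inclusive, sorted by start, and non-overlapping:

```r
# erase_ranges() is a hypothetical helper, not part of base R or stringr.
# Assumes: start/end are 1-based inclusive, sorted by start, non-overlapping.
erase_ranges <- function(x, start, end) {
  # Turn the ranges to cut into the ranges to keep.
  keep_start <- c(1, end + 1)
  keep_end   <- c(start - 1, nchar(x))
  # substring() is vectorized over its position arguments; empty ranges give "".
  paste(substring(x, keep_start, keep_end), collapse = "")
}

hw <- "Hadley WickhamPLUSENDING"
cuts <- data.frame(start = c(1, 8), end = c(6, 14))

result <- erase_ranges(hw, cuts$start, cuts$end)
result
# [1] " PLUSENDING"
```

With the asker's data this would presumably be called as erase_ranges(sc, cutpoints$start, cutpoints$end), provided cutpoints meets the sorting and non-overlap assumptions above.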

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow