Use strsplit starting at end of string

https://stackoverflow.com/questions/22565721

r
strsplit

18-06-2023

Question

I've been using code that splits the names of individual samples, changes part of each name, and then pastes the pieces back together. The code works well when all names are the same length (i.e., names are 8 characters long and the split always falls after the first 4 characters), but it breaks when the names have different lengths.

Essentially, individual names are 7 or 8 characters. The last 4 characters are what's important.
Example with 8 characters: Samp003A
Example with 7 characters: Sam003A

Is there a way to continue using strsplit to separate my names, but start from the end of the string rather than the beginning, to keep the last 4 characters (003A)?

Current code:

RowList <- as.list(rownames(df1))
RowListRes <- strsplit(as.character(RowList), "(?<=.{4})", perl = TRUE)
RowListRes.df <- do.call(rbind, RowListRes)
RowListRes.df[,1] <- "LY3D"
dfnames <- apply(RowListRes.df, 1, paste, collapse="")
rownames(df1) <- dfnames

It's line 2 that I'm trying hard to edit, so that I can split according to the last 4 characters.

Any help would be greatly appreciated!

Solution

It looks like you're a bit mixed up about how to use look-around assertions. The pattern you're using, "(?<=.{4})", is a look-behind assertion that says "find me every position between characters that is preceded by four characters of any kind". It only looks backwards from the start of the string, which is not what you want here.

The pattern you actually want, "(?=.{4}$)", is a look-ahead assertion that finds the single inter-character space that is followed by four characters of any kind followed by the end of the string.

There is, unfortunately, an unpleasant twist. For reasons discussed in the answers to this question, strsplit() interacts oddly with look-ahead assertions; as a result, the pattern you'll actually need is "(?<=.)(?=.{4}$)". Here's what that looks like in action:

x <- c("Samp003A", "Sam003A")
strsplit(x, split="(?<=.)(?=.{4}$)", perl=TRUE)
# [[1]]
# [1] "Samp" "003A"
# 
# [[2]]
# [1] "Sam"  "003A"
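
To see why the extra "(?<=.)" is needed, here is a small side-by-side sketch of the look-ahead on its own versus the combined pattern. The variable names naive and fixed are just for illustration. As far as I understand it, after the first split the bare look-ahead matches again at the very start of what remains, and strsplit() then peels off a single character, so you end up with more than two pieces.

```r
x <- c("Samp003A", "Sam003A")

# Look-ahead only: also matches at position 0 of the remaining "003A",
# where strsplit() splits off one character, giving extra pieces.
naive <- strsplit(x, split = "(?=.{4}$)", perl = TRUE)

# Look-behind for one character plus the look-ahead: position 0 can no
# longer match, so each name is cut exactly once.
fixed <- strsplit(x, split = "(?<=.)(?=.{4}$)", perl = TRUE)

# Compare the number of pieces produced for each name
print(lengths(naive))
print(lengths(fixed))
```

The combined pattern should report two pieces per name, while the bare look-ahead should report more than two.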

If all you really want are the final four characters of each entry, maybe just use substr(), like this:

x <- c("Samp003A", "Sam003A")
substr(x, start=nchar(x)-3, stop=nchar(x))
# [1] "003A" "003A"
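
If the end goal is the renaming from the question (swap whatever comes before the last four characters for "LY3D"), something along these lines might do it without any splitting. This is only a sketch: old_names is a hypothetical stand-in for rownames(df1), and the second sample name is made up so that the results stay unique.

```r
# Hypothetical stand-in for rownames(df1); the second name is invented
old_names <- c("Samp003A", "Sam004B")

# Keep the last four characters of each name
suffix <- substr(old_names, start = nchar(old_names) - 3, stop = nchar(old_names))

# Prepend the new prefix used in the question
new_names <- paste0("LY3D", suffix)
print(new_names)
# [1] "LY3D003A" "LY3D004B"

# One-step alternative: replace everything before the last four characters
new_names_sub <- sub("^.*(?=.{4}$)", "LY3D", old_names, perl = TRUE)
print(identical(new_names, new_names_sub))
# [1] TRUE
```

You would then assign the result back with rownames(df1) <- new_names. Keep in mind that a data frame rejects duplicate row names, so this only works if the last four characters are unique across samples.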

Other tips

Wouldn't a substring from the end be simpler?

stringr::str_sub(as.character(RowList), -4)

or stringr::str_sub(as.character(RowList), -4, -2) to get just the numbers?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow