Question

Let's say I have the following string:

s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

I would like to recover the substring that follows each "=" up to the next ";" (or the end of the string) to get the following output:

[1] "MIMAT0027618"  "MIMAT0027618"  "hsa-miR-6859-5p"  "MI0022705"

Can I use strsplit() with more than one split element?


Solution

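1) strsplit with matrix. strsplit() takes a regular expression, so a character class such as [;=] splits on either delimiter. The pieces then alternate key, value, so the values are the second row of a two-row matrix: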

> matrix(strsplit(s, "[;=]")[[1]], 2)[2,]
[1] "MIMAT0027618"    "MIMAT0027618"    "hsa-miR-6859-5p" "MI0022705"   
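To see why the second row holds the values, it may help to look at the intermediate pieces. This is a minimal sketch reusing s from the question; the names parts, m, keys, and values are only illustrative:

```r
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

# Splitting on either delimiter yields alternating keys and values.
parts <- strsplit(s, "[;=]")[[1]]

# matrix() fills column by column, so with two rows each column is one key/value pair.
m <- matrix(parts, nrow = 2)

keys   <- m[1, ]  # the field names: ID, Alias, Name, Derives_from
values <- m[2, ]  # the four values requested in the question
```

This assumes every field has exactly one key and one value, with no extra "=" or ";" inside a value; otherwise the alternation no longer lines up.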

2) strsplit with gsub. Or remove each key and its equal sign with gsub, then split on the semicolons:

> strsplit(gsub("[^=;]+=", "", s), ";")[[1]]
[1] "MIMAT0027618"    "MIMAT0027618"    "hsa-miR-6859-5p" "MI0022705"     

3) strsplit with sub. Or split on the semicolons first, then use sub to strip everything up to the equal sign from each piece:

> sub(".*=", "", strsplit(s, ";")[[1]])
[1] "MIMAT0027618"    "MIMAT0027618"    "hsa-miR-6859-5p" "MI0022705"   

4) strapplyc. Or use strapplyc from the gsubfn package, which here extracts the consecutive non-semicolons after each equal sign:

> library(gsubfn)
> strapplyc(s, "=([^;]+)", simplify = unlist)
[1] "MIMAT0027618"    "MIMAT0027618"    "hsa-miR-6859-5p" "MI0022705"  

ADDED additional strsplit solutions.

OTHER TIPS

I know this is an old question, but I found the usage of lookaround regular expressions quite elegant for this problem:

library(stringr)
your_string <- '/this/file/name.txt'
result <- str_extract(string = your_string, pattern = "(?<=/)[^/]*(?=\\.)")
result

In words,

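  1. The (?<=...) part is a lookbehind: it requires that the desired string be preceded by ... (in this case a forward slash), without including it in the match.
  2. The [^/]* then matches as many consecutive characters as possible that are not a forward slash (in this case it ends up as name, because of the next step).
  3. The (?=...) part is a lookahead: it requires that the desired string be followed by ... (in this case a literal period, which needs to be escaped as \\. in an R string), so the extension is left out and the result is name.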
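The three steps above can also be checked without stringr. This is a sketch in base R, assuming perl = TRUE (which is what enables lookarounds in base regular expressions); the name hit is only illustrative:

```r
your_string <- "/this/file/name.txt"

# perl = TRUE enables lookbehind and lookahead in base R regular expressions.
hit <- regexpr("(?<=/)[^/]*(?=\\.)", your_string, perl = TRUE)

# regmatches() returns the matched text; the slash and the period are not part of it.
result <- regmatches(your_string, hit)
result
# [1] "name"
```

The earlier path components (this, file) are skipped because neither is directly followed by a period.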

This also works on dataframes:

library(dplyr)
strings <- c('/this/file/name1.txt', '/this/other/file/name2.csv')
df <- as.data.frame(strings) %>% 
  mutate(name = str_extract(string = strings, pattern = "(?<=/)[^/]*(?=\\.)"))
# Optional
names <- df %>% pull(name)

Or, in your case, to pull out a single field (here Alias):

your_string <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705" 
result <- str_extract(string = your_string, pattern = "(?<=;Alias=)[^;]*(?=;)") 
result # Outputs 'MIMAT0027618'
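The pattern above targets one field and needs a ";" on both sides, so it would miss the first and last fields. To get every value, as the question asks, a lookbehind on "=" alone may be enough. This is a sketch in base R under that assumption; the names hits and values are only illustrative:

```r
s <- "ID=MIMAT0027618;Alias=MIMAT0027618;Name=hsa-miR-6859-5p;Derives_from=MI0022705"

# (?<==) requires an equal sign just before the match;
# [^;]+ then runs up to the next semicolon or the end of the string.
hits <- gregexpr("(?<==)[^;]+", s, perl = TRUE)

# gregexpr() finds all matches; regmatches() returns them as a list with one element per input string.
values <- regmatches(s, hits)[[1]]
values
# [1] "MIMAT0027618"    "MIMAT0027618"    "hsa-miR-6859-5p" "MI0022705"
```

With stringr loaded, str_extract_all(s, "(?<==)[^;]+")[[1]] should give the same result.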
Licensed under: CC-BY-SA with attribution