Question

I am trying to subset data using the names of the work and test sets:

ws_data <- subset(data, grepl(paste0("v*[0-9]_",ws_names, collapse="|" ),
           rownames(data))==TRUE)

It seems to work OK, but rows with names like

"(Difluoromethoxy)trifluoromethane"

are just skipped. Are parentheses allowed in names in R? How can I solve this problem without changing the row names? Thanks in advance!

Example of the data

64 | v0064_(Chloro)(trifluor)omethane | -51.5 | 510.9 | 104.5 | 11.2 |
65 | v0067_(Dichloro)difluoromethane | -81.0 | 233.0 | 121.0 | 16.1 |

Regular expressions

rownames(ts)[1]
[1] "Bromotrifluoromethane"

rownames(data)[1]
[1] "v0001_Bromotrifluoromethane"

grepl("v[0-9]*_Bromotrifluoromethane", rownames(data)[1])
[1] TRUE

grepl("v*[0-9]_Bromotrifluoromethane", rownames(data)[1])
[1] TRUE

Solution 2

I'm guessing the problem you're facing is that parentheses have a special meaning in regular expressions (they delimit groups), so they are not matched literally unless escaped. This post has a cure for that, which you can use to do something like this:

quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)

data[grepl(paste0("^v[0-9]*_", quotemeta(ws_names), collapse="|"), rownames(data)), ]
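
To see why the unescaped pattern skips those rows, here is a small sketch on made-up data. The data frame `data`, its `bp` column, and the `ws_names` vector are hypothetical stand-ins modelled on the example rows in the question, not the asker's real data.

```r
# Hypothetical stand-in data modelled on the example rows in the question
data <- data.frame(
  bp = c(-57.8, -51.5, -81.0),
  row.names = c("v0001_Bromotrifluoromethane",
                "v0064_(Chloro)(trifluor)omethane",
                "v0067_(Dichloro)difluoromethane")
)
ws_names <- c("Bromotrifluoromethane", "(Chloro)(trifluor)omethane")

# Escape every character that is not a letter, digit or underscore
quotemeta <- function(x) gsub("([^A-Za-z_0-9])", "\\\\\\1", x)

# Unescaped: "(Chloro)" is read as a group, so the pattern looks for
# "Chlorotrifluoromethane" with no literal parentheses in it
raw_hits <- grepl(paste0("^v[0-9]*_", ws_names, collapse = "|"),
                  rownames(data))

# Escaped: "\(" and "\)" match the parentheses literally
esc_hits <- grepl(paste0("^v[0-9]*_", quotemeta(ws_names), collapse = "|"),
                  rownames(data))

# drop = FALSE keeps a data frame even though there is only one column
ws_data <- data[esc_hits, , drop = FALSE]
```

With this toy input, `raw_hits` picks up only the row without parentheses, while `esc_hits` also picks up the `(Chloro)(trifluor)omethane` row. Note that the pattern is anchored only at the start, so a name that is a prefix of a longer row name would still match; appending `"$"` after the escaped name would make the match exact.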

OTHER TIPS

In general, names and rownames can contain characters like that; you just need to quote them when using them. I think the problem here is the subset function: it allows some unusual ways to specify the subset, which makes some things easier but others harder. It tries to figure out what you mean by the rownames (rather than just taking them as literal strings), and the parentheses are probably confusing that process.

Try something like:

data[ grepl( paste0("v*[0-9]_",ws_names, collapse="|" ), rownames(data)), ]

You may also be able to simplify this using %in% if you can construct the list of names.
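
One way the %in% idea might look, as a rough sketch: strip the numeric prefix from each row name and compare what is left as plain strings, so no regular-expression escaping is needed for the compound names. The vectors `row_ids` and `ws_names` below are hypothetical stand-ins, and the sketch assumes every row name has the form `v<digits>_<name>`.

```r
# Hypothetical stand-ins for rownames(data) and the working-set names
row_ids <- c("v0001_Bromotrifluoromethane",
             "v0064_(Chloro)(trifluor)omethane",
             "v0067_(Dichloro)difluoromethane")
ws_names <- c("Bromotrifluoromethane", "(Chloro)(trifluor)omethane")

# Remove the "v<digits>_" prefix; only the prefix is treated as a regex
compound <- sub("^v[0-9]+_", "", row_ids)

# Plain string comparison: parentheses have no special meaning here
keep <- compound %in% ws_names
```

On a real data frame the same logical vector could be used as `data[keep, ]`. Unlike the grepl approach, this requires the whole name to match exactly.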

Also see fortune(69): the ==TRUE is redundant and slightly less useful than adding 0 or multiplying by 1.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow