Question

I have a character matrix filled with values that follow these general formats: A/-, A/B, I/A, /, A/, /A, -/B, A/B/C, A/-/C.

I need to clean this data set so all that remains are values that follow the format A/B, in other words, two single characters separated by a forward slash. Anything that contains a -, I, multiple forward slashes, a single forward slash with no letters, or a single forward slash with only one letter must be replaced with blanks "".

I have tried numerous iterations of gsub() to replace any values not fitting the proper format with "".

This is the closest I have found that makes sense to me, but it only gets rid of values containing -, I, multiple forward slashes, and a single forward slash (no surrounding letters). Data that remain are in the format A/B (the one I want to keep), A/, /B (the other ones that need to be replaced):

    data.matrix = as.matrix(data)
    data.matrix.clean = gsub("/./|^/.|./$|^/$|-|I", "", data.matrix)

Perhaps I should write this differently without separating each of my independent criteria with a |? From what I've read, the ^ is to signify the beginning of a string and the $ is to signify the end of a string. It seems to work in the ^/$ case, but not in the ^/. or ./$ case and I'm not sure why.

After each attempt, I check which formats the slash-containing values are in, using this code, which seems to work fine:

    slash = grep("/", data.matrix.clean)
    slash.t = data.matrix.clean[slash]
    table(slash.t)

Any help in better understanding symbols that can be used within gsub() to make this work properly would be greatly appreciated.

Thank you!


Solution 2

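gsub() replaces only the part of each string that the pattern matches, so every alternative has to match the entire string. You need the quantifier * (any number of), as in .*, to replace the whole string: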

    data.matrix <- matrix(c("A/-", "A/B", "I/A", "/", "A/", "/A", 
                            "-/B", "A/B/C", "A/-/C"), ncol = 3)

         [,1]  [,2] [,3]   
    [1,] "A/-" "/"  "-/B"  
    [2,] "A/B" "A/" "A/B/C"
    [3,] "I/A" "/A" "A/-/C"

    sub(".*/.*/.*|^/.*|.*/$|^/$|.*-.*|.*I.*", "", data.matrix)

         [,1]  [,2] [,3]
    [1,] ""    ""   ""  
    [2,] "A/B" ""   ""  
    [3,] ""    ""   ""  
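To see why the original gsub() attempt left values such as A/ behind, it may help to run both patterns side by side. This is only a sketch: the vector name `x` and the results `partial` and `whole` are illustrative names, not part of the question.

```r
# Illustrative input; `x` is a made-up name for this sketch.
x <- c("A/-", "-/B", "I/A", "A/B/C", "A/B")

# Original pattern: only the matched piece is deleted. In "A/-" the only
# match is the "-", so the element becomes "A/" and is not checked again.
partial <- gsub("/./|^/.|./$|^/$|-|I", "", x)
print(partial)

# Padding each alternative with .* makes the match span the whole element,
# so any non-conforming value is replaced by "" in one step.
whole <- sub(".*/.*/.*|^/.*|.*/$|^/$|.*-.*|.*I.*", "", x)
print(whole)
```

Comparing `partial` with `whole` shows that the alternatives themselves were fine; what was missing was a match that covers the full string.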

Other tips

Just use grepl() to find the conforming values and blank out the rest:

    conforming = grepl('^(?!I)\\w/(?!I)\\w$', data.matrix, perl = TRUE)
    data.matrix[! conforming] = ""

Literally, this reads:

The string starts with a single word character (except I), followed by a slash and another single word character (except I), and ends there. Note that \w also matches digits and the underscore.
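If the valid values are always single uppercase letters other than I, the same whitelist idea can be written with a plain character class instead of lookaheads. This is a minimal sketch under that assumption (the question does not state the alphabet); `m`, `ok`, and `cleaned` are hypothetical names.

```r
# Sample data copied from the answer above; `m` is a stand-in name.
m <- matrix(c("A/-", "A/B", "I/A", "/", "A/", "/A",
              "-/B", "A/B/C", "A/-/C"), ncol = 3)

# [A-HJ-Z] is any uppercase letter except I; ^ and $ force the whole
# element to be exactly letter, slash, letter.
# perl = TRUE selects PCRE, as in the grepl() call above.
ok <- grepl("^[A-HJ-Z]/[A-HJ-Z]$", m, perl = TRUE)

# Blank out everything that does not conform, keeping the matrix shape.
cleaned <- m
cleaned[!ok] <- ""
print(cleaned)
```

Matching what should be kept, rather than listing everything that should be removed, tends to be easier to extend when new malformed values turn up.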

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow