Question

I have found similar problems to this here: Count the number of words in a string in R? and here Faster way to split a string and count characters using R? but I can't get either to work in my example. I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:

[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-] 
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]

I am splitting these entries out into their individual elements to get the following (i.e., for the first entry):

hg19    2   224840068   224840089   -

But in the case of the fourth entry, I would like to parse this into two separate locations, i.e.

[hg19:16:67000244-67000248,67000628-67000647:+]

becomes

hg19    16  67000244    67000248    +
hg19    16  67000628    67000647    +

(with all the associated data in the adjacent columns filled in from the original)

An easy way for me to identify which rows need this action is simply to count the rows containing commas (','), since commas don't appear in any other text in any other column, except where there are multiple genomic locations for a feature. However, I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.

testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)

(or)

testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)

    table(testdat$multiple)
    1 
    4 

Using the example I have posted above, I would expect the output to be

testdat$multiple
0
0
0
1

Actually doing

grep -c

on the same data in the command line shows I have 10 entries containing ','.


So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data. What I had actually intended to do was stick to something I know on the command line: grep out the rows containing ',', duplicate the file, split and awk the selected columns (first and second location in their respective files), then cat and sort them. If there is a niftier way to do this in R, I would love a pointer.


Solution

gregexpr does in fact return an object of length 1 for every string, which is why length() reports 1 for each entry. If you want to find the rows which have a match versus the ones which don't, you need to look at the returned match positions, not their length: a match failure is reported as -1. (Note that as.logical() will not distinguish the cases, since as.logical(-1) is TRUE.) Count only the positive positions instead:

    testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates),
                               function(m) sum(m > 0))

This gives the expected 0/0/0/1, and testdat$multiple > 0 flags the rows with a comma.
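To cover the second part of the question in R as well, here is a minimal sketch. It builds a toy data frame from the four example entries (the column names genome_coordinates and feature are assumptions, with feature standing in for the adjacent columns), flags the multi-location rows, and then expands every entry into one row per start-end range, carrying the adjacent data along:

```r
# Toy version of the data frame; 'feature' stands in for the adjacent columns.
testdat <- data.frame(
  genome_coordinates = c("[hg19:2:224840068-224840089:-]",
                         "[hg19:17:37092945-37092969:-]",
                         "[hg19:20:3904018-3904040:+]",
                         "[hg19:16:67000244-67000248,67000628-67000647:+]"),
  feature = c("f1", "f2", "f3", "f4"),
  stringsAsFactors = FALSE)

# Count commas per entry: gregexpr() returns -1 when there is no match,
# so count the positive match positions rather than taking length().
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates),
                           function(m) sum(m > 0))

# Expand each entry into one row per start-end range.
expanded <- do.call(rbind, lapply(seq_len(nrow(testdat)), function(i) {
  x      <- gsub("\\[|\\]", "", testdat$genome_coordinates[i])   # strip [ ]
  parts  <- strsplit(x, ":", fixed = TRUE)[[1]]  # genome, chrom, ranges, strand
  ranges <- strsplit(parts[3], ",", fixed = TRUE)[[1]]
  se     <- do.call(rbind, strsplit(ranges, "-", fixed = TRUE))
  data.frame(genome = parts[1], chrom = parts[2],
             start = as.numeric(se[, 1]), end = as.numeric(se[, 2]),
             strand = parts[4], feature = testdat$feature[i],
             stringsAsFactors = FALSE)
}))
```

For the fourth entry this produces two rows (67000244-67000248 and 67000628-67000647) with the adjacent feature value repeated on both, so no round-trip through grep/awk on the command line is needed.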

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow