Question

I have a really huge data frame with two character columns. Each value is a string of IDs separated by ";". I want to count the number of IDs that the two columns have in common in each row. Here is an example:

   id.x                  id.y
1  123;145;156       143;156;234;165
2  134;156;187;675   132;145;156;187

In this case, the first row has one common value and the second row has two.

The table has 60M records, and some of the strings may be more than 1000 long. I tried writing the data to a text file and doing the analysis in Python, but the file is 30 GB. Any ideas on how to do this in R (regex, apply, ...)?

I can find the common values for a single row with this command:

intersect(strsplit(df[1,"id.x"], split=";")[[1]], strsplit(df[1,"id.y"], split=";")[[1]])

Therefore, I wrote a function:

myfun <- function(x,y) {
   length(intersect(strsplit(x, split=";")[[1]], strsplit(y, split=";")[[1]]))
}

which works when I call it on a single row. But when I use it with mapply as below, it prints the first column as well, and I only want the numbers in the output:

> mapply(FUN=myfun, df[1:2,]$id.x, df[1:2,]$id.y)
123;145;156 134;156;187;675 
          1               2

So why does it print the first column as well? What is wrong with my command?

Solution

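mapply returns an integer vector with a names attribute. By default (USE.NAMES=TRUE), when the first argument is a character vector, mapply uses that vector as the names of the result: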

y <- mapply(myfun, df$id.x, df$id.y)
str(y)
Named int [1:2] 1 2
- attr(*, "names")= chr [1:2] "123;145;156" "134;156;187;675"

Drop them with USE.NAMES=FALSE:

mapply(myfun, df$id.x, df$id.y, USE.NAMES=FALSE)
[1] 1 2
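
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two-row example from the question and compares the named and unnamed results. The data frame construction (including stringsAsFactors = FALSE) and the variable names named and plain are my own assumptions. unname() is shown only as an alternative way to strip the names after the fact.

```r
# Two-row example from the question; stringsAsFactors = FALSE keeps the
# columns as character vectors (an assumption about how df was built).
df <- data.frame(
  id.x = c("123;145;156", "134;156;187;675"),
  id.y = c("143;156;234;165", "132;145;156;187"),
  stringsAsFactors = FALSE
)

# Function from the question: number of IDs shared by two ";"-separated strings.
myfun <- function(x, y) {
  length(intersect(strsplit(x, split = ";")[[1]], strsplit(y, split = ";")[[1]]))
}

# Default call: the first character argument becomes the names of the result.
named <- mapply(myfun, df$id.x, df$id.y)

# Same call without names.
plain <- mapply(myfun, df$id.x, df$id.y, USE.NAMES = FALSE)

print(names(named))                      # the id.x strings
print(plain)                             # the bare counts
print(identical(unname(named), plain))   # unname() gives the same bare vector
```

The names are only an attribute on the result, so the counts themselves are the same either way.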

Then index into the data frame and time the call on larger and larger subsets of the data:

system.time(y <- mapply(myfun, df[1:1e5,]$id.x, df[1:1e5,]$id.y, USE.NAMES=FALSE))
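
If the timings grow too quickly, one possible variation (not benchmarked here) is to split each column once with a vectorized strsplit call and to work through the rows in chunks, so that only one chunk of split strings is in memory at a time. The sketch below assumes character columns. The function names count_common and count_common_chunked and the default chunk_size are hypothetical, and fixed = TRUE is used because ";" is a literal separator, not a regular expression.

```r
# Count shared IDs for two equal-length character vectors.
# strsplit() is vectorized, so each column is split in one call.
count_common <- function(x, y) {
  xs <- strsplit(x, ";", fixed = TRUE)
  ys <- strsplit(y, ";", fixed = TRUE)
  mapply(function(a, b) length(intersect(a, b)), xs, ys, USE.NAMES = FALSE)
}

# Hypothetical wrapper: process the rows chunk_size at a time.
count_common_chunked <- function(x, y, chunk_size = 1e5) {
  n <- length(x)
  if (n == 0L) return(integer(0))   # nothing to do for empty input
  out <- integer(n)
  for (s in seq(1, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1, n)
    out[s:e] <- count_common(x[s:e], y[s:e])
  }
  out
}

# Usage on the two rows from the question.
x <- c("123;145;156", "134;156;187;675")
y <- c("143;156;234;165", "132;145;156;187")
res <- count_common_chunked(x, y, chunk_size = 1)
print(res)
```

With the real data this would be called as count_common_chunked(df$id.x, df$id.y), and it can be wrapped in system.time() in the same way as above to see how it scales.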
Licensed under: CC-BY-SA with attribution