I have a really huge data frame with two character columns. Each value is a list of IDs separated by ";". I want to count, for each row, the number of IDs that the two columns have in common. Here is an example:
id.x id.y
1 123;145;156 143;156;234;165
2 134;156;187;675 132;145;156;187
In this case, the first row has one common value (156), and the second row has two common values (156 and 187).
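For reproducibility, the example above can be built as follows. This is only a sketch: it assumes both columns are plain character vectors rather than factors, and it uses `df` as the data frame name, as in the commands further down.

```r
# Example data from the table above.
# stringsAsFactors = FALSE is an assumption: it keeps both columns as character.
df <- data.frame(
  id.x = c("123;145;156", "134;156;187;675"),
  id.y = c("143;156;234;165", "132;145;156;187"),
  stringsAsFactors = FALSE
)

str(df)
```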
The table has 60M rows, and some of the strings are more than 1000 characters long. I tried writing the data to a text file and doing this analysis in Python, but the file is 30GB. Any ideas for doing this in R (regex, apply, ...)?
I can find the common values for a single row with this command:
intersect(strsplit(df[1,"id.x"], split=";")[[1]], strsplit(df[1,"id.y"], split=";")[[1]])
Therefore, I wrote a function:
myfun <- function(x,y) {
  length(intersect(strsplit(x, split=";")[[1]], strsplit(y, split=";")[[1]]))
}
This works when I call it on a single pair of strings, but when I use it with mapply as below, the output also shows the id.x strings, whereas I only want the counts:
> mapply(FUN=myfun, df[1:2,]$id.x, df[1:2,]$id.y)
    123;145;156 134;156;187;675 
              1               2 
So why does it print the first column as well? What is wrong with my command?
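A minimal sketch of what may be going on, assuming the columns are character vectors: `mapply` has a `USE.NAMES` argument that defaults to `TRUE`, and when the first vector passed to it is an unnamed character vector, those strings are used as the names of the result. The counts would then be correct and the id.x strings would only be labels. The vectors `x` and `y` below are just the two example rows.

```r
# The two example rows, as plain character vectors
x <- c("123;145;156", "134;156;187;675")
y <- c("143;156;234;165", "132;145;156;187")

myfun <- function(x, y) {
  length(intersect(strsplit(x, split = ";")[[1]], strsplit(y, split = ";")[[1]]))
}

# Default behaviour: the result is a named vector, labelled with the strings in x
named_counts <- mapply(FUN = myfun, x, y)

# USE.NAMES = FALSE returns the bare counts
counts <- mapply(FUN = myfun, x, y, USE.NAMES = FALSE)
print(counts)

# Alternative: keep the default and drop the names afterwards
counts2 <- unname(named_counts)
```

Either form should give only the numbers. Whether row-by-row `mapply` is fast enough for 60M rows is a separate question that this sketch does not address.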