Question

Like many, I am new to R. I have a large data set (500M+ rows) that I have read with fread into a data.table, logStats, which looks like this:

 head(logStats,15)

                   time pid     mean
 1: 2014-03-10 00:00:00 998 3.570000
 2: 2014-03-10 00:00:00  11 4.090000
 3: 2014-03-10 00:00:00 345 3.380000
 4: 2014-03-10 00:05:00 998 4.866667
 5: 2014-03-10 00:05:00  11 3.677778
 6: 2014-03-10 00:05:00 345 4.487500
 7: 2014-03-10 00:10:00 345 4.833333
 8: 2014-03-10 00:10:00 998 4.333333
 9: 2014-03-10 00:10:00  11 6.977778
10: 2014-03-10 00:15:00 345 3.900000
11: 2014-03-10 00:15:00 998 3.200000
12: 2014-03-10 00:15:00  11 6.030000
13: 2014-03-10 00:20:00 998 4.550000
14: 2014-03-10 00:20:00  11 4.030000
15: 2014-03-10 00:20:00 345 6.060000

There is a second, very small data.table (360 rows) with two columns, which decodes a 'pid' value into something a bit friendlier to read. The 'pid' value can be either numeric or character.

For example:

pidLookupTable<-data.table(pid=c(998,11,345),pidName=c("Apple","Bannana","Cinnamon"))

which produces :

   pid  pidName
1: 998    Apple
2:  11  Bannana
3: 345 Cinnamon

I want an expression to be able to add a column to data.table logStats which has the pidName for that row pid.

I should get something like :

                   time pid     mean  pidName
 1: 2014-03-10 00:00:00 998 3.570000    Apple
 2: 2014-03-10 00:00:00  11 4.090000   Banana
 3: 2014-03-10 00:00:00 345 3.380000 Cinnamon
 4: 2014-03-10 00:05:00 998 4.866667    Apple
 5: 2014-03-10 00:05:00  11 3.677778   Banana
 6: 2014-03-10 00:05:00 345 4.487500 Cinnamon
 7: 2014-03-10 00:10:00 345 4.833333 Cinnamon
 8: 2014-03-10 00:10:00 998 4.333333    Apple
 9: 2014-03-10 00:10:00  11 6.977778   Banana
10: 2014-03-10 00:15:00 345 3.900000 Cinnamon
11: 2014-03-10 00:15:00 998 3.200000    Apple
12: 2014-03-10 00:15:00  11 6.030000   Banana
13: 2014-03-10 00:20:00 998 4.550000    Apple
14: 2014-03-10 00:20:00  11 4.030000   Banana
15: 2014-03-10 00:20:00 345 6.060000 Cinnamon

I wrote a function :

pidNameLookup<-function(x) { 
  return(pidLookupTable[pidLookupTable$pid==x,pidName]) 
}

and then ran:

logStats[,pidName:=pidNameLookup(pid)]

But this only converts the first 3 rows and puts NA for the rest of the values:

   logStats[1:1000]
               date     time pid value           timestamp mean  pidName
      1: 10-03-2014 00:00:12 998   5.5 2014-03-10 00:00:12 3.57    Apple
      2: 10-03-2014 00:00:17  11   2.1 2014-03-10 00:00:17 4.09  Bannana
      3: 10-03-2014 00:00:22 345   5.7 2014-03-10 00:00:22 3.38 Cinnamon
      4: 10-03-2014 00:00:47 998   1.0 2014-03-10 00:00:47 3.57       NA
      5: 10-03-2014 00:00:55  11   0.3 2014-03-10 00:00:55 4.09       NA
      ---                                                                
      996: 10-03-2014 02:49:37 345   0.7 2014-03-10 02:49:37 5.30       NA
      997: 10-03-2014 02:50:01 998   9.9 2014-03-10 02:50:01 5.30       NA
      998: 10-03-2014 02:50:08  11   7.0 2014-03-10 02:50:08 7.00       NA
      999: 10-03-2014 02:50:18 345   2.4 2014-03-10 02:50:18 2.40       NA
     1000: 10-03-2014 02:50:48 998   0.7 2014-03-10 02:50:48 5.30       NA 

and gives me the warning message :

Warning message:
In pidLookupTable$pid == x 
  longer object length is not a multiple of shorter object length

The warning message and the incorrect result mean that I am doing something completely wrong.

Help!! This is driving me mental


Solution

I suggest you look at the introduction vignette for data.table (vignette("datatable-intro")), since this is something data.table is explicitly built for.

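Keying both tables on pid and joining them returns the log rows with the matching pidName attached, and should be much, much faster than a per-row lookup. Note that setkey sorts logStats by pid and that the join returns a new table rather than modifying logStats: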

setkey(logStats, "pid")
setkey(pidLookupTable, "pid")
logStats[pidLookupTable]
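
If the goal is to add the column to logStats in place, as the question asks, an update join is one option. The sketch below is not part of the original answer; it assumes a data.table version that supports the on= argument (1.9.6 or later) and uses a four-row stand-in for logStats built from the question's sample values. It also shows why the original function failed: pidLookupTable$pid == x compares the 3-element lookup column against the entire pid column with recycling, so it never looks up one pid per row. A vectorised match() does that lookup correctly.

```r
library(data.table)

# Small stand-in for the 500M-row table (values copied from the question).
logStats <- data.table(
  time = as.POSIXct(c("2014-03-10 00:00:00", "2014-03-10 00:00:00",
                      "2014-03-10 00:00:00", "2014-03-10 00:05:00"),
                    tz = "UTC"),
  pid  = c(998, 11, 345, 998),
  mean = c(3.57, 4.09, 3.38, 4.866667)
)

# Lookup table exactly as defined in the question (spelling included).
pidLookupTable <- data.table(pid = c(998, 11, 345),
                             pidName = c("Apple", "Bannana", "Cinnamon"))

# Update join: match rows on pid and assign pidName by reference.
# logStats is neither copied nor reordered; pids with no match get NA.
logStats[pidLookupTable, on = "pid", pidName := i.pidName]

# Equivalent vectorised lookup with base R's match(), stored in a
# second (hypothetical) column only so the two results can be compared.
logStats[, pidName2 := pidLookupTable$pidName[match(pid, pidLookupTable$pid)]]

print(logStats)
```

Unlike setkey followed by logStats[pidLookupTable], this keeps the original row order and keeps rows whose pid has no entry in the lookup table.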
Licensed under: CC-BY-SA with attribution