R data.table rbindlist

https://stackoverflow.com/questions/17837213

04-06-2022
|

Question

I am having trouble with a reasonable sized data.table containing 30 or so columns: (note I am using dummy values below)

Using rbindlist(list(dat, dat2)) to add a new data.table with same fields with another 50000 rows produces an incorrect new master data.table.

Is there a simple and fast solution to add new rows to a data.table where the column fields all match?

To simplify, I have created a dummy dataset.

master.df <- data.frame(id = letters[1:10], 
                    mpg = sample(c(20,22), 10, replace = TRUE),
                    cyl = sample(c(4,8), 10, replace = TRUE),
                    disp = sample(c(160,300), 10, replace = TRUE),
                    factor = sample(c(TRUE, FALSE), 10, replace = TRUE),   
                    hp = sample(c(20,22), 10, replace = TRUE))

newTable.df <- data.frame(id = letters[11:15], 
                        mpg = sample(c(20,22), 5, replace = TRUE),
                        cyl = sample(c(4,8), 5, replace = TRUE),
                        disp = sample(c(160,300), 5, replace = TRUE),
                        factor = sample(c(TRUE, FALSE), 10, replace = TRUE), 
                        hp = sample(c(20,22), 5, replace = TRUE))

library(data.table)

dat = as.data.table(master.df)
dat2 = as.data.table(newTable.df)

Using rbind(dat,dat2) outputs duplicate dat2. (expected should be total 15 rows)

I read forums for better solutions and something came up with rbindlist but that does not look like it does the trick either. Same output as rbind

Is there a fast solution that binds dat2 to dat without the duplication?

output for rbind and rbindlist

    id mpg cyl disp factor hp
 1:  a  22   8  300  FALSE 20
 2:  b  20   8  300   TRUE 20
 3:  c  20   8  160  FALSE 20
 4:  d  20   4  300   TRUE 22
 5:  e  22   4  160  FALSE 22
 6:  f  22   4  160   TRUE 22
 7:  g  20   8  160  FALSE 20
 8:  h  22   4  300  FALSE 20
 9:  i  22   4  160  FALSE 20
10:  j  22   8  160   TRUE 22
11:  k  22   8  160  FALSE 20
12:  l  22   8  160   TRUE 20
13:  m  20   8  300   TRUE 20
14:  n  22   4  300  FALSE 20
15:  o  20   8  160  FALSE 20
16:  k  22   8  160  FALSE 20
17:  l  22   8  160  FALSE 20
18:  m  20   8  300  FALSE 20
19:  n  22   4  300   TRUE 20
20:  o  20   8  160   TRUE 20

Solution

Try using unique:

 unique(rbind(dat1, dat2))

OTHER TIPS

Your problem is that when creating newTable.df you have the following line:

factor = sample(c(TRUE, FALSE), 10, replace = TRUE)

This causes the resulting table to have 10 rows (instead of 5 as you intended). Once you change this 10 into 5, the dat2 data.table will have 5 rows, and rbind(dat, dat2) will have 15 rows.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow