Different results applying merge in R over individual data frames and over one list of data frames

StackOverflow https://stackoverflow.com/questions/20820968

  •  22-09-2022

Question

Hi everybody. I am working with a list of data frames in R and I want to merge them one by one. One solution I found is to use Reduce() with merge(), but I don't get the same result as when I merge the data frames one at a time. My list of data frames is global and it has the following structure (I include the dput() version of my list at the end):

global
$a1
   ID Value Products z1
1 001     1        3  1
2 002     2        2  1
3 003     3        0  1
4 004     4        1  1
5 005     5        1  1
6 006     6        6  1
7 007     7        7  1
8 009     8        1  1
9 010     9        1  1

$a2
    ID Value Products z2
1  001     1        3  2
2  002     2        2  2
3  003     3        0  2
4  004     4        1  2
5  005     5        1  2
6  006     6        6  2
7  011    10        5  2
8  012    11        5  2
9  007     7        7  2
10 009     8        1  2
11 010     9        1  2

$a3
    ID Value Products z3
1  001     1        3  3
2  002     2        2  3
3  012    11        5  3
4  013    11        1  3
5  014    11        2  3
6  003     3        0  3
7  004     4        1  3
8  005     5        1  3
9  006     6        6  3
10 007     7        7  3
11 009     8        1  3
12 010     9        1  3
13 011    10        5  3

$a4
    ID Value Products z4
1  001     1        3  4
2  002     2        2  4
3  012    11        5  4
4  013    11        1  4
5  014    11        2  4
6  003     3        0  4
7  004     4        1  4
8  005     5        1  4
9  006     6        6  4
10 007     7        7  4
11 009     8        1  4
12 010     9        1  4
13 011    10        5  4
14 015    12        3  4
15 016    12        3  4

$a5
    ID Value Products z5
1  001     1        3  5
2  002     2        2  5
3  003     3        0  5
4  004     4        1  5
5  016    12        3  5
6  017    14        2  5
7  005     5        1  5
8  006     6        6  5
9  007     7        7  5
10 009     8        1  5
11 010     9        1  5
12 011    10        5  5
13 012    11        5  5
14 013    11        1  5
15 014    11        2  5
16 015    12        3  5
17 018    14        2  5

I am merging each data frame in global with the data frames that precede it, and for this I used the following code to create a new list named listag:

listag=Reduce(function(x, y) merge(x,y[,c(1,4)],by=intersect(names(x)[1],names(y)[1]),all.x=TRUE),global,accumulate=TRUE)

I used the argument all.x=TRUE in merge() because I want each data frame to keep its original number of rows (a1=9, a2=11, a3=13, a4=15, a5=17). After this I split global into individual data frames to check that the code above works correctly, and I found differences. To split the data frames I used this code:

list2env(global, envir=.GlobalEnv)

I got my five data frames. Now I am going to show what I want with data frames a4 and a5. First I used the following code to merge a4 with a1, a2, a3 and a4:

Final41=merge(a4,a1[,c(1,4)],by=intersect(names(a4)[1],names(a1)[1]),all.x=TRUE)
Final42=merge(Final41,a2[,c(1,4)],by=intersect(names(Final41)[1],names(a2)[1]),all.x=TRUE)
Final43=merge(Final42,a3[,c(1,4)],by=intersect(names(Final42)[1],names(a3)[1]),all.x=TRUE)
Final4=merge(Final43,a4[,c(1,4)],by=intersect(names(Final43)[1],names(a4)[1]),all.x=TRUE)

The result of this code is:

Final4

    ID Value Products z4.x z1 z2 z3 z4.y
1  001     1        3    4  1  2  3    4
2  002     2        2    4  1  2  3    4
3  003     3        0    4  1  2  3    4
4  004     4        1    4  1  2  3    4
5  005     5        1    4  1  2  3    4
6  006     6        6    4  1  2  3    4
7  007     7        7    4  1  2  3    4
8  009     8        1    4  1  2  3    4
9  010     9        1    4  1  2  3    4
10 011    10        5    4 NA  2  3    4
11 012    11        5    4 NA  2  3    4
12 013    11        1    4 NA NA  3    4
13 014    11        2    4 NA NA  3    4
14 015    12        3    4 NA NA NA    4
15 016    12        3    4 NA NA NA    4

Here the argument all.x=TRUE works fine, because I keep the original number of observations in a4 (15). When I extract the 4th element of listag I get this:

f4l=listag[[4]]
f4l

   ID Value Products z1 z2 z3 z4
1 001     1        3  1  2  3  4
2 002     2        2  1  2  3  4
3 003     3        0  1  2  3  4
4 004     4        1  1  2  3  4
5 005     5        1  1  2  3  4
6 006     6        6  1  2  3  4
7 007     7        7  1  2  3  4
8 009     8        1  1  2  3  4
9 010     9        1  1  2  3  4

For merge() inside Reduce() I am also using all.x=TRUE, but I don't get the same result and the number of observations is wrong. After applying the combination of Reduce() and merge() I would like to get the result of Final4. The same applies to the rest of the data frames in listag after applying Reduce() and merge() combined over global. I would like to get this result for each data frame in listag (in this case, for the 4th data frame, it would be):

    ID Value Products z1 z2 z3 z4
1  001     1        3  1  2  3  4
2  002     2        2  1  2  3  4
3  003     3        0  1  2  3  4
4  004     4        1  1  2  3  4
5  005     5        1  1  2  3  4
6  006     6        6  1  2  3  4
7  007     7        7  1  2  3  4
8  009     8        1  1  2  3  4
9  010     9        1  1  2  3  4
10 011    10        5 NA  2  3  4
11 012    11        5 NA  2  3  4
12 013    11        1 NA NA  3  4
13 014    11        2 NA NA  3  4
14 015    12        3 NA NA NA  4
15 016    12        3 NA NA NA  4

I don't know what is wrong in my code when I combine Reduce() and merge(). I am using all.x=TRUE just as I do when I merge the data frames one by one. Could you help me with this? Maybe I have to add another argument to the combination of Reduce() and merge() to get my result, or maybe there is another way, such as using lapply or llply from the plyr package over global. The dput() version of global is the following:

structure(list(a1 = structure(list(ID = c("001", "002", "003", 
"004", "005", "006", "007", "009", "010"), Value = c(1, 2, 3, 
4, 5, 6, 7, 8, 9), Products = c(3, 2, 0, 1, 1, 6, 7, 1, 1), z1 = c(1, 
1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("ID", "Value", "Products", 
"z1"), row.names = c(NA, 9L), class = "data.frame"), a2 = structure(list(
    ID = c("001", "002", "003", "004", "005", "006", "011", "012", 
    "007", "009", "010"), Value = c(1, 2, 3, 4, 5, 6, 10, 11, 
    7, 8, 9), Products = c(3, 2, 0, 1, 1, 6, 5, 5, 7, 1, 1), 
    z2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)), .Names = c("ID", 
"Value", "Products", "z2"), row.names = c(NA, 11L), class = "data.frame"), 
    a3 = structure(list(ID = c("001", "002", "012", "013", "014", 
    "003", "004", "005", "006", "007", "009", "010", "011"), 
        Value = c(1, 2, 11, 11, 11, 3, 4, 5, 6, 7, 8, 9, 10), 
        Products = c(3, 2, 5, 1, 2, 0, 1, 1, 6, 7, 1, 1, 5), 
        z3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), .Names = c("ID", 
    "Value", "Products", "z3"), row.names = c(NA, 13L), class = "data.frame"), 
    a4 = structure(list(ID = c("001", "002", "012", "013", "014", 
    "003", "004", "005", "006", "007", "009", "010", "011", "015", 
    "016"), Value = c(1, 2, 11, 11, 11, 3, 4, 5, 6, 7, 8, 9, 
    10, 12, 12), Products = c(3, 2, 5, 1, 2, 0, 1, 1, 6, 7, 1, 
    1, 5, 3, 3), z4 = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
    4, 4)), .Names = c("ID", "Value", "Products", "z4"), row.names = c(NA, 
    15L), class = "data.frame"), a5 = structure(list(ID = c("001", 
    "002", "003", "004", "016", "017", "005", "006", "007", "009", 
    "010", "011", "012", "013", "014", "015", "018"), Value = c(1, 
    2, 3, 4, 12, 14, 5, 6, 7, 8, 9, 10, 11, 11, 11, 12, 14), 
        Products = c(3, 2, 0, 1, 3, 2, 1, 6, 7, 1, 1, 5, 5, 1, 
        2, 3, 2), z5 = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
        5, 5, 5, 5, 5)), .Names = c("ID", "Value", "Products", 
    "z5"), row.names = c(NA, 17L), class = "data.frame")), .Names = c("a1", 
"a2", "a3", "a4", "a5")) 

Many thanks for your help.

Solution

Several things:

First, it is normal that your Reduce merge and your manual merge give different results, since they are not performed in the same order. Reduce processes the tables in the order 1 to 4, whereas, for a reason I do not quite understand, your manual merges are performed in the order 4, 1, 2, 3, 4.
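
The order in which Reduce() applies its function can be made visible with a small sketch that folds over the table names instead of the tables themselves. The helper show_merge is a hypothetical name introduced only for this illustration; it builds a string describing each call rather than merging anything:

```r
# Hypothetical helper: describes the call as text instead of performing a merge
show_merge <- function(x, y) paste0("merge(", x, ", ", y, ")")

# Same left-to-right fold as in the question, but over names, not data frames
steps <- Reduce(show_merge, c("a1", "a2", "a3", "a4"), accumulate = TRUE)

# Each element shows how the corresponding element of listag was built
print(steps)
```

In every step the accumulated result (which starts as a1) is the x argument and the next table is the y argument, so a4 only ever arrives as y.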

Second, the difference you observe arises because the a4 table has additional IDs, and they are lost in the Reduce merge: you use all.x=TRUE, but a4 comes in as the "y" table. So you should use all=TRUE instead:

listag <- Reduce(function(x, y) merge(x, y[, c(1, 4)],
          by = intersect(names(x)[1], names(y)[1]), all = TRUE), global)
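
The effect of the two arguments can be checked on a much smaller list. The following is a minimal sketch on made-up data (toy is a hypothetical two-table stand-in with the same column layout as global, not the question's data). It also shows one thing to watch for: because only columns 1 and 4 of y are merged, IDs that first appear in a later table get NA in Value and Products under all = TRUE. The last part is one possible way, assuming the goal is the per-table shape shown in the question, to keep those values by starting each fold from the k-th table itself:

```r
# Hypothetical toy data with the same layout as the question's tables
toy <- list(
  a1 = data.frame(ID = c("001", "002"), Value = c(1, 2),
                  Products = c(3, 2), z1 = 1, stringsAsFactors = FALSE),
  a2 = data.frame(ID = c("001", "002", "011"), Value = c(1, 2, 10),
                  Products = c(3, 2, 5), z2 = 2, stringsAsFactors = FALSE)
)

# Left join as in the question: only rows of the accumulated table x survive
left <- Reduce(function(x, y) merge(x, y[, c(1, 4)], by = "ID", all.x = TRUE), toy)

# Full join as in the answer: IDs present only in y are kept as well
full <- Reduce(function(x, y) merge(x, y[, c(1, 4)], by = "ID", all = TRUE), toy)

nrow(left)                 # ID "011" is missing here
nrow(full)                 # ID "011" is present here
full[full$ID == "011", ]   # but Value, Products and z1 are NA for it

# Assumed alternative: for table k, start from its own ID/Value/Products
# columns and left-join the z column of tables 1..k onto it
wanted <- lapply(seq_along(toy), function(k) {
  Reduce(function(x, y) merge(x, y[, c(1, 4)], by = "ID", all.x = TRUE),
         toy[seq_len(k)], toy[[k]][, 1:3])
})
names(wanted) <- names(toy)
wanted$a2
```

With this variant each element keeps the row count of its own table, and only the z columns of earlier tables can be NA.
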
Licensed under: CC-BY-SA with attribution