Merge / match two variables with one group of variables from another dataframe

https://stackoverflow.com/questions/22789336

25-06-2023

Question

I have two data.frames, df.1 and df.2, that I would like to merge, or otherwise select data from, to create a new data.frame. df.1 contains information about each individual (ID), sampling event (Event), Site, and sample number (Sample). The tricky part for me is that the Site and corresponding Sample differ between ID-Event pairings. For example, F3-3 has Site "Plum" for Sample "1" and M6-3 has Site "Pear" for Sample "1".

data.frame df.1

df.2 has Sample1 and Sample2, which correspond to the Sample information in df.1 by way of the ID-Event pairing.

data.frame df.2

I'd like to match/merge the information between these two data.frames. Essentially, I want to get the "word" from Site in df.1 that matches each Sample number. An example of the desired result (df.3) is below.

data.frame df.3

Within each ID-Event pairing, a Site corresponds to only one Sample (e.g. "Apple" will correspond to "1", not to both "1" and "4"). I know I could use merge if I were only matching, for example, Sample1 or Sample2, but I am not sure how to do this with both to populate Site1 and Site2 with the correctly matched word.

df.1 <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("F1", 
"F3", "M6"), class = "factor"), Sex = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), .Label = c("F", "M"), class = "factor"), Event = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 
4L, 4L, 4L, 4L), Site = structure(c(1L, 3L, 9L, 7L, 8L, 10L, 
2L, 6L, 4L, 5L, 1L, 9L, 7L, 8L, 10L, 5L, 10L, 2L, 6L, 4L, 5L, 
1L, 9L, 2L, 6L, 4L, 5L, 1L, 8L, 3L, 10L, 4L, 2L, 6L, 4L, 5L, 
1L), .Label = c("Apple", "Banana", "Grape", "Guava", "Kiwi", 
"Mango", "Orange", "Peach", "Pear", "Plum"), class = "factor"), 
    Sample = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 
    3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 
    6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L)), .Names = c("ID", 
"Sex", "Event", "Site", "Sample"), class = "data.frame", row.names = c(NA, 
-37L))
 #
 df.2 <- structure(list(Sample1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 2L, 2L, 2L), Sample2 = c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
 3L, 4L, 5L), V1 = c(0.12, 0.497, 0.715, 0, 0.001, 0, 0.829, 0, 
 0, 0.001, 0, 0.829), V2 = c(0.107, 0.273, 0.595, 0, 0.004, 0, 
 0.547, 0.001, 0.001, 0.107, 0.273, 0.595), ID = structure(c(1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F1", 
 "M6"), class = "factor"), Sex = structure(c(1L, 1L, 1L, 1L, 1L, 
  1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F", "M"), class = "factor"), 
  Event = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L)), .Names = c("Sample1", 
  "Sample2", "V1", "V2", "ID", "Sex", "Event"), class = "data.frame", row.names = c(NA, 
    -12L))
 #
 df.3 <- structure(list(Sample1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 2L, 2L, 2L), Sample2 = c(2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
 3L, 4L, 5L), V1 = c(0.12, 0.497, 0.715, 0, 0.001, 0, 0.829, 0, 
 0, 0.001, 0, 0.829), V2 = c(0.107, 0.273, 0.595, 0, 0.004, 0, 
 0.547, 0.001, 0.001, 0.107, 0.273, 0.595), Site1 = structure(c(1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Apple", 
 "Banana"), class = "factor"), Site2 = structure(c(2L, 8L, 6L, 
 7L, 9L, 1L, 5L, 3L, 4L, 5L, 3L, 4L), .Label = c("Banana", "Grape", 
 "Guava", "Kiwi", "Mango", "Orange", "Peach", "Pear", "Plum"), class = "factor"), 
 ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
 2L, 2L), .Label = c("F1", "M6"), class = "factor"), Sex = structure(c(1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("F", 
 "M"), class = "factor"), Event = c(1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 3L, 3L, 3L)), .Names = c("Sample1", "Sample2", 
 "V1", "V2", "Site1", "Site2", "ID", "Sex", "Event"), class = "data.frame", row.names =   c(NA, -12L))

Solution

Two merges should do it:

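# df.1[, 3:5] holds Event, Site and Sample; unique() drops repeated combinations.
# After the second merge, Site.x is the Site for Sample1 and Site.y is the Site for Sample2.
first <- merge(df.2, unique(df.1[, 3:5]), by.x = c("Sample1", "Event"), by.y = c("Sample", "Event"), all.x = TRUE)
second <- merge(first, unique(df.1[, 3:5]), by.x = c("Sample2", "Event"), by.y = c("Sample", "Event"), all.x = TRUE)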

print(second)
   Sample2 Event Sample1    V1    V2 ID Sex Site.x Site.y
1       10     1       1 0.000 0.001 F1   F  Apple   Kiwi
2        2     1       1 0.120 0.107 F1   F  Apple  Grape
3        3     1       1 0.497 0.273 F1   F  Apple   Pear
4        3     3       2 0.001 0.107 M6   M Banana  Mango
5        4     1       1 0.715 0.595 F1   F  Apple Orange
6        4     3       2 0.000 0.273 M6   M Banana  Guava
7        5     1       1 0.000 0.000 F1   F  Apple  Peach
8        5     3       2 0.829 0.595 M6   M Banana   Kiwi
9        6     1       1 0.001 0.004 F1   F  Apple   Plum
10       7     1       1 0.000 0.000 F1   F  Apple Banana
11       8     1       1 0.829 0.547 F1   F  Apple  Mango
12       9     1       1 0.000 0.001 F1   F  Apple  Guava
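
Note that the merges above key on Sample and Event only. In the question's data, F3 and M6 share Event 3 but have different Sites for Sample 1 (Plum and Pear), so a lookup without ID could return duplicate rows if df.2 ever referenced that sample. The following is a minimal sketch of a variant that also keys on ID, names the new columns Site1 and Site2 directly, and restores the original row order (merge() sorts its result on the key columns by default). The toy data, the helper column `row`, and the names `lookup`, `lookup1`, `lookup2` and `out` are assumptions made for illustration; they are not part of the original answer.

```r
# Toy stand-ins for df.1 and df.2 (hypothetical values, far smaller than the question's data).
df.1 <- data.frame(
  ID     = c("F3", "F3", "M6", "M6"),
  Event  = c(3L, 3L, 3L, 3L),
  Site   = c("Plum", "Banana", "Pear", "Banana"),
  Sample = c(1L, 2L, 1L, 2L),
  stringsAsFactors = FALSE
)
df.2 <- data.frame(
  Sample1 = c(1L, 1L),
  Sample2 = c(2L, 2L),
  V1      = c(0.1, 0.2),
  ID      = c("M6", "F3"),
  Event   = c(3L, 3L),
  stringsAsFactors = FALSE
)

# One lookup row per ID-Event-Sample combination.
lookup <- unique(df.1[, c("ID", "Event", "Sample", "Site")])

# Rename the lookup columns so each merge can use a plain `by`
# and produces Site1 / Site2 instead of Site.x / Site.y.
lookup1 <- setNames(lookup, c("ID", "Event", "Sample1", "Site1"))
lookup2 <- setNames(lookup, c("ID", "Event", "Sample2", "Site2"))

# Helper column (an assumption of this sketch) used to restore df.2's row order.
df.2$row <- seq_len(nrow(df.2))

# Left joins: keep every row of df.2 even if no Site is found.
out <- merge(df.2, lookup1, by = c("ID", "Event", "Sample1"), all.x = TRUE)
out <- merge(out, lookup2, by = c("ID", "Event", "Sample2"), all.x = TRUE)

# Back to the original order, then drop the helper column.
out <- out[order(out$row), names(out) != "row"]
print(out)
```

With the question's data, the same calls should apply unchanged to df.1 and df.2, since both contain ID, Event and the Sample columns. Column order in the result will differ from df.3 and can be rearranged afterwards if it matters.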

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow