Question

I am in a circumstance where I need to merge two data frames together that each contain one observation about a research subject. Unfortunately, the data capture system allowed the end-user to enter some variables on two screens (for instance, the gender was captured at two timepoints, and should not change). There are no database-side checks to confirm that the data is consistent between screens, so we are checking in the post-processing.

What I would like to do is use the built-in R merge() function to merge the data frames, with the all=TRUE option so that I get two rows where the shared variables do not match, and then to have a single column in the resultant data frame that tells me the source of the row (either from X or Y in the merge). As near as I can tell, there's nothing like that in the merge() function, so I am trying to write my own wrapper for merge() that will do this.

Example:

example_df1 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","F","M","M","F"),
                          weight=c(120,130,110,114,144),
                          score=c(10,12,11,13,11))

example_df2 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","M","M","M","F"),
                          weight=c(120,130,110,117,144),
                          site1=c(13,18,23,12,4),
                          site2=c(3,7,8,11,0),
                          site3=c(31,28,12,29,40))

merge(x=example_df1,y=example_df2,all=TRUE)

  subject_id gender weight score site1 site2 site3
1        101      M    120    10    13     3    31
2        102      F    130    12    NA    NA    NA
3        102      M    130    NA    18     7    28
4        103      M    110    11    23     8    12
5        104      M    114    13    NA    NA    NA
6        104      M    117    NA    12    11    29
7        105      F    144    11     4     0    40

Desired output:

  subject_id gender weight score site1 site2 site3 rowsource
1        101      M    120    10    13     3    31      both
2        102      F    130    12    NA    NA    NA         x
3        102      M    130    NA    18     7    28         y
4        103      M    110    11    23     8    12      both
5        104      M    114    13    NA    NA    NA         x
6        104      M    117    NA    12    11    29         y
7        105      F    144    11     4     0    40      both

I need to implement the solution in base R without any special packages if at all possible due to the regulatory environment surrounding the project. My initial thought is to try to use intersect to find the common variables between both example_df1 and example_df2, and then to somehow compare each row of the merge result (within those common variables) against both example_df1 and example_df2 to figure out the source of the row. That seems really unwieldy, so I'd appreciate suggestions on how to improve the efficiency of this kind of task. Thanks!

EDITED TO ADD: If R always consistently puts the X row above the Y row in merges of this type, I suppose that could work too, but I think I'd feel better about something more stable than that.

Solution

I would just add another column before merging to make life easier:

example_df1$source <- "X"
example_df2$source <- "Y"
Merged <- merge(x = example_df1, y = example_df2,
                all = TRUE, by = c("subject_id", "gender", "weight"))
Merged$rowSource <- apply(Merged[c("source.x", "source.y")], 1, 
                          function(x) paste(na.omit(x), collapse = ""))
Merged
#   subject_id gender weight score source.x site1 site2 site3 source.y rowSource
# 1        101      M    120    10        X    13     3    31        Y        XY
# 2        102      F    130    12        X    NA    NA    NA     <NA>         X
# 3        102      M    130    NA     <NA>    18     7    28        Y         Y
# 4        103      M    110    11        X    23     8    12        Y        XY
# 5        104      M    114    13        X    NA    NA    NA     <NA>         X
# 6        104      M    117    NA     <NA>    12    11    29        Y         Y
# 7        105      F    144    11        X     4     0    40        Y        XY

From there, it should be easy to change "XY" to "both" if that is what you prefer in your output, and you can then drop the "source.x" and "source.y" columns.
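As a sketch of that last step (not the only way to do it), the block below rebuilds the question's example data, repeats the merge, recodes "XY" to "both", and drops the two helper columns. The `stringsAsFactors = FALSE` arguments are an addition here so that the columns are plain character vectors on R versions before 4.0.0 as well.

```r
# Rebuild the example data so the snippet runs on its own
example_df1 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "F", "M", "M", "F"),
                          weight = c(120, 130, 110, 114, 144),
                          score = c(10, 12, 11, 13, 11),
                          stringsAsFactors = FALSE)
example_df2 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "M", "M", "M", "F"),
                          weight = c(120, 130, 110, 117, 144),
                          site1 = c(13, 18, 23, 12, 4),
                          site2 = c(3, 7, 8, 11, 0),
                          site3 = c(31, 28, 12, 29, 40),
                          stringsAsFactors = FALSE)

# Tag each data frame, merge, and combine the tags as shown above
example_df1$source <- "X"
example_df2$source <- "Y"
Merged <- merge(x = example_df1, y = example_df2,
                all = TRUE, by = c("subject_id", "gender", "weight"))
Merged$rowSource <- apply(Merged[c("source.x", "source.y")], 1,
                          function(x) paste(na.omit(x), collapse = ""))

# Recode rows found in both inputs, then drop the helper columns
Merged$rowSource[Merged$rowSource == "XY"] <- "both"
Merged$source.x <- NULL
Merged$source.y <- NULL
Merged
```

Assigning `NULL` to a data frame column removes it, so no subsetting by name is needed.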

Other suggestions

This does it all in one merging step and does not modify the original data frames:

mm<-transform(merge(
    x=cbind(example_df1,source="x"),
    y=cbind(example_df2,source="y"),
    all=TRUE, by=intersect(names(example_df1), names(example_df2))),
    source=ifelse(!is.na(source.x) & !is.na(source.y), "both", 
        ifelse(!is.na(source.x), "x", "y")),
    source.x=NULL,
    source.y=NULL
)
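
Since the point of the exercise is to find inconsistent entries, a possible follow-up is to pull out the subjects whose rows did not match. The sketch below rebuilds the example data and `mm` exactly as above; the variable name `conflicts` is just an illustrative choice.

```r
# Example data from the question
example_df1 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "F", "M", "M", "F"),
                          weight = c(120, 130, 110, 114, 144),
                          score = c(10, 12, 11, 13, 11))
example_df2 <- data.frame(subject_id = c(101, 102, 103, 104, 105),
                          gender = c("M", "M", "M", "M", "F"),
                          weight = c(120, 130, 110, 117, 144),
                          site1 = c(13, 18, 23, 12, 4),
                          site2 = c(3, 7, 8, 11, 0),
                          site3 = c(31, 28, 12, 29, 40))

# One-step merge with a source column, as in the answer above
mm <- transform(merge(
    x = cbind(example_df1, source = "x"),
    y = cbind(example_df2, source = "y"),
    all = TRUE, by = intersect(names(example_df1), names(example_df2))),
    source = ifelse(!is.na(source.x) & !is.na(source.y), "both",
        ifelse(!is.na(source.x), "x", "y")),
    source.x = NULL,
    source.y = NULL
)

# Subjects with at least one row that appears in only one input
conflicts <- unique(mm$subject_id[mm$source != "both"])
conflicts
# [1] 102 104
```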

Thanks again for the answers. Once I saw the solution of just using cbind() to attach the source variable to the data frame, it was easy. I wrote a simple function that does it, which I'm sharing here.

merge_with_source <- function(x,y,name.x="X",name.y="Y") {

    # Find the variables that the two data frames have in common
    merge.names <- intersect(names(x),names(y))

    # Next, attach a column to each data frame with the chosen name
    x.df <- cbind(x,datsrc=name.x)
    y.df <- cbind(y,datsrc=name.y)

    # Create a merged data frame on the common names
    merged.df <- merge(x=x.df,
                       y=y.df,
                       all=TRUE,
                       by=merge.names)

    # Eliminate NAs from the data source column
    merged.df[is.na(merged.df$datsrc.x),"datsrc.x"] <- ""
    merged.df[is.na(merged.df$datsrc.y),"datsrc.y"] <- ""

    # Paste the data source columns together to get a single variable
    # Then, note those that are "Both" by replacing the mangled name
    merged.df$datsrc <- paste(merged.df$datsrc.x,merged.df$datsrc.y,sep="")
    merged.df[merged.df$datsrc==paste(name.x,name.y,sep=""),"datsrc"] <- "Both"

    # Remove the data frame-specific variables
    merged.df$datsrc.x <- NULL
    merged.df$datsrc.y <- NULL

    return(merged.df)
}
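
For illustration, here is a compact variant of the same wrapper together with a small usage example. It is only a sketch: it decides the label with `is.na()` and `ifelse()` rather than blanking the NAs, which should also behave when `cbind()` creates the `datsrc` columns as factors (the default before R 4.0.0). The data frames `screen1` and `screen2` are made up for the example, and the sketch assumes neither input already has a column named `datsrc`.

```r
merge_with_source <- function(x, y, name.x = "X", name.y = "Y") {
    # Merge on every column the two data frames share
    merge.names <- intersect(names(x), names(y))
    merged.df <- merge(x = cbind(x, datsrc = name.x),
                       y = cbind(y, datsrc = name.y),
                       all = TRUE, by = merge.names)

    # A row came from an input if that input's marker is not NA
    in.x <- !is.na(merged.df$datsrc.x)
    in.y <- !is.na(merged.df$datsrc.y)
    merged.df$datsrc <- ifelse(in.x & in.y, "Both",
                               ifelse(in.x, name.x, name.y))

    # Remove the data frame-specific marker columns
    merged.df$datsrc.x <- NULL
    merged.df$datsrc.y <- NULL
    merged.df
}

# Hypothetical data: gender for subject 2 differs between the two screens
screen1 <- data.frame(subject_id = c(1, 2), gender = c("M", "F"),
                      stringsAsFactors = FALSE)
screen2 <- data.frame(subject_id = c(1, 2), gender = c("M", "M"),
                      visit = c(3, 4), stringsAsFactors = FALSE)

res <- merge_with_source(screen1, screen2,
                         name.x = "screen1", name.y = "screen2")
res
```

Subject 1 should come back as a single "Both" row, while subject 2 should appear twice, once labelled "screen1" and once "screen2".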
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow