I need to merge two data frames that each contain one observation per research subject. Unfortunately, the data capture system allowed the end user to enter some variables on two screens (for instance, gender was captured at two timepoints and should not change). There are no database-side checks to confirm that the data are consistent between screens, so we are checking in post-processing.
What I would like to do is use the built-in R merge() function to merge the data frames, with the all=TRUE option so that I get two rows where the shared variables do not match, and then to have a single column in the resultant data frame that tells me the source of the row (either from X or Y in the merge). As near as I can tell, there's nothing like that in the merge() function, so I am trying to write my own wrapper for merge() that will do this.
Example:
example_df1 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","F","M","M","F"),
                          weight=c(120,130,110,114,144),
                          score=c(10,12,11,13,11))
example_df2 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","M","M","M","F"),
                          weight=c(120,130,110,117,144),
                          site1=c(13,18,23,12,4),
                          site2=c(3,7,8,11,0),
                          site3=c(31,28,12,29,40))
merge(x=example_df1,y=example_df2,all=TRUE)
  subject_id gender weight score site1 site2 site3
1        101      M    120    10    13     3    31
2        102      F    130    12    NA    NA    NA
3        102      M    130    NA    18     7    28
4        103      M    110    11    23     8    12
5        104      M    114    13    NA    NA    NA
6        104      M    117    NA    12    11    29
7        105      F    144    11     4     0    40
Desired output:
  subject_id gender weight score site1 site2 site3 rowsource
1        101      M    120    10    13     3    31      both
2        102      F    130    12    NA    NA    NA         x
3        102      M    130    NA    18     7    28         y
4        103      M    110    11    23     8    12      both
5        104      M    114    13    NA    NA    NA         x
6        104      M    117    NA    12    11    29         y
7        105      F    144    11     4     0    40      both
I need to implement the solution in base R without any special packages if at all possible due to the regulatory environment surrounding the project. My initial thought is to use intersect() to find the variables common to example_df1 and example_df2, and then to somehow compare each row of the merge result (within those common variables) against both example_df1 and example_df2 to figure out the source of the row. That seems really unwieldy, so I'd appreciate suggestions on how to improve the efficiency of this kind of task. Thanks!
EDITED TO ADD: If R always consistently puts the X row above the Y row in merges of this type, I suppose that could work too, but I think I'd feel better about something more stable than that.
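For reference, here is a minimal base-R sketch of the kind of wrapper I have in mind. It avoids the row-by-row comparison by tagging each data frame with a marker column before merging and then reading the markers back afterwards. The function name merge_with_source and the marker column names .in_x and .in_y are hypothetical names I made up for illustration, and the sketch assumes neither input already has columns with those names or a column called rowsource. As far as I can tell, merge() sorts the result on the by columns by default, so the X-above-Y ordering in my example looks like a side effect of the data rather than something to rely on.

```r
# Sketch only: merge_with_source, .in_x and .in_y are hypothetical names.
# Assumes x and y do not already contain columns with those names.
merge_with_source <- function(x, y, ...) {
  # Tag every row of each input; the markers are not shared columns,
  # so they do not become part of the default merge key.
  x$.in_x <- TRUE
  y$.in_y <- TRUE

  # Full outer merge; extra arguments such as `by` are passed through
  # (do not pass `all` again via ...).
  m <- merge(x, y, all = TRUE, ...)

  # A marker is NA exactly when the row had no match on that side.
  in_x <- !is.na(m$.in_x)
  in_y <- !is.na(m$.in_y)
  m$rowsource <- ifelse(in_x & in_y, "both", ifelse(in_x, "x", "y"))

  # Drop the helper columns before returning.
  m$.in_x <- NULL
  m$.in_y <- NULL
  m
}
```

If this works the way I expect, calling merge_with_source(example_df1, example_df2) on the example data should reproduce the desired output above, with rowsource as the last column.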