Question

I have a data frame (DF) that looks as follows:

structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE", 
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1", 
"1"), VALUE3 = c("2", "1", "1", "2", "1"), VALUE4 = c("2", "1", 
"2", "1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3", 
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")

Uniqueness in this case is defined by ID and REPORTER. So the DF above contains a duplicate for the ID 123 and REPORTER ONE and the ID 789 and REPORTER THREE. Since I cannot tell which values of VALUE1 to VALUE4 are the correct ones, I like to set all values to NA, that differ within a duplicate.

This means I first have to identify the columns of VALUE that contain different values. These are the ones to be set to NA. For the rest I like to keep the data since here I can tell the value is correct.

The expected output would look like this:

structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE", 
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1", 
"1"), VALUE3 = c(NA, NA, "1", NA, NA), VALUE4 = c(NA, NA, "2", 
"1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3", 
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")

The goal is to ensure data quality. I don't like to just remove the problem cases since I can use the not differing values for analysis. But I also do not like to just use one of the rows because this would lead to wrong conclusions if I had chosen the wrong values.

How can I do this?

Was it helpful?

Solution

I think this is what you are looking for:

library(reshape2)
DFL <- melt(cbind(rn = 1:nrow(DF), DF), id.vars=c("rn", "ID", "REPORTER"))
DFL$value2 <- ave(DFL$value, DFL[c("ID", "REPORTER", "variable")], 
                  FUN = function(x) {
                    ifelse(length(unique(x)) > 1, NA, x)
                  })
dcast(DFL, rn + ID + REPORTER ~ variable, value.var = "value2")
#   rn  ID REPORTER VALUE1 VALUE3 VALUE4
# 1  1 123      ONE      1   <NA>   <NA>
# 2  2 123      ONE      1   <NA>   <NA>
# 3  3 456      TWO      2      1      2
# 4  4 789    THREE      1   <NA>      1
# 5  5 789    THREE      1   <NA>      1

As you can see, I had to add a dummy "rn" supplementary ID variable to make sure that dcast wouldn't just collapse all the values into one row per ID+REPORTER combination.


Update

This is actually also entirely doable with base R's reshape and the ave step described above:

DFL <- reshape(DF, direction = "long", 
               varying = grep("VALUE", names(DF)), sep = "")
DFL <- within(DFL, {
  VALUE <- ave(VALUE, ID, REPORTER, time, FUN = function(x) 
    ifelse(length(unique(x)) > 1, NA, x))
})
reshape(DFL)
#      ID REPORTER id VALUE1 VALUE3 VALUE4
# 1.1 123      ONE  1      1   <NA>   <NA>
# 2.1 123      ONE  2      1   <NA>   <NA>
# 3.1 456      TWO  3      2      1      2
# 4.1 789    THREE  4      1   <NA>      1
# 5.1 789    THREE  5      1   <NA>      1

In the last line above, the attributes from the original reshape statement make it so we don't have to even worry about what arguments we need to put in. :-)

OTHER TIPS

I created a function replaceDifferent() that looks like this:

replaceDifferent <- function(vector){
  max <- max(vector)
  min <- min(vector)
  test <- max == min
  if (!test){
    return(NA)
  }
  else{
    return(min(vector))
  }
}

Then I melted the DF with melt() from the reshape package:

DFmelt <- melt(DF, id = c("ID", "REPORTER"))

After that I was able to apply the new function to the melted data frame wit ddply()

DFres <- ddply(DFmelt, .(ID, REPORTER, variable), function(x){replaceDifferent(x$value)})

To get the result data frame with duplicates removed I called dcast() on DFres:

DFres <- dcast(DFres, ID+REPORTER ~ variable)

This produces a slightly different output than the one I asked for, but is better in the way that I do not have to deal with duplicates anymore.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top