Selecting between duplicate data in a data frame

https://stackoverflow.com/questions/7841867

10-02-2021

Question

Earlier I asked a question about extracting duplicate lines from a data frame. I now need to run a script to decide which of these duplicates to keep in my final data set.

Duplicate entries in this data set have the same 'Assay' and 'Sample' values. Here are the first 10 lines of the new data set I'm working with, which contains my duplicate entries:

     Assay   Sample    Genotype   Data
1  CCT6-002   1486         A        1
2  CCT6-002   1486         G        0
3  CCT6-002   1997         G        0
4  CCT6-002   1997         NA       NA
5  CCT6-002   0050         G        0
6  CCT6-002   0050         G        0
7  CCT6-015   0082         G        0
8  CCT6-015   0082         T        1
9  CCT6-015   0121         G        0
10 CCT6-015   0121         NA       NA

I'd like to run a script that will break these duplicate samples into 4 bins based on the value of 'Data', which can be 1, 0, or NA:

 1) All values for 'Data' are NA
 2) All values for 'Data' are identical, no NA
 3) At least 1 value for 'Data' is not identical, no NA.
 4) At least 1 value for 'Data' is not identical, at least one is NA.

The expected result from the above data would look like this:

Set 1
Null

Set 2
5  CCT6-002   0050         G        0
6  CCT6-002   0050         G        0

Set 3
1  CCT6-002   1486         A        1
2  CCT6-002   1486         G        0
7  CCT6-015   0082         G        0
8  CCT6-015   0082         T        1

Set 4
3  CCT6-002   1997         G        0
4  CCT6-002   1997         NA       NA
9  CCT6-015   0121         G        0
10 CCT6-015   0121         NA       NA

There are cases in which more than 2 "duplicate" data points exist in this data set. I'm not sure where to even start with this, as I'm a newbie to R.

EDIT: With expected data.

Solution

This should be a good start. Depending on how long your dataset is, it may or may not be worth it to optimize this for better speed.

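library(plyr)

# Read data. colClasses has 5 entries because the file's first column holds
# the row names; 'character' keeps the leading zeros in Sample (e.g. 0050).
data = read.table('data.txt', colClasses=c(NA, NA, 'character', NA, NA))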

# Function to pick set
pickSet <- function(x) {
  if(all(is.na(x$Data))) {
    set = 1
  } else if(length(unique(x$Data)) == 1) {
    set = 2
  } else if(!any(is.na(x$Data))) {
    set = 3
  } else {
    set = 4
  }
  data.frame(Set=set)
}

# Identify Set for each combo of Assay and Sample
sets = ddply(data, c('Assay', 'Sample'), pickSet)

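# Merge set info back with data (matched on the two grouping columns)
data = join(data, sets, by=c('Assay', 'Sample'))

# Reformat to a list with one element per set; -5 drops the Set column
sets.list = lapply(1:4, function(x) data[data$Set==x,-5])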

> sets.list
[[1]]
[1] Assay    Sample   Genotype Data    
<0 rows> (or 0-length row.names)

[[2]]
     Assay Sample Genotype Data
5 CCT6-002   0050        G    0
6 CCT6-002   0050        G    0

[[3]]
     Assay Sample Genotype Data
1 CCT6-002   1486        A    1
2 CCT6-002   1486        G    0
7 CCT6-015   0082        G    0
8 CCT6-015   0082        T    1

[[4]]
      Assay Sample Genotype Data
3  CCT6-002   1997        G    0
4  CCT6-002   1997     <NA>   NA
9  CCT6-015   0121        G    0
10 CCT6-015   0121     <NA>   NA
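
If you would rather not depend on plyr, roughly the same result can be sketched in base R with ave() and split(). This is only an illustration: the names dat and classify are made up here and are not part of the answer above, and the ten rows from the question are typed in directly instead of being read from a file.

```r
# The ten example rows from the question, entered inline (assumption: no file).
# Sample is read as character so that leading zeros such as "0050" survive.
dat <- read.table(text = "
Assay Sample Genotype Data
CCT6-002 1486 A 1
CCT6-002 1486 G 0
CCT6-002 1997 G 0
CCT6-002 1997 NA NA
CCT6-002 0050 G 0
CCT6-002 0050 G 0
CCT6-015 0082 G 0
CCT6-015 0082 T 1
CCT6-015 0121 G 0
CCT6-015 0121 NA NA
", header = TRUE,
   colClasses = c("character", "character", "character", "integer"))

# Hypothetical helper: map the Data values of one group to a set number 1-4
classify <- function(v) {
  if (all(is.na(v))) {
    1L                                   # every value is NA
  } else if (any(is.na(v))) {
    4L                                   # some, but not all, values are NA
  } else if (length(unique(v)) == 1L) {
    2L                                   # one distinct value, no NA
  } else {
    3L                                   # several distinct values, no NA
  }
}

# ave() applies the helper within each Assay/Sample group and returns a
# vector aligned with the rows of dat, so no join step is needed
dat$Set <- ave(dat$Data, dat$Assay, dat$Sample,
               FUN = function(v) rep(classify(v), length(v)))

# Using a factor with levels 1:4 keeps empty sets (here Set 1) in the list
sets <- split(dat[, c("Assay", "Sample", "Genotype", "Data")],
              factor(dat$Set, levels = 1:4))
```

Here sets[["1"]] is an empty data frame, which matches the "Null" entry in the expected output of the question.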

OTHER TIPS

You have asked a question that veers in the direction of asking others to do your entire work for you. A question about a single, specific piece of this project would probably be more likely to attract a response. The piece you are struggling with that is preventing you from starting is a very basic programming skill: the ability to break your problem down into small concrete steps, solve each one individually and then put them together again to solve your original problem.

That skill is also very hard to learn, though. But you have a good start! You have nicely specified the four groups your data can fall into:

All values for 'Data' are NA
All values for 'Data' are identical, no NA
At least 1 value for 'Data' is not identical, no NA.
At least 1 value for 'Data' is not identical, at least one is NA.

Now think about how, given just one subset of your data, you could determine in R which group (1-4) it is in. The following is a sketch of some tools that might be useful for doing this. Build a few subsets and play around in the console until you feel comfortable identifying each group:

(1) Are all values for datSub$Data NAs?

Tools: all and is.na

(2) Only one unique value, not NA?

Tools: length, unique, is.na, any

(3) More than one unique value, no NAs?

Tools: length, unique, any, is.na

(4) More than one unique value, at least one NA?

Tools: length, unique, any, is.na

It may be possible to do this without using all these functions, but they are all potentially useful.
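
To make that console play concrete, here is a small sketch using four made-up vectors that stand in for datSub$Data, one per group. The vector names are purely illustrative.

```r
# Hypothetical stand-ins for datSub$Data, one per group
all_na   <- c(NA, NA)
same     <- c(0, 0)
mixed    <- c(1, 0)
mixed_na <- c(0, NA)

all(is.na(all_na))        # TRUE: every value is NA (group 1)
any(is.na(same))          # FALSE: no NA present
length(unique(same))      # 1: a single distinct value (group 2)
length(unique(mixed))     # 2: more than one distinct value, no NA (group 3)
any(is.na(mixed_na))      # TRUE: at least one NA (group 4)
length(unique(mixed_na))  # 2: unique() counts NA as a value of its own
```

The last line is the one to watch: because unique() keeps NA, counting unique values alone cannot separate group 3 from group 4, so the NA check has to be combined with it.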

Once you know how to determine which group a particular subset should be in, you are ready to wrap that code into a function. My suggestion would be to create a new column holding the value 1-4, depending on which group that subset falls in:

myFun <- function(x){
    if (...){
        x$grp <- 1
    }
    if (...){
        x$grp <- 2
    }
    #etc.

    return(x)
}
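
For reference, one possible way to fill in the blanks is sketched below. It is only one reading of the four groups (any subset with some, but not all, NA values goes to group 4), and the subset datSub is invented for the demonstration.

```r
# One possible completion of the skeleton above; not the only way to write it
myFun <- function(x) {
  d <- x$Data
  if (all(is.na(d))) {
    x$grp <- 1          # all values are NA
  } else if (any(is.na(d))) {
    x$grp <- 4          # some, but not all, values are NA
  } else if (length(unique(d)) == 1) {
    x$grp <- 2          # identical values, no NA
  } else {
    x$grp <- 3          # differing values, no NA
  }
  x
}

# Hypothetical subset: the two rows for Assay CCT6-002, Sample 1997
datSub <- data.frame(Assay = "CCT6-002", Sample = "1997",
                     Genotype = c("G", NA), Data = c(0, NA))
myFun(datSub)$grp       # both rows are assigned to group 4
```

Chaining the conditions with else if means exactly one branch runs, so the order of the checks matters: the NA tests come first because unique() would otherwise treat NA as just another value.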

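Then use ddply (from the plyr package) to apply this function to each subset of your data. Since duplicates are defined by both Assay and Sample, group on both, and store the result so the new grp column is kept:

dat <- ddply(dat, .(Assay, Sample), .fun = myFun)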

And finally split this data frame on its new grp variable:

split(dat,dat$grp)

Hopefully, this general sketch helps to get you started. But you will have problems. Everyone does. If you run into specific problems along the way, feel free to ask another question about that.

Indeed, I see now that John has posted an answer along the lines of my sketch. However, I will post this answer anyway in the hopes that it helps you to analyze future problems.

Licensed under: CC-BY-SA with attribution
