Selecting between duplicate data in a data frame
10-02-2021
Question
Earlier I asked a question about extracting duplicate lines from a data frame. I now need to run a script to decide which of these duplicates to keep in my final data set.
Duplicate entries in this data set have the same 'Assay' and 'Sample' values. Here are the first 10 lines of the new data set I'm working with, which contains my duplicate entries:
Assay Sample Genotype Data
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA
I'd like to run a script that will break these duplicate samples into 4 bins based on the value of 'Data', which can be 1, 0, or NA:
1) All values for 'Data' are NA
2) All values for 'Data' are identical, no NA
3) At least 1 value for 'Data' is not identical, no NA.
4) At least 1 value for 'Data' is not identical, at least one is NA.
The expected result from the above data would look like this:
Set 1
Null
Set 2
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
Set 3
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
Set 4
3 CCT6-002 1997 G 0
4 CCT6-002 1997 NA NA
9 CCT6-015 0121 G 0
10 CCT6-015 0121 NA NA
There are cases in which more than 2 "duplicate" data points exist in this data set. I'm not even sure where to start with this, as I'm a newbie to R.
EDIT: With expected data.
Solution
This should be a good start. Depending on how long your dataset is, it may or may not be worth it to optimize this for better speed.
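library(plyr)
# Read data; the file's first column holds row names, so the third
# colClasses entry keeps Sample as character and preserves leading zeros ("0050")
data = read.table('data.txt', colClasses=c(NA, NA, 'character', NA, NA))
# Function to pick the set for one Assay/Sample group
pickSet <- function(x) {
  if(all(is.na(x$Data))) {
    set = 1   # every value is NA
  } else if(length(unique(x$Data)) == 1) {
    set = 2   # one unique value and, given the check above, no NA
  } else if(!any(is.na(x$Data))) {
    set = 3   # values differ, no NA
  } else {
    set = 4   # values differ, at least one NA
  }
  data.frame(Set=set)
}
# Identify Set for each combo of Assay and Sample
sets = ddply(data, c('Assay', 'Sample'), pickSet)
# Merge set info back with data
data = join(data, sets, by=c('Assay', 'Sample'))
# Reformat to list, dropping the Set column (column 5)
sets.list = lapply(1:4, function(x) data[data$Set==x,-5])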
> sets.list
[[1]]
[1] Assay Sample Genotype Data
<0 rows> (or 0-length row.names)
[[2]]
Assay Sample Genotype Data
5 CCT6-002 0050 G 0
6 CCT6-002 0050 G 0
[[3]]
Assay Sample Genotype Data
1 CCT6-002 1486 A 1
2 CCT6-002 1486 G 0
7 CCT6-015 0082 G 0
8 CCT6-015 0082 T 1
[[4]]
Assay Sample Genotype Data
3 CCT6-002 1997 G 0
4 CCT6-002 1997 <NA> NA
9 CCT6-015 0121 G 0
10 CCT6-015 0121 <NA> NA
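If plyr turns out to be slow on a long data set, or you would rather avoid the dependency, the same grouping can be sketched in base R. This is only one possible alternative, not a replacement for the code above: the data frame is typed in from the question rather than read from data.txt, and the helper name classify is made up for this example.

```r
# Example data copied from the question; Sample is kept as character
# so that leading zeros such as "0050" survive.
dat <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-015"), times = c(6, 4)),
  Sample   = c("1486", "1486", "1997", "1997", "0050", "0050",
               "0082", "0082", "0121", "0121"),
  Genotype = c("A", "G", "G", NA, "G", "G", "G", "T", "G", NA),
  Data     = c(1, 0, 0, NA, 0, 0, 0, 1, 0, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: classify one group's Data values into set 1-4,
# using the same rules as pickSet.
classify <- function(v) {
  if (all(is.na(v))) {
    1
  } else if (length(unique(v)) == 1) {
    2
  } else if (!any(is.na(v))) {
    3
  } else {
    4
  }
}

# ave() applies classify() within each Assay/Sample group and returns
# a vector aligned with the rows of dat.
dat$Set <- ave(dat$Data, dat$Assay, dat$Sample, FUN = classify)

# One data frame per set; the explicit factor levels keep empty sets
# (here set 1) in the list as zero-row data frames.
sets.list <- split(dat[, c("Assay", "Sample", "Genotype", "Data")],
                   factor(dat$Set, levels = 1:4))
```

Here sets.list is named "1" to "4", so sets.list[["3"]] holds the rows whose values differ with no NA.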
OTHER TIPS
You have asked a question that veers in the direction of asking others to do your entire work for you. A question about a single, specific piece of this project would probably be more likely to attract a response. The piece you are struggling with that is preventing you from starting is a very basic programming skill: the ability to break your problem down into small concrete steps, solve each one individually and then put them together again to solve your original problem.
That skill is also very hard to learn, though. But you have a good start! You have nicely specified the four groups your data can fall into:
All values for 'Data' are NA
All values for 'Data' are identical, no NA
At least 1 value for 'Data' is not identical, no NA.
At least 1 value for 'Data' is not identical, at least one is NA.
Next, think about how, given just one subset of your data, you could determine in R which group (1-4) it belongs to. The following is a sketch of some tools that might be useful for doing this. Build a few subsets and play around in the console until you feel comfortable identifying each group:
(1) Are all values for datSub$Data NAs?
Tools: all and is.na
(2) Only one unique value, not NA?
Tools: length, unique, is.na, any
(3) More than one unique value, no NAs?
Tools: length, unique, any, is.na
(4) More than one unique value, at least one NA?
Tools: length, unique, any, is.na
It may be possible to do this without using all these functions, but they are all potentially useful.
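To give a feel for that kind of console play, here is a small sketch of the tools on four made-up vectors. The vector names are hypothetical; each one stands in for the Data column of a single subset.

```r
# Hypothetical stand-ins for datSub$Data, one per group.
allNA   <- c(NA, NA)
same    <- c(0, 0)
differ  <- c(1, 0)
mixedNA <- c(0, NA)

# (1) Are all values NA?
all(is.na(allNA))      # TRUE
all(is.na(mixedNA))    # FALSE

# (2) Only one unique value, and it is not NA?
length(unique(same)) == 1 && !any(is.na(same))      # TRUE

# (3) More than one unique value, no NAs?
length(unique(differ)) > 1 && !any(is.na(differ))   # TRUE

# (4) More than one unique value, at least one NA?
# Note that unique() counts NA as a value of its own.
length(unique(mixedNA))                             # 2
length(unique(mixedNA)) > 1 && any(is.na(mixedNA))  # TRUE
```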
Once you know how to determine which group a particular subset should be in, you are ready to wrap that code into a function. My suggestion would be to create a new column holding the value 1-4, depending on which group that subset falls in:
myFun <- function(x){
  if (...){
    x$grp <- 1
  }
  if (...){
    x$grp <- 2
  }
  #etc.
  return(x)
}
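If you get stuck, one possible way to fill in the skeleton is sketched below. The conditions are my reading of the four groups in the question, not the only way to write them, and datSub is a hypothetical subset typed in by hand. Note that under these rules a subset such as 0, 0, NA lands in group 4, because unique() counts NA as a distinct value.

```r
# One possible filling-in of the skeleton above (conditions are an assumption
# based on the four groups listed in the question).
myFun <- function(x) {
  hasNA   <- any(is.na(x$Data))
  nUnique <- length(unique(x$Data))
  if (all(is.na(x$Data))) {
    x$grp <- 1    # every value is NA
  } else if (nUnique == 1) {
    x$grp <- 2    # a single value, which cannot be NA at this point
  } else if (!hasNA) {
    x$grp <- 3    # values differ, no NA
  } else {
    x$grp <- 4    # values differ, at least one NA
  }
  x
}

# Hypothetical single subset: the two rows for Assay CCT6-002, Sample 1997.
datSub <- data.frame(Assay = "CCT6-002", Sample = "1997",
                     Genotype = c("G", NA), Data = c(0, NA),
                     stringsAsFactors = FALSE)

# Both rows of this subset receive grp 4.
myFun(datSub)
```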
Then use ddply (from the plyr package) to apply this function to each subset of your data. Since duplicates share both 'Assay' and 'Sample', group on both columns and keep the result:
dat <- ddply(dat, .(Assay, Sample), .fun = myFun)
And finally split this data frame on its new grp variable:
split(dat, dat$grp)
Hopefully, this general sketch helps to get you started. But you will have problems. Everyone does. If you run into specific problems along the way, feel free to ask another question about that.
Indeed, I see now that John has posted an answer along the lines of my sketch. However, I will post this answer anyway in the hopes that it helps you to analyze future problems.