Calculating concordance rates in duplicate samples with pandas (Python) or R

https://stackoverflow.com/questions/20602696

02-09-2022

Question

I have a file that looks like this: (note: my actual file has dimensions of 1000x5000, so I made a short version here)

>duplicates

markerid    1A  1B  2A  2B  3A  3B
rs1512      CC  CC  CT  CC  CC  TT
rs1779      TT  TG  TG  TT  --  TG
rs12743     TT  TG  TG  TT  TT  TT
rs13229     CC  GC  CC  --  CC  CC
rs1328      CC  CC  GG  GG  CG  CG

The first column contains ids of markers that each individual was tested for. The subsequent columns contain the individuals tested in duplicates.

For example 1A and 1B are duplicates of sample 1. Same applies to 2A and 2B, and 3A and 3B.

I am trying to obtain the duplicate concordance rate per sample. That is, I want to know the proportion of times that the markerid letters for sample 1A are the same as for sample 1B, then compare sample 2A and 2B and get concordance rates and so on.

So, for example, samples 1A and 1B in the table above match at only 2 of the 5 markerids (rs1512 and rs1328), giving a concordance of 0.4.

I want to generate a final output file that has a very simple format:

>concordance_rate
concordance
0.4
0.2
0.6

Where the first row is the concordance rate for sample 1, second row is concordance rate for sample 2 and so on.

I think the way to do this would be to count the number of times that column 2 matches column 3, divide that by the length of either column, and then repeat this in a loop for each subsequent pair of columns in the data frame. But I am honestly stuck on how to code this properly, so I am asking for help. I am slowly learning programming (in R and with the pandas module in Python), so any help will be greatly appreciated. Thank you.

Solution

This will do the job. Note that my data are not exactly the same as yours: here 1A and 1B match in 3/5 cases and 2A and 2B match in 4/5 cases.

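# Toy data: columns come in adjacent pairs (1A/1B, 2A/2B), one row per marker.
markers <- data.frame(
  "1A" = c("CC", "TT", "TT", "CC", "CC"),
  "1B" = c("CC", "TG", "TT", "CG", "CC"),
  "2A" = c("CC", "TT", "TT", "CC", "CC"),
  "2B" = c("CC", "TT", "TT", "CC", "CG"),
  stringsAsFactors = FALSE,
  check.names = FALSE  # keep the names as "1A", "1B", ... rather than "X1A", "X1B", ...
)

# Take the index of the first column of each pair (1, 3, ...) and compute the
# proportion of rows where it equals the column immediately to its right.
concordance <- sapply(seq(1, ncol(markers), by = 2), function(i) {
  mean(markers[, i] == markers[, i + 1])
})
print(concordance)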

The output is

> print(concordance)
[1] 0.6 0.8
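
Since the question also mentions pandas, the same pairwise comparison can be sketched in Python. This is only a minimal sketch, assuming the duplicates sit in adjacent columns as in the R example; the variable names `first`, `second` and `matches` are illustrative and not part of the original answer.

```python
import pandas as pd

# Same toy data as the R example: adjacent columns are duplicates of one sample.
markers = pd.DataFrame({
    "1A": ["CC", "TT", "TT", "CC", "CC"],
    "1B": ["CC", "TG", "TT", "CG", "CC"],
    "2A": ["CC", "TT", "TT", "CC", "CC"],
    "2B": ["CC", "TT", "TT", "CC", "CG"],
})

first = markers.iloc[:, 0::2]   # columns 1A, 2A, ...
second = markers.iloc[:, 1::2]  # columns 1B, 2B, ...

# Compare the underlying arrays so pandas does not try to align the two
# halves on their (different) column labels. The mean of a boolean column
# is the proportion of matching rows.
matches = first.to_numpy() == second.to_numpy()
concordance = matches.mean(axis=0)

print(concordance)  # [0.6 0.8]
```

Slicing with a step of 2 avoids an explicit loop, but it relies entirely on the column order, so the pairs must already be next to each other.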

This should generalise pretty well to a larger data set. You might want to put in some logic to test that your data frame has an even number of columns.
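
The even-column check suggested above could look something like the following in pandas. This is a sketch under assumptions: the helper name `pairwise_concordance` and the choice to raise a `ValueError` are hypothetical, duplicates are assumed to be in adjacent columns, and the `markerid` column is assumed to have been dropped or set as the index beforehand.

```python
import pandas as pd


def pairwise_concordance(markers: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one concordance rate per adjacent column pair."""
    n_cols = markers.shape[1]
    if n_cols % 2 != 0:
        # An odd column count means at least one sample has no duplicate.
        raise ValueError(f"expected an even number of sample columns, got {n_cols}")
    first = markers.iloc[:, 0::2].to_numpy()
    second = markers.iloc[:, 1::2].to_numpy()
    rates = (first == second).mean(axis=0)
    return pd.DataFrame({"concordance": rates})


# Same toy data as the R example above.
markers = pd.DataFrame({
    "1A": ["CC", "TT", "TT", "CC", "CC"],
    "1B": ["CC", "TG", "TT", "CG", "CC"],
    "2A": ["CC", "TT", "TT", "CC", "CC"],
    "2B": ["CC", "TT", "TT", "CC", "CG"],
})

result = pairwise_concordance(markers)
# Write a single "concordance" column, one row per sample, as CSV text.
print(result.to_csv(index=False))
```

Passing a file path to `to_csv` instead of printing would write the one-column output file described in the question.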

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow