ddply to apply function on data by groups

https://stackoverflow.com/questions/21537872

06-10-2022

Question

Situation:

Here is the data I have:

> head(data1)
  CHROM   POS REF ALT DIFF GT
1 chr01 14653   C   T  254 CT     
2 chr01 14907   A   G  254 AG     
3 chr01 14930   A   G   23 AG     
4 chr01 15190   G   A  260 GA     
5 chr01 15211   T   G   21 TG     
6 chr01 16378   T   C 1167 TC     

> tail(data1)
154176  chrX 154901366   T   A 58700 TA     
154177  chrX 154901404   A   T    38 AT     
154178  chrX 154933406   A   G 32002 AG     
154179  chrX 154933419   A   T    13 AT     
154180  chrX 154933451   T   C    32 TC     
154181  chrX 154933473   G   T    22 GT

CHROM is categorical, with values from chr01 to chr22 plus chrX (23 levels in total).
GT is categorical: a combination of two of A, C, G, T (12 possibilities in total).

What I want to do:

Group POS into bins of width 1e7. I have done this using data1$POSgroup <- floor(data1$POS / 1e7)
Calculate the mean of DIFF for each combination of POSgroup and CHROM, so I should end up with up to #POSgroup * #CHROM mean values in one data set.

The code I have now only computes the mean within each POSgroup (split by GT); it does not group by CHROM.

Code:

datsum <- ddply(data1, .var = "POSgroup", .fun = function(x) {

  # Calculate the mean DIFF value for each GT group in this POSgroup
  meandiff <- ddply(x, .var = "GT", .fun = summarise, ymean = mean(DIFF))

  # Add the center of the POSgroup range as the x position
  meandiff$center <- (x$POSgroup[1] * 1e7) + 0.5e7

  # Return the results
  meandiff

})

Can anyone help me with this?

Solution

Using data.table, this will give you a starting point:

library(data.table)
dt = data.table(data1)

# CHROM is categorical, so group by it directly; only POS is binned by 1e7
dt[, list(ymean = mean(DIFF)), by = list(CHROM, POSgroup = floor(POS / 1e7))]
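
Since the question uses plyr, the same two-way grouping can also be sketched with ddply by passing both grouping columns in .variables. This is a minimal sketch, not a definitive implementation: the toy data frame and its values are made up for illustration, and only the column names (CHROM, POS, DIFF, POSgroup) and the output names (ymean, center) come from the question.

```r
library(plyr)

# Toy data with the same columns as data1 in the question (values are invented)
toy <- data.frame(
  CHROM = c("chr01", "chr01", "chr01", "chr02", "chr02"),
  POS   = c(14653, 14907, 25000000, 5000, 12000000),
  DIFF  = c(254, 254, 100, 10, 30),
  stringsAsFactors = FALSE
)

# Bin positions into windows of 1e7, as done in the question
toy$POSgroup <- floor(toy$POS / 1e7)

# Group by CHROM and POSgroup together; summarise returns one row per group
datsum <- ddply(toy, .(CHROM, POSgroup), summarise,
                ymean  = mean(DIFF),                  # mean DIFF in this group
                center = POSgroup[1] * 1e7 + 0.5e7)   # midpoint of the POS bin

print(datsum)
```

To keep the per-GT split from the original code, GT could be added to the grouping, as in .(CHROM, POSgroup, GT), provided the data frame has that column.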

Licensed under: CC-BY-SA with attribution
