質問

I have a matrix with a column of different genes and a corresponding column of the -log(P-values) for each possible SNP for every gene.

So the matrix has 3 columns: Gene_lable, SNP and minus_logpval. I'm trying to write a code that identifies the SNP with the highest -log(P-value) for each gene. Here's the head(data):

  SNP           Gene_label           minus_logpval
1 rs3934834 HES4/ENSG00000188290       14.1031
2 rs3766193 HES4/ENSG00000188290        7.0203
3 rs3766192 HES4/ENSG00000188290       10.7420
4 rs3766191 HES4/ENSG00000188290       10.4323
5 rs9442371 HES4/ENSG00000188290       10.2941
6 rs9442372 HES4/ENSG00000188290        8.4235

This is the start of the code:

for(i in 1:254360) {
max_pval = 0
if(data$Gene_label[i]==data$Gene_label[i+1]) {
    x = array(NA, dim=c(0,2));
    x[i] = data$minus_logpval[i];
    x[i+1] = data$minus_logpval[i+1];
    temp = max(x);
    if (temp>max_pval) {
    max_pval=temp
    line = i
    }

But for some reason, R keeps giving me the error: Error in is.ordered(x) : argument "x" is missing, with no default. I didn't even use the is-ordered(x) function... I think the error's in the way I initialized x (which should be an array) but I don't know how to fix it.

役に立ちましたか?

解決

A perfect use for ddply from plyr. Split the data.frame up into subsets (by Gene_label ) and operate on each piece (find the snp which relates to the max of minus_logpval ):

##  Reproducible example data
set.seed(1234)
df <- data.frame( Gene_label = rep( letters[1:3] , 3 ) , snp = rep( letters[5:7] , each = 3 ) , minus_logpval = rnorm(9) )
df
#  Gene_label snp minus_logpval
#1          a   e    -1.2070657
#2          b   e     0.2774292
#3          c   e     1.0844412
#4          a   f    -2.3456977
#5          b   f     0.4291247
#6          c   f     0.5060559
#7          a   g    -0.5747400
#8          b   g    -0.5466319
#9          c   g    -0.5644520

##  And a single line using 'ddply'
require(plyr)
ddply( df , .(Gene_label) , summarise , SNP = snp[which.max(minus_logpval)] )
#  Gene_label SNP
#1          a   g
#2          b   f
#3          c   e

他のヒント

You can try it withou a loop using a tapply

tab <- expand.grid(gene=letters[1:2], SNP=LETTERS[1:3])
tab$minus_logpval <- abs(rnorm(6))*-1
tab <- tab[do.call("order", tab),]
tab$SNP <- as.character(tab$SNP)
with(tab, tapply(minus_logpval, gene, function(x) SNP[which.max(x)]))

HTH

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top