Subtract a large number of specified rows from a dataset in R

https://stackoverflow.com/questions/21807936

r
genome

12-10-2022

Question

I have two very large lists of genes, A and B. A has two columns, GeneID and p-value, while B has only one column, GeneID. There are approximately 100,000 genes in B, and these are a subset of the approximately 700,000 genes in A:

GeneListA
GeneID    p.value
41931     0.0210
41931     0.0003
5310612   0.3161
5310612   0.7089
5310612   0.0021
98317     0.1139
98317     0.0009
215688    0.0031
215688    0.0008

GeneListB 
GeneID
41931
41931
215688
215688

Desired GeneListC
5310612   0.3161
5310612   0.7089
5310612   0.0021
98317     0.1139
98317     0.0009

I do not want the genes in B to show up in A anymore. How do I get rid of them while still keeping my p-values in A? I tried three different methods so far:

1. I dropped the p-value column so that both lists contain only Entrez Gene IDs, then ran: new<-A[setdiff(rownames(A),rownames(B)),]. This returned a completely different set of genes than expected: a seemingly random mixture of genes from A and B rather than A-B.
2. I also tried: new<-A[!apply(A,1,FUN=function(y){any(apply(B,1,FUN=function(x){all(x==y)}))}),]
3. Finally, I tried to merge by EntrezGeneID, but that did not work either.

I'm getting destroyed by this, so any help would be appreciated.

Solution

You can subset the data frame using the %in% operator.

GeneListA[!GeneListA$GeneID %in% GeneListB$GeneID, ]

Combined with !, the statement reads: give me all rows of GeneListA whose GeneID is not among the GeneID values in GeneListB.
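
To check the approach on a small case, here is a minimal sketch that rebuilds the sample data from the question as two data frames and applies the same subset. The variable names in_b and GeneListC are only illustrative, and the real data would presumably be read from files rather than typed in.

```r
# Sample data copied from the question
GeneListA <- data.frame(
  GeneID  = c(41931, 41931, 5310612, 5310612, 5310612,
              98317, 98317, 215688, 215688),
  p.value = c(0.0210, 0.0003, 0.3161, 0.7089, 0.0021,
              0.1139, 0.0009, 0.0031, 0.0008)
)
GeneListB <- data.frame(GeneID = c(41931, 41931, 215688, 215688))

# in_b (illustrative name): TRUE for each row of A whose GeneID occurs in B
in_b <- GeneListA$GeneID %in% GeneListB$GeneID

# Keep the rows that are NOT in B; the empty slot after the comma keeps all columns
GeneListC <- GeneListA[!in_b, ]
print(GeneListC)
```

Because %in% binds more tightly than ! in R, the one-line form !GeneListA$GeneID %in% GeneListB$GeneID is evaluated as !(GeneListA$GeneID %in% GeneListB$GeneID), so no extra parentheses are needed. Duplicate GeneIDs in either list are fine, since each row of A is tested independently.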

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow