R: subsetting data based on whether a condition is met by a specific number of columns

https://stackoverflow.com/questions/23136897

05-07-2023

Question

I have a dataframe of log2(expression-values) of genes having dimension:

>dim(vst.df)
34215 rows and 64 cols

The 64 columns refer to 22 controls and 42 cases. Rows refer to 34215 genes.

The dataframe looks like this:

>head(vst.df)[,1:5]
                        sam1      sam2      sam3      sam4      sam5
 ENSG00000000003.10 8.246215  8.671092  8.529269  8.621316  8.415544
 ENSG00000000005.5  5.187977  6.323024  6.022986  5.376513  4.810042
 ENSG00000000419.8  9.654394 10.130017 10.495403 10.209688 10.137285
 ENSG00000000457.9  8.637566  8.604159  8.681583  8.668491  8.874946
 ENSG00000000460.12 7.071433  7.302448  7.499133  7.441582  7.439453
 ENSG00000000938.8  8.713285  8.584996  8.982816  9.787420  8.823927

The colnames are sampleNames (from sam1...sam64) and rownames are geneIDs. Which sampleNames are cases and which are controls, is given by:

 >head(pData)
 sample_name status  
        sam1   case   
        sam2 contrl  
        sam3 contrl    
        sam4   case  
        sam5   case

The minimum value in the dataframe vst.df is:

 >min(vst.df)
 4.10438

I need to filter the dataframe vst.df such that EITHER 80% or more of all controls have values >4.10438 OR 80% or more of all cases have values >4.10438, per gene.

My approach:

#separate the controls and cases in different dataframes
vst.controls <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="contrl"),1]))]
vst.cases    <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="case"),1]))]

#80% of controls is approx. 18
#if 80% or more controls have a value >4.10438, then the rowSums must be > round(4.10438*18)=74
vst.controls <- vst.controls[which(rowSums(vst.controls)>74),]

#similarly for cases
#80% of cases is approx. 34
#if 80% or more cases have a value >4.10438, then the rowSums must be > round(4.10438*34)=140
vst.cases <- vst.cases[which(rowSums(vst.cases)>140),]

I know my approach is incorrect; I just wanted to show that I tried something before posting a question here. How do I go about solving this?

UPDATE 1: I am showing rows from the controls' dataframe because it is smaller than cases'.

#row where 12 columns (<18 columns) meet the condition
vst.controls[6144,]

                    C00060  C00079   C00135  C00150   C00154  C00176  C00182   P01121  P01160  P01165   P01183   P01200   P01202  P01215   P01226   P01248   P01259
ENSG00000129824.11 4.10438 4.10438 4.903374 4.10438 5.051641 4.10438 4.10438 12.64946 4.10438 4.10438 12.14679 12.45381 12.36571 4.10438 12.05378 12.37071 12.22021
                    P01270   P01273  P01277   P01294   P01325
ENSG00000129824.11 4.10438 12.30081 4.10438 13.38687 12.07337

#row where 20 columns (>18 columns) meet the condition
vst.controls[94,]
                   C00060   C00079  C00135   C00150   C00154   C00176  C00182   P01121  P01160  P01165   P01183   P01200   P01202  P01215   P01226   P01248   P01259
ENSG00000005421.4 4.10438 5.439795 5.25585 6.207467 4.810042 5.459054 5.83844 5.573587 4.93365 4.10438 5.660449 5.075977 5.367907 4.74712 5.016934 5.350099 5.098586
                    P01270  P01273   P01277   P01294  P01325
ENSG00000005421.4 5.719316 4.80001 5.431398 5.553477 4.76463

UPDATE 2:

When I use this:

class(vst.controls)
[1] "data.frame"

class(vst.controls[1644,])
[1] "data.frame"

class(vst.controls[94,])
[1] "data.frame"

rowMeans(vst.controls[1644,] > 4.10438) #it returns me the below
ENSG00000084774.9 
                1 

rowMeans(vst.controls[94,] > 4.10438) #it returns me the below
ENSG00000005421.4 
                1

Thanks,

Solution

One way is to count, per row, how many columns have a value above the threshold. You can do that with rowSums(vst.controls > 4.10438), assuming vst.controls has no columns other than the data used for subsetting (i.e. it has exactly 22 columns). The condition then becomes whether the number of TRUEs is at least 80% of the total number of columns. For vst.controls that condition is

which(rowSums(vst.controls > 4.10438) >= 18)    ## 80% of 22 is 17.6, so at least 18 columns
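To make the counting step concrete, here is a minimal sketch on made-up data. The data frame `toy`, its values, and the names `threshold`, `n.above` and `keep` are hypothetical stand-ins rather than anything from the question; the cutoff is derived from ncol() instead of being typed in by hand.

```r
# Hypothetical stand-in for vst.controls: 3 genes x 5 samples
toy <- data.frame(
  s1 = c(4.1, 6.0, 4.1),
  s2 = c(5.2, 6.1, 4.1),
  s3 = c(5.3, 6.2, 4.1),
  s4 = c(5.4, 6.3, 7.0),
  s5 = c(5.5, 6.4, 4.1),
  row.names = c("geneA", "geneB", "geneC")
)
threshold <- 4.1

# Comparing a data frame with a scalar yields a logical matrix;
# rowSums() then counts the TRUEs in each row.
n.above <- rowSums(toy > threshold)

# Keep rows where at least 80% of the columns exceed the threshold
keep <- which(n.above >= 0.8 * ncol(toy))
```

Here geneA passes in 4 of 5 columns, geneB in all 5, and geneC in only 1, so `keep` contains geneA and geneB.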

Or even better, use rowMeans to get the fraction of columns that pass directly (regardless of the number of columns). Using >= matches your "80% or more" requirement:

valid.controls <- which(rowMeans(vst.controls > 4.10438) >= 0.8)
valid.cases <- which(rowMeans(vst.cases > 4.10438) >= 0.8)
valid <- union(valid.controls, valid.cases)

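That will give you a vector of row indices that satisfy your condition. You can use it to subset vst.df, provided vst.controls and vst.cases still have the same rows in the same order as vst.df (i.e. they have not already been filtered).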
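Putting it together, the following is a sketch on toy data, not a definitive implementation. All values are made up, and `floor.val`, `is.control`, `is.case`, `keep` and `vst.filtered` are hypothetical names. It also takes the threshold from min(vst.df) rather than from the typed literal 4.10438: the console prints a rounded number, so the stored minimum may be slightly larger than the literal. That is one possible reason for rowMeans returning 1 in UPDATE 2 for a row that visibly contains 4.10438, although this is an assumption about the data.

```r
# Hypothetical toy data: 4 genes x 10 samples (5 controls, 5 cases).
# floor.val mimics a stored minimum that prints as 4.10438.
floor.val <- 4.1043801
vst.df <- data.frame(
  matrix(floor.val, nrow = 4, ncol = 10,
         dimnames = list(paste0("gene", 1:4), paste0("sam", 1:10)))
)
pData <- data.frame(
  sample_name = paste0("sam", 1:10),
  status = rep(c("contrl", "case"), each = 5),
  stringsAsFactors = FALSE
)
vst.df["gene1", ] <- 8        # above the floor in every sample
vst.df["gene2", 1:4] <- 7     # above the floor in 4 of 5 controls only
vst.df["gene3", 6:8] <- 7     # above the floor in 3 of 5 cases only
                              # gene4 stays at the floor everywhere

# Use the stored minimum, not the rounded number copied from the console
threshold <- min(vst.df)

# Logical column selectors for the two groups
is.control <- colnames(vst.df) %in% pData$sample_name[pData$status == "contrl"]
is.case    <- colnames(vst.df) %in% pData$sample_name[pData$status == "case"]

# Fraction of samples per gene that exceed the threshold, by group
frac.controls <- rowMeans(vst.df[, is.control] > threshold)
frac.cases    <- rowMeans(vst.df[, is.case] > threshold)

# Keep a gene if either group reaches 80%
keep <- frac.controls >= 0.8 | frac.cases >= 0.8
vst.filtered <- vst.df[keep, ]
```

In this toy case gene1 and gene2 are kept. Comparing against the literal 4.10438 instead would mark every value as above the threshold, because 4.1043801 > 4.10438.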

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow