I have a dataframe of log2 expression values for genes, with dimensions:
>dim(vst.df)
34215 rows and 64 cols
The 64 columns refer to 22 controls and 42 cases. Rows refer to 34215 genes.
The dataframe looks like this:
>head(vst.df)[,1:5]
sam1 sam2 sam3 sam4 sam5
ENSG00000000003.10 8.246215 8.671092 8.529269 8.621316 8.415544
ENSG00000000005.5 5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000419.8 9.654394 10.130017 10.495403 10.209688 10.137285
ENSG00000000457.9 8.637566 8.604159 8.681583 8.668491 8.874946
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8 8.713285 8.584996 8.982816 9.787420 8.823927
The column names are sample names (sam1 ... sam64) and the row names are gene IDs. Which samples are cases and which are controls is given by pData:
>head(pData)
sample_name status
sam1 case
sam2 contrl
sam3 contrl
sam4 case
sam5 case
The minimum value in the dataframe vst.df is:
>min(vst.df)
4.10438
I need to filter the dataframe vst.df per gene, keeping a gene if EITHER 80% or more of all controls have values >4.10438 OR 80% or more of all cases have values >4.10438.
My approach:
#separate the controls and cases in different dataframes
vst.controls <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="contrl"),1]))]
vst.cases <- vst.df[,which(colnames(vst.df) %in% as.character(pData[which(pData$status=="case"),1]))]
#80% of controls is approx. 18
#if 80% or more controls have a value >4.10438, then the rowSums must be > round(4.10438*18)=74
vst.controls <- vst.controls[which(rowSums(vst.controls)>74),]
#similarly for cases
#80% of cases is approx. 34
#if 80% or more cases have a value >4.10438, then the rowSums must be > round(4.10438*34)=140
vst.cases <- vst.cases[which(rowSums(vst.cases)>140),]
I know my approach is incorrect; I just wanted to show that I tried something before posting a question here. How do I go about solving this?
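To make the problem with the sum-based filter concrete: a control row in which all 22 values sit at the minimum still sums to 22 × 4.10438 ≈ 90.3, which is above 74, so the rowSums filter cannot remove anything. What the requirement describes is a per-gene proportion of samples above the threshold. The following is only a minimal sketch of that idea on toy data; vst.toy, pData.toy and the gene names are hypothetical stand-ins for vst.df and pData, not the real data.

```r
# Toy stand-ins for vst.df and pData (hypothetical values, not the real data)
thr <- 4.10438  # on the real data one might use thr <- min(vst.df) instead

vst.toy <- data.frame(
  sam1 = c(8.2, thr, 11.9),
  sam2 = c(8.6, thr, 12.1),
  sam3 = c(8.5, 5.0, thr),
  sam4 = c(8.6, thr, 12.4),
  sam5 = c(8.4, thr, 12.3),
  row.names = c("geneA", "geneB", "geneC")
)
pData.toy <- data.frame(
  sample_name = paste0("sam", 1:5),
  status = c("case", "contrl", "contrl", "case", "case"),
  stringsAsFactors = FALSE
)

# Sample names belonging to each group
ctrl.ids <- pData.toy$sample_name[pData.toy$status == "contrl"]
case.ids <- pData.toy$sample_name[pData.toy$status == "case"]

# Per gene: fraction of samples in each group with a value above the threshold
frac.ctrl <- rowMeans(vst.toy[, ctrl.ids, drop = FALSE] > thr)
frac.case <- rowMeans(vst.toy[, case.ids, drop = FALSE] > thr)

# Keep a gene if at least 80% of controls OR at least 80% of cases pass
keep <- frac.ctrl >= 0.8 | frac.case >= 0.8
vst.filtered <- vst.toy[keep, , drop = FALSE]
```

In this toy example geneA passes in both groups, geneC passes only through the cases (all 3 cases are above the threshold, but just 1 of 2 controls), and geneB is dropped.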
UPDATE 1: I am showing rows from the controls' dataframe because it is smaller than cases'.
#row where 12 columns (<18 columns) meet the condition
vst.controls[6144,]
C00060 C00079 C00135 C00150 C00154 C00176 C00182 P01121 P01160 P01165 P01183 P01200 P01202 P01215 P01226 P01248 P01259
ENSG00000129824.11 4.10438 4.10438 4.903374 4.10438 5.051641 4.10438 4.10438 12.64946 4.10438 4.10438 12.14679 12.45381 12.36571 4.10438 12.05378 12.37071 12.22021
P01270 P01273 P01277 P01294 P01325
ENSG00000129824.11 4.10438 12.30081 4.10438 13.38687 12.07337
#row where 20 columns (>18 columns) meet the condition
vst.controls[94,]
C00060 C00079 C00135 C00150 C00154 C00176 C00182 P01121 P01160 P01165 P01183 P01200 P01202 P01215 P01226 P01248 P01259
ENSG00000005421.4 4.10438 5.439795 5.25585 6.207467 4.810042 5.459054 5.83844 5.573587 4.93365 4.10438 5.660449 5.075977 5.367907 4.74712 5.016934 5.350099 5.098586
P01270 P01273 P01277 P01294 P01325
ENSG00000005421.4 5.719316 4.80001 5.431398 5.553477 4.76463
UPDATE 2:
When I use this:
class(vst.controls)
[1] "data.frame"
class(vst.controls[1644,])
[1] "data.frame"
class(vst.controls[94,])
[1] "data.frame"
rowMeans(vst.controls[1644,] > 4.10438) # returns the following
ENSG00000084774.9
1
rowMeans(vst.controls[94,] > 4.10438) # returns the following
ENSG00000005421.4
1
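One possible explanation for getting 1 on row 94, where two columns print as 4.10438, is display rounding: R prints 7 significant digits by default, so the stored minimum may be slightly larger than the typed literal 4.10438. This is an assumption I have not verified on the real data. The sketch below uses a made-up stored value (4.1043801) only to illustrate the effect.

```r
# Hypothetical stored minimum; the real value in vst.df is unknown
x <- 4.1043801

# With 7 significant digits it is displayed as 4.10438
shown <- format(x, digits = 7)

# Comparing against the typed literal treats the minimum as "above threshold"
above.literal <- x > 4.10438   # TRUE

# Comparing against the stored value behaves as intended
thr <- x
above.stored <- x > thr        # FALSE

# Effect on a toy row: two entries at the minimum, two above it
row.toy <- c(x, 5.439795, 5.25585, x)
frac.literal <- mean(row.toy > 4.10438)  # 1
frac.stored  <- mean(row.toy > thr)      # 0.5
```

If this is what is happening, taking the threshold from the data (for example thr <- min(vst.df)) rather than typing the printed value would avoid the mismatch.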
Thanks,