R, filter matrix based on variance cut-offs

Question 1

Assuming you have a matrix (that is, your ID column is actually the row names), this is straightforward.

#  First find the desired quantile breaks for the entire matrix
qt <- quantile( m , probs = c(0.2,0.8) )
# 20%  80% 
#5.17 6.62 
#  Next get a logical vector of the rows that have any values outside these breaks
rows <- apply( m , 1 , function(x) any( x < qt[1] | x > qt[2] ) )
#  Subset on this vector
m[ rows , ]
#            sample1 sample2 sample3 sample4 sample5 sample6
#ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
#ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
#ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
#ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
#ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
#ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
#ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
#ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20

The any( x < qt[1] | x > qt[2] ) part of the function passed to apply (which applies a function over the margins of a matrix; margin 1 means rows) returns TRUE if any value in that row lies outside the 20% and 80% quantiles of the whole matrix. If no value in the row is outside these bounds it returns FALSE, so that row is dropped when we subset in the next line.
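To see the row test in isolation, here is a minimal sketch on a small made-up matrix. The name toy, its values, and the hand-picked cut-offs lo and hi (standing in for qt[1] and qt[2]) are illustrative assumptions, not the OP's data.

```r
# Toy 3 x 3 matrix; values are made up for illustration
toy <- matrix(c(5.5, 6.0, 6.2,
                4.0, 6.0, 6.1,
                5.9, 6.3, 7.5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:3)))
lo <- 5.17   # hand-picked cut-offs, standing in for qt[1] and qt[2]
hi <- 6.62
# TRUE for rows with at least one value outside [lo, hi]
keep <- apply(toy, 1, function(x) any(x < lo | x > hi))
keep
# geneA geneB geneC 
# FALSE  TRUE  TRUE 
toy[keep, , drop = FALSE]
```

geneA is dropped because all of its values sit inside the cut-offs. For large matrices, rowSums(toy < lo | toy > hi) > 0 gives the same logical vector without a per-row function call.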

Question 2

The Bioconductor genefilter package provides common filters relevant to microarray analysis. A typical filter based on row-wise variability would be

library(genefilter)
m = matrix(rnorm(47000 * 6), 47000)
varFilter(m)

The package landing page references vignettes illustrating basic operation and providing diagnostic guidance for use of filtering.

A basic principle in the analysis of microarrays is that values in a row are comparable, but not values between rows. This is because the probes associated with each row have distinct characteristics that introduce row-specific bias -- a value in the first row could reasonably indicate more, less or equal gene expression compared to a value for the same sample in a second row. This means that @Todd's desire to normalize based on between-row comparison (largest and smallest values in the entire matrix) is not recommended. Instead, varFilter calculates a measure of variability of each row (row inter-quartile range) and selects a fraction (the var.cutoff argument) with most variability.

A quick peek at the definition of varFilter shows that in general this is no more tricky than the following, for some measure of row-wise variability var.func and a (single) quantile var.cutoff

vars <- apply(m, 1, var.func)
m[vars > quantile(vars, var.cutoff), ]
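As a self-contained illustration of the two lines above, the following sketch uses base R only on simulated data. The choices var.func = IQR and var.cutoff = 0.5 are assumptions made here for illustration (they match what I understand the varFilter defaults to be), and the matrix size and seed are arbitrary.

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
m <- matrix(rnorm(1000 * 6), 1000)    # simulated data: 1000 rows, 6 samples
var.func <- IQR                       # assumed measure of row-wise variability
var.cutoff <- 0.5                     # assumed quantile: keep the more variable half

vars <- apply(m, 1, var.func)         # one variability score per row
keep <- vars > quantile(vars, var.cutoff)
filtered <- m[keep, , drop = FALSE]   # rows whose score exceeds the cut-off
dim(filtered)
```

Because the cut-off is a quantile of the row scores themselves, the fraction of rows retained is fixed by var.cutoff and no value is ever compared across rows.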

Question 3

I am not a statistician, so I don't know if there is a general method to resolve this. For me the problem is simpler if you reshape your data into long format.

library(reshape2)
dat.m <- melt(dat)
dat.m$value <- as.numeric(dat.m$value)
head(dat.m)
            ID variable value
1 ILMN_1762337  sample1  7.86
2 ILMN_2055271  sample1  5.72
3 ILMN_1736007  sample1  3.82
4 ILMN_2383229  sample1  6.34
5 ILMN_1806310  sample1  6.15
6 ILMN_1653355  sample1  7.01

Then, for each variable (that is, each sample), do the following:

Compute the limits using quantile.
Remove genes that don't satisfy the condition.

You can do this, for example, using ddply from plyr:

library(plyr)
res <- ddply(dat.m,.(variable),function(x){
  ## compute limits for each sample
  z <- x$value
  qq <- quantile(z, probs = c(0.2,0.8))
  ## keep only values outside the limits for this sample
  x[z < qq[1] | z > qq[2],]
})
## return to the wide format
acast(res,ID~variable,value.var="value")

            sample1 sample2 sample3 sample4 sample5 sample6
ILMN_1653355    7.01      NA    6.62      NA    4.77      NA
ILMN_1705025      NA    6.68    6.80    6.85    8.35    4.15
ILMN_1736007    3.82    6.48      NA    7.13    8.20    4.06
ILMN_1762337    7.86      NA    4.89      NA      NA      NA
ILMN_1806310      NA      NA      NA    5.22    4.59      NA
ILMN_1814316      NA      NA      NA      NA      NA    7.20
ILMN_2055271    5.72    4.29    4.64    5.00      NA    8.02
ILMN_2383229      NA    4.34      NA      NA      NA      NA

EDIT after OP clarification: if you want the 20% and 80% cutoff values for the entire matrix, not just for each individual sample, compute qq outside the ddply call:

   qq <- quantile(dat.m$value, probs = c(0.2,0.8))

Then drop (or comment out) the corresponding line inside the function, like this:

res <- ddply(dat.m,.(variable),function(x){
  z <- x$value
  ## keep only values outside the global limits (qq is computed outside ddply)
  x[z < qq[1] | z > qq[2],]
})
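If the data are already in a numeric matrix, the same global-cutoff masking can be sketched in base R without reshaping. This is an alternative to the ddply/acast route, not what that code does internally; the matrix toy and the name masked are made up for illustration.

```r
# Toy matrix; g2 is constant, like the 5.97 rows in the example data
toy <- matrix(c(7.86, 5.05, 4.89,
                5.97, 5.97, 5.97,
                3.82, 6.48, 6.06),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), paste0("sample", 1:3)))
qq <- quantile(toy, probs = c(0.2, 0.8))    # cut-offs over the whole matrix
masked <- toy
masked[toy >= qq[1] & toy <= qq[2]] <- NA   # blank out values inside the limits
# drop rows where nothing survived the filter
masked <- masked[rowSums(!is.na(masked)) > 0, , drop = FALSE]
masked
```

Here g2 lies entirely inside the limits, so it is removed, while g1 and g3 keep only their extreme values and show NA elsewhere, which mirrors the wide-format output of acast.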

PS: here dat is:

dat <- read.table(text='         ID    sample1 sample2 sample3 sample4 sample5 sample6
ILMN_1762337    7.86    5.05    4.89    5.74    6.78    6.41
ILMN_2055271    5.72    4.29    4.64    5.00    6.30    8.02
ILMN_1736007    3.82    6.48    6.06    7.13    8.20    4.06
ILMN_2383229    6.34    4.34    6.12    6.83    4.82    5.57
ILMN_1806310    6.15    6.37    5.54    5.22    4.59    6.28
ILMN_1653355    7.01    4.73    6.62    6.27    4.77    6.12
ILMN_1705025    6.09    6.68    6.80    6.85    8.35    4.15
ILMN_1814316    5.77    5.17    5.94    6.51    7.12    7.20
ILMN_1814317    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814318    5.97    5.97    5.97    5.97    5.97    5.97
ILMN_1814319    5.97    5.97    5.97    5.97    5.97    5.97',header=TRUE)