R select data of each row on basis of the row quantiles

https://stackoverflow.com//questions/20052195

26-12-2019
|

Question

I have a data set with 60 rows and 3036 columns. I have already calculated the row quantiles with the function rowQuantiles from the matrixStats package. From this I got a column vector [60,1]. Now, I want to select from each row only the data which is higher than the row quantile. If I use the which function as following:

datset_qu95 = which(dataset > rowQuantiles(dataset, probs=c(0.95))

then I loose the data dimensions and get only an array instead of a matrix with the following dimensions [60,152].

Can somebody help me?

Thank you!

Solution

I don't think a rowQuantile function is needed. Just pick out the highest values up to a probability threshold: (edit note (first version had incorrect index expression)

> apply( dat, 1, function(x) x[order(x)][1:( (1-0.95)*ncol(dat))])
    obs1     obs2     obs3 
 11.5379 856.3470 136.8860

And as always, because R matrices are column oriented, you will probably want to use t() on the result to get this back into the row orientation you expect.

To your comment: Fixed it so it picks up the highest values rather than the lowest values:

 apply( dat, 1, function(x)
                  x[order(x, decreasing=TRUE)][1:( (1-0.95)*ncol(dat))])

OTHER TIPS

0.05 * 3036 = 151.8 but selecting in each row the values greater than the 95% quantile does not mean that you will have systematically 152 values. If you want to keep your object dimensions you can try to replace undesired values with NA's.
As your object is not huge you could also work with data frame objects and have your observations along the row dimension.

library(matrixStats)

# To extract your values...
myfun <- function(k, q){x[k, x[k,] > q]}
x <- matrix(sample(1:100, 60*3036, replace=TRUE), ncol=3036)
xx <- mapply(myfun, seq(along=x[,1]), rowQuantiles(x, probs=.95))
# xx is a list, xx[[1]] contains the values of x[1,] > quantile(x[1, ], .95)

# The number of selected values depends on their distribution - with NORM should be stable
x11() ; par(mfrow=c(2,1))
hist(sample(1:100, 60*3036, replace=TRUE)) # UNIF DISTRIB
n.val <- sapply(xx, length)
hist(n.val, xlab="n.val > q_95%")
abline(v=152, col="red", lwd=5)

# Assuming you want the same number of value for each row
n <- min(n.val)
myfun <- function(x){sample(x, n)} # Representative sample - Ordering is possible but introduce bias. Depends on your goals
xx <- t(sapply(xx, myfun))
dim(xx) # 60 n

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow