Removing columns with similar variance

https://stackoverflow.com/questions/18253483

24-06-2022

Question

I have a data frame of 3500 x 4000. I am trying to write a professional command in R to remove any columns that show the same variance. I am able to do this with a long, complicated command such as

datavar <- apply(data, 2, var)
datavar <- datavar[!duplicated(datavar)]

then assemble my data by matching the remaining column names, but this is SAD! I was hoping to do this in a single go. I was thinking of something like

data <- data[, which(apply(data, 2, function(col) !any(var(data) = any(var(data)) )))]

I know the last part of the above command is nonsense, but I also know there is some way it can be done in some... smart command!

Here's some data that applies to the question

data <- structure(list(V1 = c(3, 213, 1, 135, 5, 2323, 1231, 351, 1, 
33, 2, 213, 153, 132, 1321, 53, 1, 231, 351, 3135, 13), V2 = c(1, 
1, 1, 2, 3, 5, 13, 33, 53, 132, 135, 153, 213, 213, 231, 351, 
351, 1231, 1321, 2323, 3135), V3 = c(65, 41, 1, 53132, 1, 6451, 
3241, 561, 321, 534, 31, 135, 1, 1351, 31, 351, 31, 31, 3212, 
3132, 1), V4 = c(2, 2, 5, 4654, 5641, 21, 21, 1, 1, 465, 31, 
4, 651, 35153, 13, 132, 123, 1231, 321, 321, 5), V5 = c(23, 13, 
213, 135, 15341, 564, 564, 8, 464, 8, 484, 6546, 132, 165, 123, 
135, 132, 132, 123, 123, 2), V6 = c(2, 1, 84, 86468, 464, 18, 
45, 55, 2, 5, 12, 4512, 5, 123, 132465, 12, 456, 15, 45, 123213, 
12), V7 = c(1, 2, 2, 5, 5, 12, 12, 12, 15, 18, 45, 45, 55, 84, 
123, 456, 464, 4512, 86468, 123213, 132465)), .Names = c("V1", 
"V2", "V3", "V4", "V5", "V6", "V7"), row.names = c(NA, 21L), class = "data.frame")

Would I be able to keep one of the "similar variance" columns too?

Thanks,

Solution 2

This is pretty similar to what you've come up with:

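# variance of each column (a list with one element per column)
vars <- lapply(data, var)
# keep only the columns whose variance matches no other column's variance
data[, which(sapply(seq_along(vars), function(x) !vars[x] %in% vars[-x]))]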

One thing to think about though is whether you want to match variances exactly (as in this example) or just variances that are close. The latter would be a significantly more challenging problem.
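For the "close" case, one possible approach is to sort the variances and treat a column as a duplicate when its variance is within a relative tolerance of the previous one in sorted order. This is only a rough sketch: the helper name `drop_close_variance` and the default `tol` are made up for illustration, and because neighbours are compared pairwise, a chain of nearly equal variances collapses into a single group even if its two ends differ by more than `tol`.

```r
# Hypothetical helper (not from the answer above): keep one column per group
# of "close" variances. `tol` is an assumed relative tolerance.
drop_close_variance <- function(df, tol = 1e-8) {
  v <- sapply(df, var)
  ord <- order(v)                        # column indices sorted by variance
  sorted <- v[ord]
  # relative gap between each sorted variance and the one before it
  gap <- diff(sorted) / pmax(abs(sorted[-1]), .Machine$double.eps)
  is_dup <- c(FALSE, gap < tol)          # close to its predecessor => duplicate
  keep <- sort(ord[!is_dup])             # restore the original column order
  df[, keep, drop = FALSE]
}

# Toy data: columns a and b hold the same values in a different order
df <- data.frame(a = c(1, 2, 3), b = c(3, 1, 2),
                 c = c(1, 5, 9), d = c(10, 20, 30))
result <- drop_close_variance(df)
names(result)
```

On this toy data the call keeps `a`, `c` and `d`. A larger `tol` merges more columns, so the right value depends on how noisy your variances are.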

Other tips

I might go a more cautious route, like

data[, !duplicated(round(sapply(data,var),your_precision_here))]
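To make the placeholder concrete, here is a minimal sketch on a toy data frame. The data frame `df` and the digit counts (6 and 10) are arbitrary choices for illustration, not recommended values. The second variant swaps in `signif()`, which keeps a number of significant digits rather than decimal places; that may be a better fit when the variances span several orders of magnitude.

```r
# Toy data: a and b contain the same values in a different order
df <- data.frame(a = c(1, 2, 3), b = c(3, 1, 2), c = c(1, 5, 9))
v <- sapply(df, var)                     # named vector of column variances

# round(): compare variances after rounding to 6 decimal places
by_round <- df[, !duplicated(round(v, 6)), drop = FALSE]

# signif(): compare variances after rounding to 10 significant digits
by_signif <- df[, !duplicated(signif(v, 10)), drop = FALSE]

names(by_round)
names(by_signif)
```

Both variants keep `a` and `c` here. Keep in mind that any rounding scheme can still separate two nearly equal values that happen to fall on either side of a rounding boundary.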

... or, as an alternative:

data[ , !c(duplicated(apply(data, 2, var)) | duplicated(apply(data, 2, var), fromLast=TRUE))]

...but also not shorter :)
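The two one-liners above do different things, which matters if you want to keep one of the matching columns: `!duplicated(...)` keeps the first column of each group, while combining it with `fromLast = TRUE` drops every column whose variance occurs more than once. A small sketch on made-up data (the data frame `df` is only for illustration):

```r
# Toy data: a and b contain the same values in a different order
df <- data.frame(a = c(1, 2, 3), b = c(3, 1, 2), c = c(1, 5, 9))
v <- sapply(df, var)                     # compute the variances once

# keep the first column of each group of equal variances
keep_first <- df[, !duplicated(v), drop = FALSE]

# drop every column whose variance is shared with another column
drop_all <- df[, !(duplicated(v) | duplicated(v, fromLast = TRUE)), drop = FALSE]

names(keep_first)  # "a" "c"
names(drop_all)    # "c"
```

`drop = FALSE` is there so the result stays a data frame even when only one column survives.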

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow