Remove variables with repeated occurrences in 2 levels

https://stackoverflow.com/questions/18828214

28-06-2022

Question

I am trying to automate a filtering step that gets rid of variables that are not useful. So far I have been removing any column in which some value is repeated more than "x" times, using this command:

df <- df[, which(apply(df, 2, function(col) !any(table(col) > x)))]

I am now trying to apply the same thing across 2 levels. Here's what my data looks like:

df <- structure(list(V1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20), V2 = c(2, 2, 2, 2, 2, 2, 2, 
2, 0, 0, 0, 2, 2, 7, 2, 3, 4, 6, 4, 5, 2), V3 = c(0, 0, 0, 0, 
0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1), level = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("V1", 
"V2", "V3", "level"), row.names = c(NA, 21L), class = "data.frame")

I would like to remove any variable that repeats the same value more than x times (5 times in this example) in both levels, A and B. My desired output is

df2 <- structure(list(V1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 
0L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), V2 = c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 0L, 0L, 0L, 2L, 5L, 7L, 2L, 3L, 4L, 
6L, 4L, 5L, 2L), level = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", 
"B"), class = "factor")), .Names = c("V1", "V2", "level"), class = "data.frame", row.names = c(NA, 
-21L))

I have thought of using subset() to split the data by level, running my previous command on each part, and joining them again, but this seems like a very long way around. I cannot think of a proper command to do the job. Any ideas for a shorter command that would do this?

Thanks,

Solution

Use table() on each variable together with the level column to get a two-way table, then use apply() to check whether any row of the resulting table is all TRUE (i.e. whether some value appears more than x times in every level):

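#  Two-way tables (value by level) for each variable: TRUE where a count exceeds 5
lens <- lapply( df[ , -ncol(df) ] , function(x) table( x , df$level ) > 5 )

#  TRUE for columns in which NO value has more than 5 repeats in ALL levels
ind <- sapply( lens , function(x) ! any( apply( x , 1 , all ) ) )

#  Subset; append TRUE so the level column is always kept rather than
#  relying on recycling of the shorter logical index
df <- df[ , c( ind , TRUE ) ]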

head( df )
  V1 V2 level
1  1  2     A
2  2  2     A
3  3  2     A
4  4  2     A
5  5  2     A
6  6  2     A
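
If the threshold or the name of the grouping column varies, the same idea can be wrapped in a small function. This is only a sketch: the name drop_repeated and its arguments group_col and x are made up for illustration and are not part of the original answer. It rebuilds the example data with data.frame() and selects the kept columns by name, so the grouping column does not have to be the last one.

```r
# Hypothetical helper (name and arguments are illustrative assumptions).
# Drops every variable in which some value occurs more than x times
# in every level of the grouping column.
drop_repeated <- function(data, group_col = "level", x = 5) {
  vars <- setdiff(names(data), group_col)
  repeated <- vapply(vars, function(v) {
    # Two-way table: rows are values of the variable, columns are levels
    counts <- table(data[[v]], data[[group_col]])
    # TRUE if any value exceeds x in all levels
    any(apply(counts > x, 1, all))
  }, logical(1))
  data[, c(vars[!repeated], group_col), drop = FALSE]
}

# Same values as the example data in the question
df <- data.frame(
  V1 = c(1:9, 1, 0, 11:20),
  V2 = c(2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 7, 2, 3, 4, 6, 4, 5, 2),
  V3 = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
  level = factor(rep(c("A", "B"), times = c(10, 11)))
)

result <- drop_repeated(df, group_col = "level", x = 5)
names(result)
# [1] "V1"    "V2"    "level"
```

V3 is dropped because the value 0 occurs 6 times in level A and 8 times in level B. V2 is kept because the value 2 occurs 8 times in A but only 4 times in B.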

Licensed under: CC-BY-SA with attribution
