Question

I have 2 problems:

  1. Design an R function that receives the parameters N (integer) and L (list of character strings) and dynamically constructs a predicate with N terms joined by OR operators. For instance, if N=2, the predicate would be data.clean.test[j,c(L[[1]])] == TRUE OR data.clean.test[j,c(L[[2]])] == TRUE; if N=3, it would be data.clean.test[j,c(L[[1]])] == TRUE OR data.clean.test[j,c(L[[2]])] == TRUE OR data.clean.test[j,c(L[[3]])] == TRUE, and so on.

  2. Select the top N results from an unsorted list of decimals (probabilities between 0 and 1).

Any ideas? This is not homework but a real predictive analysis use case.

Solution

Assuming your data looks somewhat like this

set.seed(104)
dd <- data.frame(
  a = sample(c(TRUE, FALSE), 25, replace = TRUE),
  b = sample(c(TRUE, FALSE), 25, replace = TRUE),
  c = sample(c(TRUE, FALSE), 25, replace = TRUE),
  d = sample(c(TRUE, FALSE), 25, replace = TRUE),
  prob = runif(25)
)

collist<-list("a","c","b")

then a function that would do what you want in part one is

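myfun<-function(N) {
    # TRUE for each row where any of the first N columns named in collist is TRUE
    rowmatches <- apply(as.matrix(dd[, unlist(collist[1:N])]), 1, any)
    # keep only the matching rows
    dd[rowmatches, ]
}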

There is no need to dynamically build a predicate list. Here we just extract the columns you are asking for from the data.frame and turn them into a matrix. Then we use apply to scan across the values in each row to see if any are TRUE. Then we return the rows that match. So

myfun(1)
# nrow(myfun(1)) == sum(dd$a==T)
# TRUE

returns all the rows where column a is true. And

myfun(2)
# nrow(myfun(2)) == sum(dd$a==T | dd$c==T)
# TRUE

returns all rows where column "a" or "c" is true.
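
The question asked for a function that takes both N and the list of column names as parameters. A minimal sketch of that variant is below. The name any_true_rows, its argument names, and the toy data are hypothetical, and it uses Reduce() with `|` to fold the selected columns into one row-wise OR, which is a different technique from the apply() call above but should select the same rows when the columns are logical and contain no NA values.

```r
# Hypothetical helper: keep the rows of `data` where any of the
# first n columns named in `cols` is TRUE.
any_true_rows <- function(data, cols, n) {
  stopifnot(n >= 1, n <= length(cols))
  picked <- unlist(cols[seq_len(n)])
  # Fold the selected logical columns together with element-wise OR.
  keep <- Reduce(`|`, data[picked])
  data[keep, , drop = FALSE]
}

# Small made-up example data (not the dd defined above).
toy <- data.frame(
  a = c(TRUE, FALSE, FALSE, FALSE),
  b = c(FALSE, FALSE, TRUE, FALSE),
  c = c(FALSE, TRUE, FALSE, FALSE)
)
cols <- list("a", "c", "b")

any_true_rows(toy, cols, 1)  # row 1 only (a is TRUE)
any_true_rows(toy, cols, 2)  # rows 1 and 2 (a or c is TRUE)
any_true_rows(toy, cols, 3)  # rows 1, 2 and 3 (a, c or b is TRUE)
```

Passing the data and the column list as arguments avoids relying on global objects, so the same function can be reused on other data frames.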

Then, if you want to grab the top values in the list, you can do something like

result<-myfun(2)
head(result[order(result$prob),], 3)
#       a    b     c     d       prob
#15 FALSE TRUE  TRUE FALSE 0.08670653
#14  TRUE TRUE FALSE FALSE 0.12188057
#16  TRUE TRUE  TRUE  TRUE 0.13206675

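where order() sorts the data.frame by prob and head() extracts a certain number of rows (in this case 3). Note that order() sorts in ascending order by default, so these are the three lowest probabilities; to get the highest ones instead, pass decreasing=TRUE to order().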
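
If the input for the second part is a bare, unsorted vector of probabilities rather than a data.frame column, a minimal sketch could look like the following. The function name top_n and the sample values are hypothetical; it assumes "top" means the largest values.

```r
# Hypothetical helper: return the n largest values of a numeric vector.
top_n <- function(p, n) {
  head(sort(p, decreasing = TRUE), n)
}

probs <- c(0.42, 0.07, 0.91, 0.33, 0.68)  # made-up probabilities

top_n(probs, 3)
# [1] 0.91 0.68 0.42

# Positions of those values in the original vector:
head(order(probs, decreasing = TRUE), 3)
# [1] 3 5 1
```

sort() returns the values themselves, while order() returns their positions, which is useful when the probabilities need to be matched back to other data.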

OTHER TIPS

Perhaps ... guessing that data.clean.test is a function rather than a data object:

any( sapply( L , data.clean.test, j) )

Or if that guess is wrong and "j" is a constant in your workspace, then:

any( sapply( L, function(x) data.clean.test[ j, x] ) )

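The any function tests whether at least one value is TRUE (or can be coerced to TRUE). This means any will return TRUE for logical TRUE and also for numeric values not equal to 0 (non-integer numbers are coerced with a warning). A comparison with "== TRUE" is stricter for numeric values, since TRUE is converted to 1 and only values equal to 1 match.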
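
As a rough illustration of the data-object variant, restricted to the first N entries of L as in the question, here is a sketch on made-up data. The contents of data.clean.test and L below are stand-ins, and row_has_true is a hypothetical wrapper name.

```r
# Made-up stand-ins for the objects named in the question.
data.clean.test <- data.frame(
  x = c(TRUE, FALSE),
  y = c(FALSE, FALSE),
  z = c(FALSE, TRUE)
)
L <- list("x", "y", "z")

# Hypothetical wrapper: is any of the first n columns in L TRUE in row j?
row_has_true <- function(j, n) {
  any(sapply(L[seq_len(n)], function(col) data.clean.test[j, col]))
}

row_has_true(1, 2)  # TRUE:  x is TRUE in row 1
row_has_true(2, 2)  # FALSE: x and y are both FALSE in row 2
row_has_true(2, 3)  # TRUE:  z is TRUE in row 2
```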

Licensed under: CC-BY-SA with attribution