Question

This follows on from a question that I asked the other day (in hindsight, I should have asked both at the same time).

Data

token.dt is a list of data tables, one for each n in n-grams. The table for a given n holds the n-grams (i.e., sequences of n words) and their scores.

> head(token.dt[[2]])
   V1  V2       mi2
1:  0   0  6.494179
2:  0 001 13.249067
3:  0 002 13.249067
4:  0 005 13.249067
5:  0 025 13.249067
6:  0 039 13.249067

> head(token.dt[[5]])
   V1  V2 V3       V4    V5       mi5
1:  0   0  1        0     1 10.353265
2:  0 001 in    apart   for  6.807743
3:  0 001 in    thick   and  5.190449
4:  0 002 on     each  side 11.688710
5:  0 005  m       in     f  9.940322
6:  0 025 in aluminum which  8.249075

Task

The task is to select the n-grams (i.e., rows of the tables in token.dt) that satisfy the following condition. An n-gram is retained only if its score is higher than the scores of the n-1 grams and the n+1 grams identified as follows:

  • the n-1 grams that match the first n-1 words of the n-gram and
  • the n+1 grams whose first n words match the n-gram.

By way of example, consider the following.

> for (i in 2:n) setkeyv(token.dt[[i]], paste0("V", 1:i))
> token.dt[[2]][J("0", "1")]
   V1 V2      mi2
1:  0  1 7.135725

> token.dt[[3]][J("0", "1")]
   V1 V2        V3       mi3
1:  0  1         0  9.803035
2:  0  1         2  6.809646
3:  0  1         f  6.142258
4:  0  1         m  7.315181
5:  0  1 milligram 13.517241
6:  0  1        mv 13.517241
7:  0  1        of  1.151899
8:  0  1       the  0.214648
9:  0  1        to  3.633922

> token.dt[[4]][J("0", "1")]
    V1 V2        V3      V4       mi4
 1:  0  1         0       1 10.507784
 2:  0  1         2       3 11.541023
 3:  0  1         f     the  3.927859
 4:  0  1         m neutral 13.621798
 5:  0  1 milligram      of  3.852570
 6:  0  1 milligram     per 10.638304
 7:  0  1        mv       m 11.260860
 8:  0  1        of  making 12.235372
 9:  0  1       the  number  9.707556
10:  0  1        to       0 12.669723
11:  0  1        to       5 11.158356

Here, the trigram (sequence of three words) 0 1 0 is not retained. The bigram that shares its first two words (0 1) does have a lower score (9.803035 > 7.135725), but the 4-gram whose first three words match the trigram (0 1 0 1) has a higher score than the trigram in question (10.507784 > 9.803035).

The trigram 0 1 milligram is retained because its score is higher than that of the bigram that shares the first two words (13.517241 > 7.135725) and those of the two 4-grams whose first three words match the trigram (13.517241 > 3.852570, 13.517241 > 10.638304).
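The retention rule for these two trigrams can be checked by hand with a few lines of base R. This is only a sketch of the rule as described above: the helper name `retain` and its argument names are made up for illustration, and the scores are copied from the printed tables.

```r
# Hypothetical helper: TRUE when an n-gram's score is higher than the score
# of its (n-1)-gram prefix and of every (n+1)-gram that extends it.
retain <- function(score, shorter, longer) {
  score > max(shorter, longer)
}

# Trigram "0 1 0": beats the bigram "0 1" but not the 4-gram "0 1 0 1".
keep_0_1_0 <- retain(9.803035, shorter = 7.135725, longer = 10.507784)

# Trigram "0 1 milligram": beats the bigram and both of its 4-gram extensions.
keep_0_1_milligram <- retain(13.517241, shorter = 7.135725,
                             longer = c(3.852570, 10.638304))

print(c(keep_0_1_0, keep_0_1_milligram))
# [1] FALSE  TRUE
```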

With the column names hard-coded, the task above can be done in the following way.

> z <- token.dt[[4]][token.dt[[3]][token.dt[[2]], allow.cartesian = TRUE], list(k = all(mi3 > max(mi2, mi4)), mi3), allow.cartesian = TRUE][(k)]
> head(z)
   V1 V2        V3    k      mi3
1:  0  1 milligram TRUE 13.51724
2:  0  1        mv TRUE 13.51724
3:  0 15         g TRUE 12.24260
4:  0  2      gram TRUE 13.52079
5:  0  2     mrads TRUE 13.34449
6:  0  3        wt TRUE 13.28771

What I would like to know is how to do the above programmatically, that is, without hard-coding column names (e.g., mi3, mi4, etc.).

Failed Attempts

Simply creating character strings with the paste0 function and adding the argument with = FALSE does not seem to work.

> i <- 3
> z <- token.dt[[i + 1]][token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE], list(k = all(paste0("mi", i) > max(paste0("mi", i - 1), paste0("mi", i + 1))), paste0("mi", i)), with = FALSE, allow.cartesian = TRUE][(k)]
Error in abs(j) : non-numeric argument to mathematical function
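A likely reason this attempt fails, independent of data.table, is that paste0 returns character strings, so the comparison inside list() compares the strings themselves rather than the columns they name. The base R sketch below illustrates only that point; the variable names are chosen here for illustration.

```r
i <- 3
target  <- paste0("mi", i)      # the string "mi3", not the column mi3
shorter <- paste0("mi", i - 1)  # "mi2"
longer  <- paste0("mi", i + 1)  # "mi4"

# max() on character vectors picks the string that sorts last,
# and ">" then compares two strings.
string_max <- max(shorter, longer)
string_cmp <- target > string_max

print(is.character(target))
print(string_max)
print(string_cmp)
```

No score is ever looked at here: the result depends only on how the names sort.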

Evaluating the character strings on the spot fails because the columns cannot be found. Adding envir = .SD to the eval calls below leads to the same error.

> z <- token.dt[[i + 1]][token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE], list(k = all(eval(parse(text = paste0("mi", i))) > max(eval(parse(text = paste0("mi", i - 1))), eval(parse(text = paste0("mi", i + 1))))), eval(parse(text = paste0("mi", i)))), allow.cartesian = TRUE][(k)]
Error in eval(expr, envir, enclos) : object 'mi3' not found
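One plausible explanation, which rests on my understanding of data.table internals rather than on anything shown above, is that data.table decides which columns to make available by inspecting the symbols in the unevaluated j expression. When each name is hidden inside its own eval(parse(...)) call, no column name appears as a symbol. Base R's all.vars shows the difference:

```r
i <- 3

# Piecewise: the column name only exists as a string built at run time.
piecewise <- quote(eval(parse(text = paste0("mi", i))))

# Whole expression parsed up front: the column names are real symbols.
whole <- parse(
  text = paste0("mi", i, " > max(mi", i - 1, ", mi", i + 1, ")")
)[[1]]

vars_piecewise <- all.vars(piecewise)
vars_whole <- all.vars(whole)

print(vars_piecewise)
# [1] "i"
print(vars_whole)
# [1] "mi3" "mi2" "mi4"
```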

The only approach that has worked so far is to first join the necessary data tables into one and then proceed in the same way as above.

> for (j in 2:4) {
+   if (j == 2) {
+     all <- copy(token.dt[[j]])
+   } else {
+     all <- token.dt[[j]][all, allow.cartesian = TRUE]
+   }
+ }

> head(all)
   V1  V2 V3       V4       mi4       mi3       mi2
1:  0   0  1        0 13.292479  9.766820  6.494179
2:  0 001 in    apart 13.233742  5.624795 13.249067
3:  0 001 in    thick 13.005608  5.624795 13.249067
4:  0 002 on     each 10.416711  7.301489 13.249067
5:  0 005  m       in  5.625874 11.205271 13.249067
6:  0 025 in aluminum 13.443647  5.624795 13.249067

> z <- all[1:1000 , list(k = all(eval(parse(text = paste0("mi", i)), envir = .SD) > max(eval(parse(text = paste0("mi", i - 1)), envir = .SD), eval(parse(text = paste0("mi", i + 1)), envir = .SD))), mi = eval(parse(text = paste0("mi", i)), envir = .SD)), by = c(paste0("V", 1:i))][(k)]
> z <- unique(z)
> head(z)
   V1 V2        V3    k       mi
1:  0  1 milligram TRUE 13.51724
2:  0  1        mv TRUE 13.51724
3:  0 15         g TRUE 12.24260
4:  0  2      gram TRUE 13.52079
5:  0  2     mrads TRUE 13.34449
6:  0  3        wt TRUE 13.28771

However, this is unacceptably slow. Processing 1,000 rows (above) out of 970,696 rows takes more than five seconds. Given that the corpus I'm using here is much smaller than the corpus I want to apply the algorithm to, I am seeking ways to speed up the process.

Reproducible Example

The simulated data set below should work to illustrate the point.

library(data.table)

token.dt <- list()
types <- combn(LETTERS, 3, paste, collapse = "")
set.seed(1)
data <- data.table(matrix(sample(types, 4 * 1E6, replace = TRUE), ncol = 4))
setkey(data, V1, V2, V3, V4)
set.seed(1)
for (n in 2:4) {
    token.dt[[n]] <- unique(cbind(data[ , 1:n, with = FALSE]))
    token.dt[[n]][ , paste0("mi", n) := runif(nrow(token.dt[[n]])) * 10]
}

Any suggestions are appreciated.


Solution

For the eval approach to work, you have to build the whole expression first and then eval it. I ran this on a reduced version of your sample (40 values instead of 4e6):

i <- 3
x <- parse(
  text=paste0(
    "list(k = all(mi", i, " > max(mi", i - 1, 
    ", mi", i + 1, ")), mi", i, ")"
) )  
token.dt[[i + 1]][
  token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE], 
  eval(x),
  allow.cartesian = TRUE
][(k)]
#     V1  V2  V3    k      mi3
# 1: CIX BQV OWY TRUE 6.870228
# 2: GIU IJM HMO TRUE 7.698414
# 3: NQR FHN DOY TRUE 9.919061
# 4: PSX IPQ ACN TRUE 7.774452

As you can see, referring to columns programmatically works. This ran in about 3 seconds on my system with your full data set (4 million values).
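If pasting code together as text feels fragile, the same expression can probably be built from symbols with bquote and as.name. The sketch below is an alternative I am suggesting, not part of the tested solution above; `build_j` is a hypothetical helper, and the toy list stands in for one group of joined rows, using scores from the question.

```r
# Hypothetical helper: build the j expression for an arbitrary order i.
build_j <- function(i) {
  mi <- function(k) as.name(paste0("mi", k))  # turn "mi3" into the symbol mi3
  bquote(list(k = all(.(mi(i)) > max(.(mi(i - 1)), .(mi(i + 1)))), .(mi(i))))
}

j3 <- build_j(3)
print(j3)
# list(k = all(mi3 > max(mi2, mi4)), mi3)

# Toy stand-in for one trigram ("0 1 milligram") and its two 4-grams.
toy <- list(mi2 = 7.135725, mi3 = 13.517241, mi4 = c(3.852570, 10.638304))
res <- eval(j3, toy)
print(res$k)
# [1] TRUE
```

In the data.table call, `eval(j3)` would take the place of `eval(x)`. Also note that data.table versions from 1.9.4 on evaluate j once for the whole join unless by = .EACHI is supplied, so the per-trigram grouping that the join above relies on may need that argument on a current installation.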

Licensed under: CC-BY-SA with attribution