This is a follow-up to this question that I asked the other day (in hindsight, I should have asked both at the same time).
Data
token.dt is a list of data tables, one for each n in the n-grams; each table contains the n-grams (i.e., sequences of n words) and their scores.
> head(token.dt[[2]])
V1 V2 mi2
1: 0 0 6.494179
2: 0 001 13.249067
3: 0 002 13.249067
4: 0 005 13.249067
5: 0 025 13.249067
6: 0 039 13.249067
> head(token.dt[[5]])
V1 V2 V3 V4 V5 mi5
1: 0 0 1 0 1 10.353265
2: 0 001 in apart for 6.807743
3: 0 001 in thick and 5.190449
4: 0 002 on each side 11.688710
5: 0 005 m in f 9.940322
6: 0 025 in aluminum which 8.249075
Task
The task is to select the n-grams (i.e., rows of the tables in token.dt) that satisfy the following condition. An n-gram is retained only if its score is higher than the scores of the (n-1)-gram and the (n+1)-grams identified as follows (a toy illustration follows the list):
- the (n-1)-gram that matches the first n-1 words of the n-gram, and
- the (n+1)-grams whose first n words match the n-gram.
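In code, the retention rule amounts to the comparison below (a minimal, self-contained sketch with made-up words and scores, not my real data):
library(data.table)

# Toy tables mirroring the structure of token.dt[[2]], [[3]] and [[4]]
bi   <- data.table(V1 = "a", V2 = "b", mi2 = 1.0)
tri  <- data.table(V1 = "a", V2 = "b", V3 = "c", mi3 = 5.0)
quad <- data.table(V1 = "a", V2 = "b", V3 = "c", V4 = c("d", "e"),
                   mi4 = c(2.0, 6.0))

# Retain the trigram "a b c" only if mi3 beats the prefix bigram's mi2 and
# every 4-gram extending it; here 5.0 > 1.0 but 5.0 < 6.0, so it is dropped.
tri$mi3 > bi$mi2 && all(tri$mi3 > quad$mi4)
# [1] FALSE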
By way of example, consider the following.
> for (i in 2:n) setkeyv(token.dt[[i]], paste0("V", 1:i))
> token.dt[[2]][J("0", "1")]
V1 V2 mi2
1: 0 1 7.135725
> token.dt[[3]][J("0", "1")]
V1 V2 V3 mi3
1: 0 1 0 9.803035
2: 0 1 2 6.809646
3: 0 1 f 6.142258
4: 0 1 m 7.315181
5: 0 1 milligram 13.517241
6: 0 1 mv 13.517241
7: 0 1 of 1.151899
8: 0 1 the 0.214648
9: 0 1 to 3.633922
> token.dt[[4]][J("0", "1")]
V1 V2 V3 V4 mi4
1: 0 1 0 1 10.507784
2: 0 1 2 3 11.541023
3: 0 1 f the 3.927859
4: 0 1 m neutral 13.621798
5: 0 1 milligram of 3.852570
6: 0 1 milligram per 10.638304
7: 0 1 mv m 11.260860
8: 0 1 of making 12.235372
9: 0 1 the number 9.707556
10: 0 1 to 0 12.669723
11: 0 1 to 5 11.158356
Here, the trigram (sequence of three words) 0 1 0 is not retained: although its score is higher than that of the bigram sharing its first two words, 0 1 (9.803035 > 7.135725), the 4-gram whose first three words match it, 0 1 0 1, has a higher score than the trigram in question (10.507784 > 9.803035).
The trigram 0 1 milligram, by contrast, is retained because its score is higher than that of the bigram sharing its first two words (13.517241 > 7.135725) and those of the two 4-grams whose first three words match it (13.517241 > 3.852570 and 13.517241 > 10.638304).
With the column names hard-coded, the task above is achieved in the following way.
> z <- token.dt[[4]][token.dt[[3]][token.dt[[2]], allow.cartesian = TRUE],
+                    list(k = all(mi3 > max(mi2, mi4)), mi3),
+                    allow.cartesian = TRUE][(k)]
> head(z)
V1 V2 V3 k mi3
1: 0 1 milligram TRUE 13.51724
2: 0 1 mv TRUE 13.51724
3: 0 15 g TRUE 12.24260
4: 0 2 gram TRUE 13.52079
5: 0 2 mrads TRUE 13.34449
6: 0 3 wt TRUE 13.28771
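For readability, the same chained join can be unpacked into two steps (a sketch that should behave identically; on recent versions of data.table, by = .EACHI may be needed in the second step to reproduce the per-join-group evaluation of j):
# Step 1: join the bigrams into the trigram table on the shared key prefix
# (V1, V2), so each trigram row carries the mi2 of its prefix bigram.
tri.bi <- token.dt[[3]][token.dt[[2]], allow.cartesian = TRUE]

# Step 2: join that result into the 4-gram table on (V1, V2, V3); for each
# trigram, k is TRUE only if mi3 beats mi2 and every matching mi4.
z <- token.dt[[4]][tri.bi,
                   list(k = all(mi3 > max(mi2, mi4)), mi3),
                   allow.cartesian = TRUE][(k)]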
What I would like to know is how to do the above programmatically, that is, without hard-coding column names (e.g., mi3, mi4, etc.).
Failed Attempts
Simply creating the column names as character strings with the paste0 function and adding the argument with = FALSE does not work.
> i <- 3
> z <- token.dt[[i + 1]][token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE],
+                        list(k = all(paste0("mi", i) > max(paste0("mi", i - 1), paste0("mi", i + 1))),
+                             paste0("mi", i)),
+                        with = FALSE, allow.cartesian = TRUE][(k)]
Error in abs(j) : non-numeric argument to mathematical function
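As far as I can tell, this fails on two counts: with = FALSE makes data.table treat j as a vector of column names or positions (hence the abs(j) error), and even aside from that, the paste0 calls only produce character strings, which are compared lexicographically rather than looked up as columns:
# "mi3" is compared with max("mi2", "mi4") as plain strings, not as columns:
paste0("mi", 3) > max(paste0("mi", 2), paste0("mi", 4))
# [1] FALSE  ("mi3" sorts before "mi4")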
Evaluating the character strings on the spot with eval(parse(...)) fails to find the columns. Adding envir = .SD to each of the eval calls below led to the same error as shown below.
> z <- token.dt[[i + 1]][token.dt[[i]][token.dt[[i - 1]], allow.cartesian = TRUE],
+                        list(k = all(eval(parse(text = paste0("mi", i))) >
+                                     max(eval(parse(text = paste0("mi", i - 1))),
+                                         eval(parse(text = paste0("mi", i + 1))))),
+                             eval(parse(text = paste0("mi", i)))),
+                        allow.cartesian = TRUE][(k)]
Error in eval(expr, envir, enclos) : object 'mi3' not found
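The reason is that parse only builds an unevaluated expression; eval then needs an environment in which the column is visible, and the j of the join apparently does not provide one:
# parse() returns an unevaluated expression; eval() must find mi3 somewhere:
e <- parse(text = paste0("mi", 3))
e
# expression(mi3)
eval(e)
# Error: object 'mi3' not found  (nothing named mi3 exists in the calling scope)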
The only approach that works so far is to first join the necessary data tables into one and then proceed as above.
> for (j in 2:4) {
+ if (j == 2) {
+ all <- copy(token.dt[[j]])
+ } else {
+ all <- token.dt[[j]][all, allow.cartesian = TRUE]
+ }
+ }
> head(all)
V1 V2 V3 V4 mi4 mi3 mi2
1: 0 0 1 0 13.292479 9.766820 6.494179
2: 0 001 in apart 13.233742 5.624795 13.249067
3: 0 001 in thick 13.005608 5.624795 13.249067
4: 0 002 on each 10.416711 7.301489 13.249067
5: 0 005 m in 5.625874 11.205271 13.249067
6: 0 025 in aluminum 13.443647 5.624795 13.249067
> z <- all[1:1000,
+          list(k = all(eval(parse(text = paste0("mi", i)), envir = .SD) >
+                       max(eval(parse(text = paste0("mi", i - 1)), envir = .SD),
+                           eval(parse(text = paste0("mi", i + 1)), envir = .SD))),
+               mi = eval(parse(text = paste0("mi", i)), envir = .SD)),
+          by = c(paste0("V", 1:i))][(k)]
> z <- unique(z)
> head(z)
V1 V2 V3 k mi
1: 0 1 milligram TRUE 13.51724
2: 0 1 mv TRUE 13.51724
3: 0 15 g TRUE 12.24260
4: 0 2 gram TRUE 13.52079
5: 0 2 mrads TRUE 13.34449
6: 0 3 wt TRUE 13.28771
However, this is unacceptably slow: processing 1,000 rows (above) out of 970,696 takes more than five seconds. Given that the corpus used here is much smaller than the one I ultimately want to apply the algorithm to, I am seeking ways to speed up the process.
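Extrapolating the measured rate to the full table makes the problem concrete:
# Rough arithmetic: five seconds per 1,000 rows, scaled to all 970,696 rows
5 * 970696 / 1000
# [1] 4853.48  -- i.e., well over an hour for this corpus alone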
Reproducible Example
The simulated data set below should serve to illustrate the point.
library(data.table)

token.dt <- list()
types <- combn(LETTERS, 3, paste, collapse = "")
set.seed(1)
data <- data.table(matrix(sample(types, 4 * 1E6, replace = TRUE), ncol = 4))
setkey(data, V1, V2, V3, V4)
set.seed(1)
for (n in 2:4) {
    # Unique n-grams: the first n word columns, deduplicated
    token.dt[[n]] <- unique(cbind(data[ , 1:n, with = FALSE]))
    # Random scores standing in for the mutual-information values
    token.dt[[n]][ , paste0("mi", n) := runif(nrow(token.dt[[n]])) * 10]
}
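As a quick check (run after the code above), the simulated list mirrors the column layout shown under Data:
# Each element has word columns V1..Vn plus one score column named paste0("mi", n)
sapply(token.dt[2:4], function(dt) paste(names(dt), collapse = ", "))
# [1] "V1, V2, mi2"        "V1, V2, V3, mi3"    "V1, V2, V3, V4, mi4"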
Any suggestions are appreciated.