Pulling information from the first approximate match of a text string in R (and summing the total number of matches)

https://stackoverflow.com/questions/12903368

07-07-2021
|

Question

I'm having trouble summing approximate matches of text strings, as well as pulling information from the string that was matched first in time.

I have data that look like this:

text<-c("THEN it goes West","AT it falls East","it goes West", "it falls East", "AT it goes West")
date<-c(2008,2009,2003,2006,2011)
ID<-c(1,2,3,4,5)
data<-cbind(text,date,ID)
data<-as.data.frame(data)

Notice that the latest text strings have all-caps "THEN" and "AT" added to the earlier text strings.

I would like a table that looks like this:

     ID  Sum Originaltext     Originaldate
[1,] "4" "3" "it goes West"      "2003"      
[2,] "2" "2" "it falls East"     "2006"

This includes:

The ID number corresponding with the text with the earliest date (the "original" text that the others were derived from). Sums of all approximate matches for each. The text corresponding with the earliest date. And the date of the text corresponding with the earliest date.

I have tens of millions of cases, so I'm having trouble automating the process.

I run Windows 7, and have access to fast-computing servers.

IDEAS

#order them backwards in time
data<-data[order(data$date, decreasing = TRUE),]

#find the strings with the latest date

pattern<-"AT|THEN"

k <- vector("list", length(data$text))

 for (j in 1:length(data$text)){
     k[[j]]<- grep(pattern,data$text[[j]], ignore.case=FALSE)
}

k<-subset(data$text, k==1)

k<-unique(k)

#this is a problem, because case nos. 1 and 5 are still in the dataset, but they derive from the same tweet.

From here, I can use "agrep", but I'm not sure in what context. Any help would be greatly appreciated!

NOTE: While the three answers below do answer my question the way I originally asked it, I have not mentioned that my text cases do vary even without the words "AT" and "THEN". In fact, most of them do not match exactly. I should have put this in the original question. However, I would still love an answer.

Thanks!

Solution

A data.table solution avoiding stringr. I am sure this could be improved

Dealing with text data

# make the factor columns character
.data <- lapply(data, function(x) if(is.factor(x)) {as.character(x)} else { x})
library(data.table)
DT <- as.data.table(.data)


DT[, original_text := text]
# using `%like% which is an easy data.table wrapper for grepl
DT[text %like% "^THEN", text := substr(text, 6, nchar(text))]
DT[text %like% "^AT", text :=  substr(text, 4, nchar(text))]

# or avoiding the two vector scans and replacing in one fell swoop
DT[,text := gsub('(^THEN )|(^AT )', '', text)]

DT[, c(sum=.N, .SD[which.min(date)]) ,by=text]

using factor levels (could be faster)

# assuming that text is a factor
DTF <- as.data.table(data) 
DTF[, original_text := text]
levels_text <- DTF[, levels(text)]
new_levels <- gsub('(^THEN )|(^AT )', x= levels_text ,'')
# reset the levels
setattr(DTF[['text']], 'levels', new_levels)
# coerce to character and do the same count / min date
DTF[, c(sum=.N, .SD[which.min(date)]) ,by=list(text = as.character(text))]

OTHER TIPS

I'm going to give you a base solution but I really think this is a big problem for base and the data.table package is what is needed (but I don't know how to use data.table very well:

dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
    c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]), 
        Original.Date = as.character(x[1, 2]))
}

data.frame(do.call(rbind, lapply(dat2, FUN)), row.names = NULL)

I don't really know how close each text string is so maybe my exact matching is not appropriate but if that's the case use agrep to develop a new variable. Sorry for the lack of annotations but I am pressed for time and I think data.table is more appropriate anyway.

EDIT: I still think that data.table is better and should be out the door but maybe running in parallel is smart. You're on a windows machine so this would work to use multiple cores of a computer:

dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
    c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]), 
        Original.Date = as.character(x[1, 2]))
}

library(parallel)
detectCores()  #make sure you have > 1 core

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("FUN", "dat2"), envir=environment())
x <- parLapply(cl, dat2, FUN)
stopCluster(cl)  #stop the cluster
data.frame(do.call(rbind, x), row.names = NULL)

plyr might be too slow given the number of records you mention, but here is a solution for you:

library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))

result <- ddply(data, .(text), function(x) {
     sum <- nrow(x)
     x <- x[which(x$date==min(x$date)),]
    return(data.frame(id=unique(x$ID), Sum = sum, Originaltext = unique(x$original_text), Originaldate = unique(x$date)))
    })

> result[, -1]
  id Sum  Originaltext Originaldate
1  4   2 it falls East         2006
2  3   3  it goes West         2003

If you have access to a multicore machine (4 or more cores), then here is a HPC solution

library(multicore)
library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))

fux <- function(foo) {
     sum <- nrow(x)
     x <- x[which(x$date==min(x$date)),]
    return(data.frame(id=unique(x$ID), Sum = sum, Originaltext = unique(x$original_text), Originaldate = unique(x$date)))
}

x <- split(data, data$text)
result <- mclapply(x, fux, mc.cores = 4, mc.preschedule = TRUE)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow