Pulling information from the first approximate match of a text string in R (and summing the total number of matches)
-
07-07-2021 - |
Question
I'm having trouble summing approximate matches of text strings, as well as pulling information from the string that was matched first in time.
I have data that look like this:
text<-c("THEN it goes West","AT it falls East","it goes West", "it falls East", "AT it goes West")
date<-c(2008,2009,2003,2006,2011)
ID<-c(1,2,3,4,5)
data<-cbind(text,date,ID)
data<-as.data.frame(data)
Notice that the latest text strings have all-caps "THEN" and "AT" added to the earlier text strings.
I would like a table that looks like this:
ID Sum Originaltext Originaldate
[1,] "4" "3" "it goes West" "2003"
[2,] "2" "2" "it falls East" "2006"
This includes:
The ID number corresponding with the text with the earliest date (the "original" text that the others were derived from). Sums of all approximate matches for each. The text corresponding with the earliest date. And the date of the text corresponding with the earliest date.
I have tens of millions of cases, so I'm having trouble automating the process.
I run Windows 7, and have access to fast-computing servers.
IDEAS
#order them backwards in time
data<-data[order(data$date, decreasing = TRUE),]
#find the strings with the latest date
pattern<-"AT|THEN"
k <- vector("list", length(data$text))
for (j in 1:length(data$text)){
k[[j]]<- grep(pattern,data$text[[j]], ignore.case=FALSE)
}
k<-subset(data$text, k==1)
k<-unique(k)
#this is a problem, because case nos. 1 and 5 are still in the dataset, but they derive from the same tweet.
From here, I can use "agrep", but I'm not sure in what context. Any help would be greatly appreciated!
NOTE: While the three answers below do answer my question the way I originally asked it, I have not mentioned that my text cases do vary even without the words "AT" and "THEN". In fact, most of them do not match exactly. I should have put this in the original question. However, I would still love an answer.
Thanks!
Solution
A data.table
solution avoiding stringr
. I am sure this could be improved
Dealing with text data
# make the factor columns character
.data <- lapply(data, function(x) if(is.factor(x)) {as.character(x)} else { x})
library(data.table)
DT <- as.data.table(.data)
DT[, original_text := text]
# using `%like% which is an easy data.table wrapper for grepl
DT[text %like% "^THEN", text := substr(text, 6, nchar(text))]
DT[text %like% "^AT", text := substr(text, 4, nchar(text))]
# or avoiding the two vector scans and replacing in one fell swoop
DT[,text := gsub('(^THEN )|(^AT )', '', text)]
DT[, c(sum=.N, .SD[which.min(date)]) ,by=text]
using factor levels (could be faster)
# assuming that text is a factor
DTF <- as.data.table(data)
DTF[, original_text := text]
levels_text <- DTF[, levels(text)]
new_levels <- gsub('(^THEN )|(^AT )', x= levels_text ,'')
# reset the levels
setattr(DTF[['text']], 'levels', new_levels)
# coerce to character and do the same count / min date
DTF[, c(sum=.N, .SD[which.min(date)]) ,by=list(text = as.character(text))]
OTHER TIPS
I'm going to give you a base solution but I really think this is a big problem for base and the data.table
package is what is needed (but I don't know how to use data.table very well:
dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]),
Original.Date = as.character(x[1, 2]))
}
data.frame(do.call(rbind, lapply(dat2, FUN)), row.names = NULL)
I don't really know how close each text string is so maybe my exact matching is not appropriate but if that's the case use agrep
to develop a new variable. Sorry for the lack of annotations but I am pressed for time and I think data.table
is more appropriate anyway.
EDIT: I still think that data.table is better and should be out the door but maybe running in parallel is smart. You're on a windows machine so this would work to use multiple cores of a computer:
dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]),
Original.Date = as.character(x[1, 2]))
}
library(parallel)
detectCores() #make sure you have > 1 core
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("FUN", "dat2"), envir=environment())
x <- parLapply(cl, dat2, FUN)
stopCluster(cl) #stop the cluster
data.frame(do.call(rbind, x), row.names = NULL)
plyr
might be too slow given the number of records you mention, but here is a solution for you:
library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))
result <- ddply(data, .(text), function(x) {
sum <- nrow(x)
x <- x[which(x$date==min(x$date)),]
return(data.frame(id=unique(x$ID), Sum = sum, Originaltext = unique(x$original_text), Originaldate = unique(x$date)))
})
> result[, -1]
id Sum Originaltext Originaldate
1 4 2 it falls East 2006
2 3 3 it goes West 2003
If you have access to a multicore machine (4 or more cores), then here is a HPC solution
library(multicore)
library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))
fux <- function(foo) {
sum <- nrow(x)
x <- x[which(x$date==min(x$date)),]
return(data.frame(id=unique(x$ID), Sum = sum, Originaltext = unique(x$original_text), Originaldate = unique(x$date)))
}
x <- split(data, data$text)
result <- mclapply(x, fux, mc.cores = 4, mc.preschedule = TRUE)