Question

I have data in R as follows:

library(stringr)  # provides str_count() and str_split() used below
Text <- c("reuce FR563 323 aldk", "vard 432", "DK123 fg4d", "matten global height")
ID <- c("S1", "S2", "S3", "S4")
data <- data.frame(ID, Text)
data$noofwords <- sapply(data$Text, str_count, "[[:space:]]") + 1
data$Text <- as.character(data$Text)
data$ID <- as.character(data$ID)


data
  ID                 Text noofwords
1 S1 reuce FR563 323 aldk         4
2 S2             vard 432         2
3 S3           DK123 fg4d         2
4 S4 matten global height         3

I want to split each string in the Text column into words and put every word in its own row of a new data.frame, along with the corresponding ID and Text values.

The following script with nested for loops does the job, but is there any way to vectorise it? It is very slow for large datasets.

keyword <- "keyword"
text <- "text"
ID <- "ID"
Index <- data.frame(keyword,text,ID)
Index[,1:3] <- as.character(Index[,1:3])

n <- nrow(data)
for (i in 1:n) {
  k <- data[i,"noofwords"]
  kwv <- str_split(data[i,"Text"], " ", n = Inf)
  kwv <- unlist(kwv, recursive = TRUE, use.names = FALSE)
  for (j in 1:k){
    kw <- kwv[j]
    tex <- (data[i,"Text"])
    nid <- (data[i, "ID"])
    Index <- rbind(Index, c(kw,tex,nid))
  }
}


Index
   keyword                 text ID
1        1                    1  1
2    reuce reuce FR563 323 aldk S1
3    FR563 reuce FR563 323 aldk S1
4      323 reuce FR563 323 aldk S1
5     aldk reuce FR563 323 aldk S1
6     vard             vard 432 S2
7      432             vard 432 S2
8    DK123           DK123 fg4d S3
9     fg4d           DK123 fg4d S3
10  matten matten global height S4
11  global matten global height S4
12  height matten global height S4

Also, why is an extra first row containing all 1s created?


Solution

This uses the data.table package and should be relatively quick. Do check your column types, because the strings in your example data get converted to factors (the data.frame() default before R 4.0.0), so I used stringsAsFactors = FALSE when recreating it.
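A minimal sketch of that recreation, assuming only the ID and Text columns are needed (noofwords is left out, which matches the columns in the output shown for this answer):

```r
# Recreate the example data with character columns rather than factors.
# stringsAsFactors = FALSE is already the default from R 4.0.0 onwards,
# but stating it explicitly keeps the behaviour the same on older versions.
Text <- c("reuce FR563 323 aldk", "vard 432", "DK123 fg4d", "matten global height")
ID <- c("S1", "S2", "S3", "S4")
data <- data.frame(ID, Text, stringsAsFactors = FALSE)

# Check the column types; both should report "character"
sapply(data, class)
```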

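library(data.table)
# Key on ID so the split words can be joined back to the original rows
dt <- data.table(data, key = "ID")
# Inner call: one row per word, grouped by ID; the outer dt[...] joins Text back on ID
dt[dt[, list(Keyword = unlist(strsplit(Text, " "))), by = ID]]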
#    ID                 Text Keyword
# 1: S1 reuce FR563 323 aldk   reuce
# 2: S1 reuce FR563 323 aldk   FR563
# 3: S1 reuce FR563 323 aldk     323
# 4: S1 reuce FR563 323 aldk    aldk
# 5: S2             vard 432    vard
# 6: S2             vard 432     432
# 7: S3           DK123 fg4d   DK123
# 8: S3           DK123 fg4d    fg4d
# 9: S4 matten global height  matten
#10: S4 matten global height  global
#11: S4 matten global height  height
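
If adding a package is not an option, roughly the same result can be sketched in base R. This is an alternative technique, not the data.table method above: split every string once with strsplit(), then repeat each row index once per word. It assumes data has character ID and Text columns; the names words, rows and Index are only illustrative.

```r
# Example data, recreated so the block is self-contained
data <- data.frame(
  ID   = c("S1", "S2", "S3", "S4"),
  Text = c("reuce FR563 323 aldk", "vard 432", "DK123 fg4d", "matten global height"),
  stringsAsFactors = FALSE
)

# One character vector of words per row of data
words <- strsplit(data$Text, " ", fixed = TRUE)

# Row index i is repeated once for each word found in data$Text[i]
rows <- rep(seq_len(nrow(data)), times = lengths(words))

# Build the whole result in one step instead of growing it with rbind()
Index <- data.frame(
  keyword = unlist(words),
  text    = data$Text[rows],
  ID      = data$ID[rows],
  stringsAsFactors = FALSE
)
Index
```

Because Index is built in one step, there is no seed row to clean up. In the loop version, the extra first row comes from the initial data.frame(keyword, text, ID) call, which creates a one-row data frame holding the strings "keyword", "text" and "ID". When those columns are factors, as.character() applied to the whole data frame returns their integer codes rather than their labels, hence the 1s.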
Licensed under: CC-BY-SA with attribution