Question

I have a large data set in the following format, where on each line there is a document, encoded as word:freqency-in-the-document, separated by space; lines can be of variable length:

aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...

E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is to create a little search engine, where the documents (in the same format) matching a query are ranked; I though about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather like to use these data, unless the format is truly something alien and cannot be converted).

What I've tried so far is to import the lines and split them:

docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")

I figured I would put something like this in a loop:

doclist2 <- strsplit(doclist, ":", fixed=TRUE)

and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appripriate values from the word:freq pairs (would that in itself be a good idea to construct a matrix?). But this way of converting does not seem like a good way to do it, the lists keep getting more complicated, and I wouldn't still know how to get to the point where I can populate the matrix.

What I (think I) would need in the end is a matrix like this:

        doc1 doc2 doc3 doc4 ...
aword   3    0    0    0 
bword   2    4    0    0
cword:  15   20   0    0
dword   2    0    0    0
fword:  0    1    0    0
...

which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).

What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?

Was it helpful?

Solution

Here's an approach that gets you the output you say you might want:

## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on a spaces and colons    
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x) 
  cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
                                dimnames = list(NULL, c("word", "count"))))))

## Convert to a data.frame
out <- data.frame(out)
out
#    document  word count
# 1 document1 aword     3
# 2 document1 bword     2
# 3 document1 cword    15
# 4 document1 dword     2
# 5 document2 bword     4
# 6 document2 cword    20
# 7 document2 fword     1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))

## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
#        document
# word    document1 document2
#   aword         3         0
#   bword         2         4
#   cword        15        20
#   dword         2         0
#   fword         0         1

Note: Answer edited to use matrices in the creation of "out" to minimize the number of calls to read.table which would be a major bottleneck with bigger data.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top