First column of csv file as document number in calculating Document-Term matrix in R

https://stackoverflow.com/questions/24035531

21-12-2019

Question

My test.csv file contains the following:

id,name
143,The sky is blue.
21,The sun is bright.
23,The sun in the sky is bright.

Now, I can read the whole file like this:

> file_loc <- "test.csv"
> x <- read.csv(file_loc, header = TRUE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)

> require(tm)
  Loading required package: tm

> dd <- Corpus(DataframeSource(x))
> dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))

The resulting matrix treats the id as just another term:

> as.matrix(dtm)
          Terms
      Docs       143     blue.   bright.       sky       sun the
         1 0.3962406 0.3962406 0.0000000 0.1462406 0.0000000   0
         2 0.0000000 0.0000000 0.1949875 0.0000000 0.1949875   0
         3 0.0000000 0.0000000 0.1169925 0.1169925 0.1169925   0

What I want is to use the id column of the CSV file as the document names, like this:

          Terms
      Docs     blue.   bright.       sky       sun the
       143 0.3962406 0.0000000 0.1462406 0.0000000   0
        21 0.0000000 0.1949875 0.0000000 0.1949875   0
        23 0.0000000 0.1169925 0.1169925 0.1169925   0

Can anybody tell me how to achieve the desired result?

Solution

Here's an approach with qdap + tm:

library(qdap)

dat <- read.transcript(text="id,name
143,The sky is blue.
21,The sun is bright.
23,The sun in the sky is bright.", sep=",", header=TRUE)

## dat <- read.transcript("test.csv", sep=",")

dd <- df2tm_corpus(dat[, 2], dat[, 1])
library(tm)
as.matrix(DocumentTermMatrix(dd, control = list(weighting = weightTfIdf)))

##      Terms
## Docs      blue.   bright.       sky       sun the
##   143 0.5283208 0.0000000 0.1949875 0.0000000   0
##   21  0.0000000 0.1949875 0.0000000 0.1949875   0
##   23  0.0000000 0.1169925 0.1169925 0.1169925   0
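
If you would rather not depend on qdap, a tm-only variant may also work. The following is a minimal sketch, not the approach from the answer above: it assumes the corpus is built from the text column alone (via VectorSource), so the id never enters the vocabulary, and the row names are then set by hand on the dense matrix. The variable names x and m are illustrative.

```r
library(tm)

# Same data as in the question, inlined so the sketch is self-contained.
x <- read.csv(text = "id,name
143,The sky is blue.
21,The sun is bright.
23,The sun in the sky is bright.", header = TRUE, stringsAsFactors = FALSE)

# Build the corpus from the text column only, so the ids never become terms.
dd <- VCorpus(VectorSource(x$name))
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))

# Convert to a plain matrix and label the rows with the id column.
# Rows come out in document order, which matches the row order of x.
m <- as.matrix(dtm)
rownames(m) <- as.character(x$id)
m
```

Because the labels are attached after the matrix is built, this relies on the documents staying in the same order as the rows of x; if you filter or reorder the corpus, apply the same change to the id vector.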

Licensed under: CC-BY-SA with attribution