First column of csv file as document number in calculating Document-Term matrix in R
21-12-2019
Question
My test.csv file contains the following:
id,name
143,The sky is blue.
21,The sun is bright.
23,The sun in the sky is bright.
Now, I can read the whole file like this:
> file_loc <- "test.csv"
> x <- read.csv(file_loc, header = TRUE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> require(tm)
Loading required package: tm
> dd <- Corpus(DataframeSource(x))
> dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
The resulting matrix is:
> as.matrix(dtm)
    Terms
Docs       143     blue.   bright.       sky       sun the
  1  0.3962406 0.3962406 0.0000000 0.1462406 0.0000000   0
  2  0.0000000 0.0000000 0.1949875 0.0000000 0.1949875   0
  3  0.0000000 0.0000000 0.1169925 0.1169925 0.1169925   0
What I want is to use the id column of the CSV file as the document names, like this:
     Terms
Docs      blue.   bright.       sky       sun the
  143 0.3962406 0.0000000 0.1462406 0.0000000   0
  21  0.0000000 0.1949875 0.0000000 0.1949875   0
  23  0.0000000 0.1169925 0.1169925 0.1169925   0
Can anybody guide me on how to achieve the desired result?
Solution
Here's an approach with qdap + tm:
library(qdap)
dat <- read.transcript(text="id,name
143,The sky is blue.
21,The sun is bright.
23,The sun in the sky is bright.", sep=",", header=TRUE)
## dat <- read.transcript("test.csv", sep=",")
dd <- df2tm_corpus(dat[, 2], dat[, 1])
library(tm)
as.matrix(DocumentTermMatrix(dd, control = list(weighting = weightTfIdf)))
##      Terms
## Docs      blue.   bright.       sky       sun the
##   143 0.5283208 0.0000000 0.1949875 0.0000000   0
##   21  0.0000000 0.1949875 0.0000000 0.1949875   0
##   23  0.0000000 0.1169925 0.1169925 0.1169925   0
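Note that the first row differs from the target in the question (0.5283208 rather than 0.3962406). Once 143 is no longer counted as a term, that document has three terms instead of four, so each normalized term frequency goes up. The weights can be checked by hand with a minimal base R sketch. It assumes tm's defaults: tokens shorter than three characters ("is", "in") are dropped, and weightTfIdf multiplies the normalized term frequency by log2 of the inverse document frequency. The `counts` matrix is typed in manually for illustration.

```r
# Term counts per document, entered by hand (illustrative, not produced by tm).
counts <- rbind(
  "143" = c(blue. = 1, bright. = 0, sky = 1, sun = 0, the = 1),
  "21"  = c(blue. = 0, bright. = 1, sky = 0, sun = 1, the = 1),
  "23"  = c(blue. = 0, bright. = 1, sky = 1, sun = 1, the = 2)
)

# Normalized term frequency: each count divided by the document's term total.
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of documents / documents containing the term).
idf <- log2(nrow(counts) / colSums(counts > 0))

# Multiply every column of tf by the idf of its term.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 7)
```

For document 143 this gives (1/3) * log2(3) for "blue." and (1/3) * log2(3/2) for "sky". With 143 included as a fourth term, the divisor becomes 4, which yields the 0.3962406 and 0.1462406 shown in the question.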
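If pulling in qdap is not an option, a tm-only variant may also work. This is a sketch, not a tested replacement for the code above. It assumes a recent tm release in which DataframeSource() expects a data frame whose first two columns are named `doc_id` and `text`; the `docs` and `m` names are introduced here for illustration.

```r
library(tm)

# Same sample data as in the question, read inline instead of from test.csv.
x <- read.csv(text = "id,name
143,The sky is blue.
21,The sun is bright.
23,The sun in the sky is bright.", header = TRUE, stringsAsFactors = FALSE)

# Assumption: this tm version requires the columns `doc_id` and `text`,
# so the id column becomes the document id and never enters the text.
docs <- data.frame(doc_id = as.character(x$id), text = x$name,
                   stringsAsFactors = FALSE)

dd  <- VCorpus(DataframeSource(docs))
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))

m <- as.matrix(dtm)
rownames(m)  # document names taken from the id column
```

Because the ids are passed as document identifiers rather than as part of the text, they should show up as row names of the matrix and not as terms.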
Licensed under: CC-BY-SA with attribution