Building a corpus from phrases

https://stackoverflow.com//questions/24038498

21-12-2019
Question

I have the following documents:

 doc1 = very good, very bad, you are great
 doc2 = very bad, good restaurent, nice place to visit

I want to build a DocumentTermMatrix from a Corpus whose terms are the comma-separated phrases, so that my final result looks like this:

      terms
 docs       very good      very bad        you are great   good restaurent   nice place to visit
  doc1       tf-idf          tf-idf         tf-idf          0                    0
  doc2       0                tf-idf         0                tf-idf             tf-idf

I know how to compute a DocumentTermMatrix over individual words, but I don't know how to build a corpus separated by phrase in R. A solution in R is preferred, but solutions in Python are also welcome.

What I have tried:

> library(tm)
> library(RWeka)
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
> options(mc.cores=1)
> texts <- c("very good, very bad, you are great","very bad, good restaurent, nice place to visit")
> corpus <- Corpus(VectorSource(texts))
> a <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
> as.matrix(a)

I get:

                         Docs
  Terms                   1 2
  bad good restaurent   0 1
  bad you are           1 0
  good restaurent nice  0 1
  good very bad         1 0
  nice place to         0 1
  place to visit        0 1
  restaurent nice place 0 1
  very bad good         0 1
  very bad you          1 0
  very good very        1 0
  you are great         1 0

I don't want every combination of words, only the phrases I showed in the matrix above.

Solution

One approach using the qdap + tm packages:

library(qdap); library(tm); library(qdapTools)

dat <- list2df(list(doc1 = "very good, very bad, you are great",
 doc2 = "very bad, good restaurent, nice place to visit"), "text", "docs")

x <- sub_holder(", ", dat$text)

m <- dtm(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs) )
weightTfIdf(m)

inspect(weightTfIdf(m))

## A document-term matrix (2 documents, 5 terms)
## 
## Non-/sparse entries: 4/6
## Sparsity           : 60%
## Maximal term length: 19 
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##       Terms
## Docs   good restaurent nice place to visit very bad very good you are great
##   doc1       0.0000000           0.0000000        0 0.3333333     0.3333333
##   doc2       0.3333333           0.3333333        0 0.0000000     0.0000000

You could also do it in one fell swoop and return a DocumentTermMatrix directly, though this may be harder to follow:

x <- sub_holder(", ", dat$text)

apply_as_tm(t(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs)), 
    weightTfIdf, to.qdap=FALSE)

Other tips

What if you just split on commas with strsplit and then turned each phrase into a single "word" by joining its words with some character? For example:

library(tm)
docs <- c(D1 = "very good, very bad, you are great", 
    D2 = "very bad, good restaurent, nice place to visit")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
    PlainTextDocument(
       gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
       id=ID(x)
     )
})
inspect(dd)

# A corpus with 2 text documents
# 
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# $D1
# very~good
# very~bad
# you~are~great
# 
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)

This produces:

# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
#   D1       0.0000000           0.0000000        0 0.3333333     0.3333333
#   D2       0.3333333           0.3333333        0 0.0000000     0.0000000
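Since the question also welcomes Python, here is a minimal standard-library sketch of the same split-on-commas idea. It assumes the normalized weighting that tm's weightTfIdf reports above: term frequency divided by the number of terms in the document, times log2(number of documents / number of documents containing the term). The helper names split_phrases and phrase_tfidf are made up for illustration and do not come from any library.

```python
import math
from collections import Counter

docs = {
    "doc1": "very good, very bad, you are great",
    "doc2": "very bad, good restaurent, nice place to visit",
}

def split_phrases(text):
    """Split on commas and trim whitespace; each phrase becomes one term."""
    return [p.strip() for p in text.split(",") if p.strip()]

def phrase_tfidf(docs):
    """Return (sorted terms, {doc name: {term: weight}}) with phrases as terms."""
    counts = {name: Counter(split_phrases(text)) for name, text in docs.items()}
    n_docs = len(counts)

    # Document frequency: in how many documents does each phrase occur?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    terms = sorted(df)
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())  # number of phrases in this document
        weights[name] = {
            t: (c[t] / total) * math.log2(n_docs / df[t]) for t in terms
        }
    return terms, weights

terms, weights = phrase_tfidf(docs)
for name in docs:
    print(name, [round(weights[name][t], 4) for t in terms])
```

Under these assumptions, "very bad" gets weight 0 because it occurs in both documents (log2(2/2) = 0), while a phrase unique to one document gets (1/3) * log2(2/1) = 0.3333, matching the matrix above.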

For anyone using text2vec, here is a rather handy solution based on a custom vocabulary:

library(text2vec)
doc1 <- 'very good, very bad, you are great'
doc2 <- 'very bad, good restaurent, nice place to visit'
docs <- list(doc1, doc2)
docs <- sapply(docs, strsplit, split=', ')
vocab <- vocab_vectorizer(create_vocabulary(unique(unlist(docs))))
dtm <- create_dtm(itoken(docs), vocab)
dtm

This gives:

2 x 5 sparse Matrix of class "dgCMatrix"
  very good very bad you are great good restaurent nice place to visit
1         1        1             1               .                   .
2         .        1             .               1                   1

This approach leaves more room to customize how files are loaded and how the vocabulary is prepared.
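To illustrate what a fixed, custom vocabulary gives you, here is a rough plain-Python sketch of the idea. It is not how text2vec works internally. The helper count_matrix is a made-up name, and the vocabulary is written out by hand as an assumption: columns follow the vocabulary order, and phrases outside the vocabulary are ignored.

```python
def count_matrix(docs, vocabulary):
    """Count comma-separated phrases per document against a fixed vocabulary.

    Columns follow the order of `vocabulary`; phrases that are not in the
    vocabulary are silently skipped.
    """
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for text in docs:
        row = [0] * len(vocabulary)
        for phrase in (p.strip() for p in text.split(",")):
            j = index.get(phrase)
            if j is not None:
                row[j] += 1
        rows.append(row)
    return rows

vocabulary = [
    "very good", "very bad", "you are great",
    "good restaurent", "nice place to visit",
]
docs = [
    "very good, very bad, you are great",
    "very bad, good restaurent, nice place to visit",
]

matrix = count_matrix(docs, vocabulary)
for row in matrix:
    print(row)
```

Because the vocabulary is supplied up front, you control both which phrases become columns and their order, independently of how the documents are read in.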

License: CC-BY-SA with attribution

Not affiliated with StackOverflow