Pergunta

I am using this gensim tutorial to find similarities between texts. Here is the code

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

'''
documents = ["Human machine interface for lab abc computer applications",
              "bags loose tea water second ingredient tastes water",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey",
              "red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
              "bags loose tea water second ingredient tastes water"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

#print corpus

tfidf = models.TfidfModel(corpus)

#print tfidf

corpus_tfidf = tfidf[corpus]

#print corpus_tfidf

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)

corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi

index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')

sims = index[vec_lsi]
#print list(enumerate(sims))

sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
  print documents[sim[0]], " ==> ", sim[1]

There are two documents here. One has 10 texts and another has 2. One is commented out. If I use the first documents list everything goes fine and generates meaningful output. If I use the second document list(having 2 texts) an error occured. Here is it

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name )

What is the reason behind this error and how can I fix it? I am using a 64bit machine.

Foi útil?

Solução

It could be caused by the fact that your second list will be [[], ['water']] by the time that you have removed singletons, trying to do matrix operations on matrices with dimensions of 0 and 1 could cause all sorts of issues.

Having a play with your code:

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpus
[[], [(0, 2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:23:31,415 : INFO : collecting document frequencies
2013-07-21 09:23:31,415 : INFO : PROGRESS: processing document #0
2013-07-21 09:23:31,415 : INFO : calculating IDF weights for 2 documents and 1 features (1 matrix non-zeros)
>>> corpus = [[(1,)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:16,452 : INFO : collecting document frequencies
2013-07-21 09:24:16,452 : INFO : PROGRESS: processing document #0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__
    self.initialize(corpus)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 119, in initialize
    for termid, _ in bow:
ValueError: need more than 1 value to unpack
>>> corpus = [[(1,3)], [(0,2)]]
>>> tfidf = models.TfidfModel(corpus)
2013-07-21 09:24:26,892 : INFO : collecting document frequencies
2013-07-21 09:24:26,892 : INFO : PROGRESS: processing document #0
2013-07-21 09:24:26,892 : INFO : calculating IDF weights for 2 documents and 2 features (2 matrix non-zeros)
>>> 

As I said above you need to ensure that corpus does not have any empty lists before calling models.TfidfModel(corpus) on it.

Outras dicas

That's not an error, it's a warning. You can ignore it.

Your query document doc is empty in the second case, which causes the warning. You still get the correct answer though anyway (=an empty vec_lsi).

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top