Question

I am working on a document clustering problem, and to do so I need to compute the word frequency for each document in a dataset.

At the moment, I'm using a naive approach: I create a word table and add one column per document in the dataset, obtaining something like

word | document1 | document2 | ... | document n |

This approach, even if somewhat slow, works for small datasets (fewer than 100 documents). The problem is that now I must deal with huge ones, containing 700+ documents each, and I feel like there must be a smarter way to handle it; I just can't think of anything else.

So, the question is: how can I efficiently keep track of the word frequency per document?

PS: Consider that both the number of words per document and the dataset size are unknown, but a reasonable upper bound would be 2000 words per document, and 2000 documents per dataset.


Solution

I assume that you are actually interested in developing your own algorithms, and not in the FULL TEXT capabilities of databases such as MySQL, SQL Server, Oracle, and so on.

A term-document matrix -- the name I know for this data structure -- would be stored with two columns as keys: DocumentID and TermID.

You might have additional columns for count of the term in the document, location in the document, or whatever, but those two keys are the standard way.

These would typically link to tables for the documents and the terms. The document table would typically have the number of terms in the document, the location (or text itself), and other information. The term table would typically have the weight of the term, and perhaps other information, such as synonym lists, part of speech, and so on.

Then when you want to add a new document, you just process its terms and add them in. Adding a new term, on the other hand, requires reprocessing the historical documents for that term, but even that is still fairly easy.
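A minimal sketch of this layout in Python with SQLite; the table and column names, the whitespace tokenizer, and the sample documents are my own illustration, not something prescribed by the answer:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, name TEXT, term_count INTEGER);
CREATE TABLE terms     (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE);
CREATE TABLE doc_terms (                      -- the term-document matrix
    doc_id  INTEGER NOT NULL REFERENCES documents,
    term_id INTEGER NOT NULL REFERENCES terms,
    count   INTEGER NOT NULL,
    PRIMARY KEY (doc_id, term_id)
);
""")

def add_document(name, text):
    """Tokenize, register unseen terms, and store one row per (doc, term)."""
    freqs = Counter(text.lower().split())
    cur = conn.execute(
        "INSERT INTO documents (name, term_count) VALUES (?, ?)",
        (name, sum(freqs.values())))
    doc_id = cur.lastrowid
    for term, n in freqs.items():
        conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
        term_id = conn.execute(
            "SELECT term_id FROM terms WHERE term = ?", (term,)).fetchone()[0]
        conn.execute(
            "INSERT INTO doc_terms (doc_id, term_id, count) VALUES (?, ?, ?)",
            (doc_id, term_id, n))
    return doc_id

add_document("doc1", "the cat sat on the mat")
add_document("doc2", "the dog sat")

# Frequency of one term across all documents that contain it.
rows = conn.execute("""
    SELECT d.name, dt.count
    FROM doc_terms dt
    JOIN documents d USING (doc_id)
    JOIN terms t USING (term_id)
    WHERE t.term = 'the'
    ORDER BY d.name
""").fetchall()
print(rows)  # [('doc1', 2), ('doc2', 1)]
```

Note that the matrix stays sparse: a (doc_id, term_id) row exists only when the term actually occurs in the document, so 2000 documents of 2000 words each is at most a few million small rows, well within any database's comfort zone.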

Other tips

A more relational table design for this would look like this:

CREATE TABLE DOC_WORD_COUNTS
(
    DocID     INT         NOT NULL,
    Word      VARCHAR(20) NOT NULL,
    Frequency INT         NOT NULL
)

Then make (DocID, Word) the composite primary key. You will also need another table to store each document's information, keyed by DocID.
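One way rows in such a table could be maintained is with an upsert: insert the pair with Frequency 1, or bump the count when the primary key already exists. The sketch below uses SQLite's `ON CONFLICT ... DO UPDATE` syntax (also valid in PostgreSQL; MySQL and SQL Server spell the upsert differently), and the `count_word` helper is my own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE DOC_WORD_COUNTS (
    DocID     INT         NOT NULL,
    Word      VARCHAR(20) NOT NULL,
    Frequency INT         NOT NULL,
    PRIMARY KEY (DocID, Word)
)""")

def count_word(doc_id, word):
    # One statement per token: insert the pair, or bump its count if it exists.
    conn.execute("""
        INSERT INTO DOC_WORD_COUNTS (DocID, Word, Frequency) VALUES (?, ?, 1)
        ON CONFLICT (DocID, Word) DO UPDATE SET Frequency = Frequency + 1
    """, (doc_id, word))

for token in "to be or not to be".split():
    count_word(1, token)

freq = conn.execute(
    "SELECT Frequency FROM DOC_WORD_COUNTS WHERE DocID = 1 AND Word = 'to'"
).fetchone()[0]
print(freq)  # 2
```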

Create a data structure like this:

  • Document-Table: DocumentId (PK), DocumentName
  • Word-Table: WordId(PK), DocumentId(FK), WordName

That way you can run aggregate queries to report on the data.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow