Question

I am working on a document clustering problem, and to do so I need to compute the word frequency for each document in a dataset.

At the moment, I'm using a naive approach: I create a word table and add one column per document in the dataset, obtaining something like

word | document1 | document2 | ... | document n |

This approach, even if somewhat slow, works for small datasets (fewer than 100 documents). The problem is that now I must deal with huge ones, containing 700+ documents each, and I feel like there must be a smarter way to handle it; I just can't think of anything else.

So, the question is: how can I efficiently keep track of the word frequency per document?

PS: Consider that both the number of words per document and the dataset size are unknown, but a reasonable upper bound would be 2000 words per document, and 2000 documents per dataset.


Solution

I assume that you are actually interested in developing your own algorithms, and not in the FULL TEXT capabilities of databases such as MySQL, SQL Server, Oracle, and so on.

A term-document matrix -- the name I know for this data structure -- would be stored with two columns as keys: DocumentID and TermID.

You might have additional columns for count of the term in the document, location in the document, or whatever, but those two keys are the standard way.

These would typically link to tables for the documents and the terms. The document table would typically have the number of terms in the document, the location (or text itself), and other information. The term table would typically have the weight of the term, and perhaps other information, such as synonym lists, part of speech, and so on.

Then when you want to add a new document, you just process its terms and add them in. Adding a new term, on the other hand, requires reprocessing the historical documents for that term, but even that is still fairly easy.
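A minimal sketch of this layout in Python with SQLite; the table and column names, the whitespace tokenizer, and the sample documents are my own illustration, not something prescribed by the answer:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, name TEXT, term_count INTEGER);
CREATE TABLE terms     (term_id INTEGER PRIMARY KEY, term TEXT UNIQUE);
CREATE TABLE doc_terms (                      -- the term-document matrix
    doc_id  INTEGER NOT NULL REFERENCES documents,
    term_id INTEGER NOT NULL REFERENCES terms,
    count   INTEGER NOT NULL,
    PRIMARY KEY (doc_id, term_id)
);
""")

def add_document(name, text):
    """Tokenize, register unseen terms, and store one row per (doc, term)."""
    freqs = Counter(text.lower().split())
    cur = conn.execute(
        "INSERT INTO documents (name, term_count) VALUES (?, ?)",
        (name, sum(freqs.values())))
    doc_id = cur.lastrowid
    for term, n in freqs.items():
        conn.execute("INSERT OR IGNORE INTO terms (term) VALUES (?)", (term,))
        term_id = conn.execute(
            "SELECT term_id FROM terms WHERE term = ?", (term,)).fetchone()[0]
        conn.execute(
            "INSERT INTO doc_terms (doc_id, term_id, count) VALUES (?, ?, ?)",
            (doc_id, term_id, n))
    return doc_id

add_document("doc1", "the cat sat on the mat")
add_document("doc2", "the dog sat")

# Frequency of one term across all documents that contain it.
rows = conn.execute("""
    SELECT d.name, dt.count
    FROM doc_terms dt
    JOIN documents d USING (doc_id)
    JOIN terms t USING (term_id)
    WHERE t.term = 'the'
    ORDER BY d.name
""").fetchall()
print(rows)  # [('doc1', 2), ('doc2', 1)]
```

Note that the matrix stays sparse: a (doc_id, term_id) row exists only when the term actually occurs in the document, so 2000 documents of 2000 words each is at most a few million small rows, well within any database's comfort zone.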

Other tips

A more relational table design for this would look like this:

CREATE TABLE DOC_WORD_COUNTS
(
    DocID     INT         NOT NULL,
    Word      VARCHAR(20) NOT NULL,
    Frequency INT         NOT NULL
)

Then make (DocID, Word) the composite primary key. You will also need another table to store each document's information, keyed by DocID.
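One way rows in such a table could be maintained is with an upsert: insert the pair with Frequency 1, or bump the count when the primary key already exists. The sketch below uses SQLite's `ON CONFLICT ... DO UPDATE` syntax (also valid in PostgreSQL; MySQL and SQL Server spell the upsert differently), and the `count_word` helper is my own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE DOC_WORD_COUNTS (
    DocID     INT         NOT NULL,
    Word      VARCHAR(20) NOT NULL,
    Frequency INT         NOT NULL,
    PRIMARY KEY (DocID, Word)
)""")

def count_word(doc_id, word):
    # One statement per token: insert the pair, or bump its count if it exists.
    conn.execute("""
        INSERT INTO DOC_WORD_COUNTS (DocID, Word, Frequency) VALUES (?, ?, 1)
        ON CONFLICT (DocID, Word) DO UPDATE SET Frequency = Frequency + 1
    """, (doc_id, word))

for token in "to be or not to be".split():
    count_word(1, token)

freq = conn.execute(
    "SELECT Frequency FROM DOC_WORD_COUNTS WHERE DocID = 1 AND Word = 'to'"
).fetchone()[0]
print(freq)  # 2
```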

Create a data structure like this:

  • Document-Table: DocumentId (PK), DocumentName
  • Word-Table: WordId(PK), DocumentId(FK), WordName

That way you can run aggregate queries to report on the data.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow