Question

I have 20,000 text files loaded into a PostgreSQL database, one file per row, all stored in a table named docs with columns doc_id and doc_content.

I know there are approximately 8 types of documents. Here are my questions:

  • How can I find these groups?
  • Are there similarity or dissimilarity measures I can use?
  • Is there an implementation of longest common substring in PostgreSQL?
  • Are there any extensions for text mining in PostgreSQL? (I've found only Tsearch, but it seems to have been last updated in 2007.)

I could probably use something like LIKE '%%' or SIMILAR TO, but there might be a better approach.

Solution

You should use full text search, which is part of the PostgreSQL 9.x core (the integrated successor to the Tsearch2 module).
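As a minimal sketch (the index name, the 'english' configuration, and the search terms are illustrative assumptions, not from the question):

    -- Index doc_content for fast full text search
    -- (an expression index using the built-in GIN access method).
    CREATE INDEX docs_fts_idx
        ON docs
        USING gin (to_tsvector('english', doc_content));

    -- Find documents containing both words; the terms are made up here.
    SELECT doc_id
    FROM   docs
    WHERE  to_tsvector('english', doc_content)
           @@ to_tsquery('english', 'invoice & total');

Once documents are reduced to tsvectors, queries like this let you probe for keywords that distinguish your 8 document types.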

For some kind of measure of longest common substring (or similarity, if you will), you might be able to use the levenshtein() function, which is part of the fuzzystrmatch extension.
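A minimal sketch, assuming the contrib modules are installed. Note that levenshtein() caps both arguments at 255 characters, so for whole documents you have to compare excerpts; the 200-character prefix below is an illustrative assumption:

    -- fuzzystrmatch ships with the PostgreSQL contrib modules.
    CREATE EXTENSION fuzzystrmatch;

    -- Edit distance between two short strings; returns 3.
    SELECT levenshtein('kitten', 'sitting');

    -- Pairwise distance on document prefixes. This join is O(n^2),
    -- so over 20,000 rows you would run it on a sample.
    SELECT a.doc_id, b.doc_id,
           levenshtein(left(a.doc_content, 200),
                       left(b.doc_content, 200)) AS distance
    FROM   docs a
    JOIN   docs b ON a.doc_id < b.doc_id
    ORDER  BY distance
    LIMIT  20;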

OTHER TIPS

  1. You can use a clustering technique such as K-Means or Hierarchical Clustering.

  2. Yes, you can use cosine similarity between documents, computed over binary term occurrence, raw term counts, term frequencies, or TF-IDF weights (see the sketch after this list).

  3. I don't know about that one.

  4. Not sure, but you could use R or RapidMiner to do the data mining against your database.
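For tip 2 (and as input to the clustering in tip 1), here is a minimal SQL sketch of pairwise cosine similarity over raw term counts. The '\W+' tokenization and the full pairwise join are illustrative assumptions; the join is O(n²), so over 20,000 documents you would run it on a sample or export the term vectors to R or RapidMiner as in tip 4.

    WITH terms AS (
        -- Raw term counts per document; lowercase and split on
        -- non-word characters (a crude, illustrative tokenizer).
        SELECT doc_id, term, count(*) AS tf
        FROM  (SELECT doc_id,
                      regexp_split_to_table(lower(doc_content), '\W+') AS term
               FROM docs) tokens
        WHERE term <> ''
        GROUP BY doc_id, term
    ),
    norms AS (
        -- Euclidean norm of each document's term-count vector.
        SELECT doc_id, sqrt(sum(tf * tf)) AS norm
        FROM terms
        GROUP BY doc_id
    )
    SELECT a.doc_id AS doc_a,
           b.doc_id AS doc_b,
           sum(a.tf * b.tf) / (na.norm * nb.norm) AS cosine
    FROM   terms a
    JOIN   terms b  ON b.term = a.term AND a.doc_id < b.doc_id
    JOIN   norms na ON na.doc_id = a.doc_id
    JOIN   norms nb ON nb.doc_id = b.doc_id
    GROUP  BY a.doc_id, b.doc_id, na.norm, nb.norm
    ORDER  BY cosine DESC;

The resulting similarity matrix can be fed to hierarchical clustering, which only needs pairwise similarities; K-Means would instead need the term vectors themselves.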
