Given a list of dozens of words, how do I find the best matching sections from a corpus of hundreds of texts?

StackOverflow https://stackoverflow.com/questions/15744640

Question

Let’s say I have a list of 250 words. The list may consist of unique entries throughout, of a set of words in all their grammatical forms, or of all sorts of words in one particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections of perhaps 150 words each (I may want to determine these sections dynamically in the future, but I shall leave that for now).

My question is this: what is a useful way to find the sections of the corpus that contain the most of my 250 words?

I have looked at a few full-text search engines, such as Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.

The use case for such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am simply curious to know how to do this.


Solution

Effectively, what I am looking for is document comparison. I have found a way to rank texts by their similarity to a given document in PostgreSQL.
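The answer does not show the actual query, so here is a minimal sketch of one way such ranking can be done with PostgreSQL's built-in full-text search. The sections table, column names, and sample words are illustrative assumptions, not the answerer's code, and only a handful of the 250 words are shown:

    -- Illustrative schema: one row per roughly 150-word section of the corpus.
    -- Table and column names are assumptions for this sketch.
    CREATE TABLE sections (
        id   serial PRIMARY KEY,
        body text NOT NULL
    );

    -- A GIN index over the tsvector form of each section keeps matching fast.
    CREATE INDEX sections_body_fts
        ON sections USING gin (to_tsvector('english', body));

    -- OR the word list together so a section matches on every word it contains,
    -- then let ts_rank() order the sections by how well they cover the list.
    SELECT id,
           ts_rank(to_tsvector('english', body), query) AS score
    FROM sections,
         to_tsquery('english', 'walk | spoke | eaten | ran') AS query
    WHERE to_tsvector('english', body) @@ query
    ORDER BY score DESC
    LIMIT 20;

A convenient side effect for the grammatical-forms case: to_tsvector and to_tsquery stem their input, so "walked" and "walking" normalize to the same lexeme as "walk". One caveat: ts_rank also rewards repeated hits on the same word, so if "most of my 250 words" should mean coverage of distinct words, counting matched lexemes per section would be a closer fit.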
