Question

What is the best way to parse large texts (5,000 words or more) to search for names that are stored in a database? The texts will be multilingual.

My first idea is a rather naive approach: take all words beginning with a capital letter and compare them against the database. But this fails for texts written entirely in lowercase.
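For illustration, the naive approach could look roughly like the sketch below. The function name `find_capitalized_names` and the in-memory set `known_names` (standing in for the database lookup) are assumptions made for the example, not part of any real setup.

```python
import re


def find_capitalized_names(text, known_names):
    """Return the capitalized words in text that appear in known_names.

    known_names is a plain set here; a real lookup would query the database.
    """
    # In Python 3, \w+ on a str pattern also matches non-ASCII letters.
    words = re.findall(r"\w+", text)
    return [w for w in words if w[0].isupper() and w in known_names]
```

Calling it with `"Yesterday Anna met berta in Berlin"` and the names `{"Anna", "Berta", "Berlin"}` finds `Anna` and `Berlin` but misses `berta`, which is exactly the lowercase problem.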

Edit: The texts are not static but dynamic (e.g. web sites).

Best

Macs


Solution

OTHER TIPS

You can use the Aho-Corasick algorithm: build an automaton (a dictionary) from the names you are trying to match, then scan the text once. The scan is linear in the length of the text plus the number of matches, after a one-time build step that is linear in the total length of the names.
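A minimal sketch of how this might look in plain Python, without any library. The function names `build_automaton` and `find_names` are made up for this example, and the names are assumed to be already loaded from the database into a list of non-empty strings.

```python
from collections import deque


def build_automaton(names):
    """Build an Aho-Corasick automaton: a trie plus failure links.

    Nodes are list indices. goto[n] maps a character to a child node,
    fail[n] is the failure link, out[n] lists the names ending at n.
    """
    goto, fail, out = [{}], [0], [[]]
    for name in names:
        node = 0
        for ch in name:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(name)

    # Breadth-first pass: a node's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit names that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_names(text, automaton):
    """Scan text once and return (start_index, name) for every match."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for name in out[node]:
            matches.append((i - len(name) + 1, name))
    return matches
```

To handle texts that are all lowercase, one option is to case-fold both sides, e.g. `find_names(text.casefold(), build_automaton(n.casefold() for n in names))`; the reported offsets then refer to the folded text. Note that this sketch matches substrings, so a name like "Ann" is also found inside "Hannah"; if only whole words should count, check the characters around each match.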

You will need a dictionary of names.

Or you can try OpenCalais (http://www.opencalais.com/), which knows quite a large collection of names.

I made a method for replacing multiple strings in a large text here: A better way to replace many strings - obfuscation in C#. Perhaps you can use the same principle.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow