Disambiguating Named Entities in Java

https://stackoverflow.com/questions/3007186

26-09-2019
|

Question

I have a list of strings (company names, in this case), and a Java program that extracts a list of things that look like company names out of mostly-unstructured text. I need to match each element of extracted text to a string in the list. Caveat: the unstructured text has typos, things like "Blah, Inc." referred to as "Blah," etc. I've tried Levenshtein Edit Distance, but that fails for predictable reasons. Are there known best-practices ways of tackling this problem? Or am I back to manual data-entry?

Solution

This is not a simple problem, and there are entire companies built around trying to solve it (even for reduced matching sets like company names versus the general case).

If you can identify a discrete number of patterns that valid company names fall into, and that noise does not fall into, then you could tackle this with a series of regular expression matches.

If the patterns are difficult or too numerous, then you could try developing a probabilistic model, perhaps something like a Bayesian network. You would take a subset of your data for training, and perhaps a second subset for a quick validation, and grow the network. Techniques might include genetic programming or setting up a neural network. This approach is obviously not lightweight, and you'd probably want to consider your need carefully before going down this road.

OTHER TIPS

You might want to have a look at Apache Stanbol, it plugs together NER engines (I think one is based on a gazetteer you supply) and linking engines to resolve your detected entities. I haven't used it myself and it's still in incubation, but might suit what you're looking for.

There is also a bit of research in this space in the TAC Knowledge Base Population track (Entity Linking). The task pops-up in different places and you should also have luck in conferences like ACL, EMNLP, SIGIR, etc (this list is by no means complete).

The TAC systems link to a subset of Wikipedia, which might help with your name variation since pages have "redirects", which are essentially aliases for a particular page.

For example, following pages redirect to "Apple Inc.", but you probably want to extract the redirects either from a raw Wikipedia dump or from a clean source like DBPedia or Freebase.

AAPL
Apple Company
Apple Computer
Apple Computer Co.
Apple Computer Inc.
Apple Computer Incorporated
Apple Computer, Inc
Apple Computer, Inc.
Apple Inc
Apple Incorporate
Apple Incorporated
Apple compputer
Apple computer Inc
Apple inc
Apple inc.
...

In the work we do at my company we deal with this type of problem all the time. The most successful efforts I have seen use just a few pages of Python code. Python is excellent for string dissection and analysis, and you can call a Python routine from your Java program. Like Greg said, the right answer is highly dependent on the quality of your unstructured text. A good way to start is to quantitatively characterize how it aligns with your golden text. ( E.g. You may find you can match 80% of it by just adding some common alternate match strings like "Blah" and "BLAH INC" instead of just "Blah Inc.")

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow