Question

I have a list of strings (company names, in this case), and a Java program that extracts a list of things that look like company names out of mostly-unstructured text. I need to match each element of extracted text to a string in the list. Caveat: the unstructured text has typos, things like "Blah, Inc." referred to as "Blah," etc. I've tried Levenshtein Edit Distance, but that fails for predictable reasons. Are there known best-practices ways of tackling this problem? Or am I back to manual data-entry?

Was it helpful?

Solution

This is not a simple problem, and there are entire companies built around trying to solve it (even for reduced matching sets like company names versus the general case).

If you can identify a discrete number of patterns that valid company names fall into, and that noise does not fall into, then you could tackle this with a series of regular expression matches.

If the patterns are difficult or too numerous, then you could try developing a probabilistic model, perhaps something like a Bayesian network. You would take a subset of your data for training, and perhaps a second subset for a quick validation, and grow the network. Techniques might include genetic programming or setting up a neural network. This approach is obviously not lightweight, and you'd probably want to consider your need carefully before going down this road.

OTHER TIPS

You might want to have a look at Apache Stanbol, it plugs together NER engines (I think one is based on a gazetteer you supply) and linking engines to resolve your detected entities. I haven't used it myself and it's still in incubation, but might suit what you're looking for.

There is also a bit of research in this space in the TAC Knowledge Base Population track (Entity Linking). The task pops-up in different places and you should also have luck in conferences like ACL, EMNLP, SIGIR, etc (this list is by no means complete).

The TAC systems link to a subset of Wikipedia, which might help with your name variation since pages have "redirects", which are essentially aliases for a particular page.

For example, following pages redirect to "Apple Inc.", but you probably want to extract the redirects either from a raw Wikipedia dump or from a clean source like DBPedia or Freebase.

  • AAPL
  • Apple Company
  • Apple Computer
  • Apple Computer Co.
  • Apple Computer Inc.
  • Apple Computer Incorporated
  • Apple Computer, Inc
  • Apple Computer, Inc.
  • Apple Inc
  • Apple Incorporate
  • Apple Incorporated
  • Apple compputer
  • Apple computer Inc
  • Apple inc
  • Apple inc.
  • ...

In the work we do at my company we deal with this type of problem all the time. The most successful efforts I have seen use just a few pages of Python code. Python is excellent for string dissection and analysis, and you can call a Python routine from your Java program. Like Greg said, the right answer is highly dependent on the quality of your unstructured text. A good way to start is to quantitatively characterize how it aligns with your golden text. ( E.g. You may find you can match 80% of it by just adding some common alternate match strings like "Blah" and "BLAH INC" instead of just "Blah Inc.")

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top