Question

I am experimenting with Apache OpenNLP for one of my projects. My requirement is to detect nouns in email content and check them against our customer database (the DB consists of individual names, organization names, etc., and my search engine is Solr-based).

For normal English nouns, the default trained model works properly in most cases. The tricky requirement is that some of our business organizations have abbreviated names such as OK, LET, etc., so in some scenarios I need to treat OK, LET, etc. as nouns.

As an example: 1) "sending some items to LET, please expect delay in payment" 2) "let us go for a party"

In case #1 I want to treat LET as a noun; in case #2 "let" is not a noun.

If I can achieve this, I can significantly reduce the number of false-positive matches in my search engine.

Any help is highly appreciated.

Solution

Make a dictionary of the special nouns and perform dictionary-based extraction as a post-processing step. The dictionary-based extraction should take the distinction between lowercase and uppercase into account, in particular for those entries that are acronyms, so that "LET" matches the organization while "let" does not.

In terms of implementation of the dictionary lookup:

  • As long as the entities in question are single tokens (or each consist of at most a small, predefined number M of tokens), implementing the dictionary as a HashSet<String>, tokenising the text, and looking up each token (and each group of up to M consecutive tokens) in the set should work very well.

  • If you are dealing with very long entities, or if tokenization is a problem, the use of a search trie or finite state machine implementation of the dictionary is sensible.
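
The first option above could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name DictionaryMatcher, the entry "Acme Trading", and the naive regex tokeniser are assumptions made for the example (in practice you would probably reuse the tokens produced by your OpenNLP pipeline).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: case-sensitive dictionary look-up over token n-grams.
public class DictionaryMatcher {
    private final Set<String> entries; // dictionary entries, stored with their exact case
    private final int maxTokens;       // M: maximum number of tokens in one entry

    public DictionaryMatcher(Set<String> entries, int maxTokens) {
        this.entries = new HashSet<>(entries);
        this.maxTokens = maxTokens;
    }

    /** Returns every dictionary entry found in the text, ordered by start position. */
    public List<String> findMatches(String text) {
        // Naive tokenisation (an assumption): split on anything that is not a letter or digit.
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }

        List<String> matches = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // Try every group of 1..M consecutive tokens beginning at 'start'.
            for (int len = 1; len <= maxTokens && start + len <= tokens.size(); len++) {
                String candidate = String.join(" ", tokens.subList(start, start + len));
                if (entries.contains(candidate)) { // exact, case-sensitive comparison
                    matches.add(candidate);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("LET", "OK", "Acme Trading"));
        DictionaryMatcher matcher = new DictionaryMatcher(dict, 2);
        System.out.println(matcher.findMatches(
                "sending some items to LET, please expect delay in payment")); // [LET]
        System.out.println(matcher.findMatches("let us go for a party"));      // []
    }
}
```

Because the set stores entries with their exact case, "LET" in the first sentence matches while "let" in the second does not. Note that this sketch lets multi-token candidates span punctuation; whether that is acceptable depends on your data.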

Finally, as always with NLP, you will need to look at a significant sample of the results to identify any further problems. Depending on the level of ambiguity in your entity list, you may need to further refine the detection method by adding either a heuristic or a statistical / ML-based decision mechanism on top of the case-sensitive dictionary look-up.
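
As one purely illustrative example of such a heuristic (not something prescribed above): case stops being informative when an email is written mostly in capitals, so acronym matches could be rejected whenever too many tokens are upper case. The class name AcronymHeuristic and the 0.5 threshold are assumptions; a suitable threshold would have to come from inspecting your own data.

```java
import java.util.Locale;

// Hypothetical heuristic: distrust upper-case acronym matches in "shouted" text.
public class AcronymHeuristic {

    /** Fraction of alphabetic tokens that are written entirely in upper case. */
    static double upperCaseRatio(String text) {
        int alphabetic = 0;
        int upper = 0;
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            alphabetic++;
            if (token.equals(token.toUpperCase(Locale.ROOT))) {
                upper++;
            }
        }
        return alphabetic == 0 ? 0.0 : (double) upper / alphabetic;
    }

    /** Accept acronym matches only if at most maxRatio of the tokens are upper case. */
    static boolean acceptAcronyms(String text, double maxRatio) {
        return upperCaseRatio(text) <= maxRatio;
    }

    public static void main(String[] args) {
        // One upper-case token out of ten: case is still a useful signal.
        System.out.println(acceptAcronyms(
                "sending some items to LET, please expect delay in payment", 0.5)); // true
        // Everything is upper case: a match on LET or OK proves nothing.
        System.out.println(acceptAcronyms("OK, LET US KNOW ABOUT THE PAYMENT", 0.5)); // false
    }
}
```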

Licensed under: CC-BY-SA with attribution