Question

I have been using an NLP tokenizer, but I am not sure about its behavior when it comes to place names. If I give it New York or Mexico City, the tokenizer splits the name into separate tokens, e.g. New and York.

However, I want New York to stay as a single token. Is there a tokenizer that does this, and if not, how can I achieve this result?

Thanks

Solution

Your tokenizer is behaving correctly. New and York are two different tokens. What you want to do is something called chunking. Here is some information about chunking to give you some background.

Depending on which NLP library you are using, there is probably some functionality built in for chunking. For OpenNLP, which you included in your question tags, see this related question: How to extract the noun phrases using Open nlp's chunking parser

OTHER TIPS

To match place names made of one or two tokens, you need a two-level lookup of some sort.

Single tokens (Washington, Miami).

Prefix tokens that may be followed by a second token:

New (Haven, York)

San (Francisco).

Essentially, you match on the single tokens first. When a token is a known prefix, it changes how the next token is parsed.

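One way to do it is with a HashMap<String, HashSet<String>> hm that maps each prefix token to the set of tokens that may follow it, like

HashSet<String> hs = new HashSet<>();
hs.add("Haven");
hs.add("York");

HashMap<String, HashSet<String>> hm = new HashMap<>();
hm.put("New", hs);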

and when a token matches a key in hm's keySet, use the corresponding set to check the next token. Don't forget that it could still be a false match!
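
The lookup described above can be sketched as a small post-processing pass over the tokenizer's output. This is only a minimal illustration, not a replacement for a real chunker or named-entity recognizer: the class name PlaceMerger, the method mergePlaces, and the tiny hard-coded place list are all assumptions made for the example, and the input is assumed to be an already tokenized list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PlaceMerger {
    // Hypothetical lookup table: prefix token -> tokens that may follow it.
    private static final Map<String, Set<String>> PREFIXES = new HashMap<>();
    static {
        PREFIXES.put("New", new HashSet<>(Arrays.asList("Haven", "York")));
        PREFIXES.put("San", new HashSet<>(Arrays.asList("Francisco")));
    }

    // Joins a prefix token with the next token when the pair is in the table.
    public static List<String> mergePlaces(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            Set<String> followers = PREFIXES.get(current);
            boolean hasNext = i + 1 < tokens.size();
            if (followers != null && hasNext && followers.contains(tokens.get(i + 1))) {
                merged.add(current + " " + tokens.get(i + 1));
                i++; // the second token has been consumed
            } else {
                // Not a known pair: keep the token unchanged.
                merged.add(current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "flew", "from", "New", "York", "to", "Miami");
        System.out.println(mergePlaces(tokens)); // [I, flew, from, New York, to, Miami]
    }
}
```

A pair is merged only when both tokens are in the table, so "New car" stays as two tokens. A pure lookup cannot tell a place from another use of the same words, so false matches remain possible.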

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow