Tokenizing places like New York

Question 1

Your tokenizer is behaving correctly. New and York are two different tokens. What you want to do is something called chunking. Here is some information about chunking to give you some background.

Depending on which NLP library you are using, there is probably some functionality built in for chunking. For OpenNLP, which you included in your question tags, see this related question: How to extract the noun phrases using Open nlp's chunking parser

Question 2

To match place names made of one or two tokens, you need a nested (two-level) set of some sort:

Single tokens (Washington, Miami).

Potential prefix tokens, which are only part of a place name when a specific token follows them:

New (Haven, York)

San (Francisco).

Essentially, you match on the single tokens first. Then, if a token is a known prefix, you let it determine how the following token is parsed.

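One way to do it is to use a HashMap<String, HashSet<String>>, where each key is a prefix token and its value is the set of tokens that may follow it:

import java.util.HashMap;
import java.util.HashSet;

// Second tokens that can follow the prefix "New"
HashSet<String> hs = new HashSet<>();
hs.add("Haven");
hs.add("York");

// Map each prefix token to its set of valid second tokens
HashMap<String, HashSet<String>> hm = new HashMap<>();
hm.put("New", hs);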

When a token matches a key in hm, use the corresponding set to check the next token. Don't forget that it could be a false match: a prefix such as New is not always followed by a place name.
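
To make the lookup order concrete, here is a minimal sketch of the whole matching loop. It assumes the input is already split into tokens; the class name PlaceMatcher, the method findPlaces, and the tiny hard-coded place lists are made up for illustration and are not part of any library. A real list of places would be loaded from a file or database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PlaceMatcher {
    // Hypothetical sample data: place names that consist of a single token.
    static final Set<String> SINGLE =
            new HashSet<>(Arrays.asList("Washington", "Miami"));

    // Hypothetical sample data: prefix token -> allowed second tokens.
    static final Map<String, Set<String>> PREFIX = new HashMap<>();
    static {
        PREFIX.put("New", new HashSet<>(Arrays.asList("Haven", "York")));
        PREFIX.put("San", new HashSet<>(Arrays.asList("Francisco")));
    }

    // Returns the place names found in the token sequence, in order.
    public static List<String> findPlaces(String[] tokens) {
        List<String> places = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String tok = tokens[i];

            // 1. Try the single-token places first.
            if (SINGLE.contains(tok)) {
                places.add(tok);
                continue;
            }

            // 2. Otherwise, check whether the token is a known prefix
            //    and the next token is one of its allowed followers.
            Set<String> seconds = PREFIX.get(tok);
            if (seconds != null
                    && i + 1 < tokens.length
                    && seconds.contains(tokens[i + 1])) {
                places.add(tok + " " + tokens[i + 1]);
                i++; // the second token has been consumed
            }
            // A prefix without a valid follower ("New plane") is a
            // false match and is simply skipped.
        }
        return places;
    }

    public static void main(String[] args) {
        String[] tokens =
                "I flew from New York to Miami in a New plane".split(" ");
        System.out.println(findPlaces(tokens)); // prints [New York, Miami]
    }
}
```

Note that the second "New" in the sample sentence is ignored because "plane" is not in its follower set, which is the false-match case mentioned above.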