Question

What are the recommended methods for extracting locations from free text?

One approach I can think of is using regex rules like "words ... in location". But are there better approaches than this?

I can also think of keeping a lookup hash table with country and city names and then comparing every token extracted from the text against that table.

Does anybody know of better approaches?

Edit: I'm trying to extract locations from tweet text, so the high volume of tweets may also affect my choice of method.

Solution

All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)

This problem is called Named Entity Recognition. Location is one of the three most studied classes (along with Person and Organization). Stanford NLP has an open-source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml

You can easily find implementations in other programming languages.
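
For instance, here is a minimal sketch in Python using spaCy as one such implementation (the model name en_core_web_sm is only an assumption about what you have installed; Stanford's tagger can be driven in much the same way through its own bindings):

    import spacy

    # Assumes the small English model is installed, e.g.
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def extract_locations(text):
        """Return entity spans labelled as geo-political entities or locations."""
        doc = nlp(text)
        return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

    print(extract_locations("I moved from Paris to New York last spring."))
    # e.g. ['Paris', 'New York'] -- exact output depends on the model version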

OTHER TIPS

Put all of your valid locations into a sorted list. If you plan to compare case-insensitively, make sure the case of the list entries is already normalized.

Then all you have to do is loop over the individual "words" in your input text and, at the start of each new word, begin a new binary search in your location list. As soon as you hit a non-match, you can skip the entire word and proceed to the next one.
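
A minimal sketch of that single-word lookup in Python, using the standard bisect module (the gazetteer here is a tiny hypothetical example, already normalized and sorted):

    import bisect

    # Hypothetical, lower-cased and sorted gazetteer.
    LOCATIONS = sorted(["berlin", "london", "new york", "paris"])

    def is_location(word):
        """Binary search for an exact match of one normalized word."""
        i = bisect.bisect_left(LOCATIONS, word)
        return i < len(LOCATIONS) and LOCATIONS[i] == word

    print(is_location("paris"))   # True
    print(is_location("banana"))  # False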

Possible problem: multi-word locations such as "New York", "3rd Street", or "People's Republic of China". Perhaps all it takes, though, is to save the position of the first word whenever the binary search leads you to a (possible!) multi-word result. Then, if the full comparison fails (possibly several words later), all you have to do is revert to the word immediately after the one where you started.

As to what a "word" is: while preparing your location list, build a set of all characters that may appear inside location names. Only sequences made up entirely of characters from this set can be considered a valid 'word'.
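
Putting those pieces together, here is a hedged sketch in Python of the whole scheme: a sorted gazetteer, prefix checks via binary search to spot possible multi-word matches, and a simple character-class tokenizer. The gazetteer entries and the allowed character set are assumptions for illustration only:

    import bisect
    import re

    # Hypothetical gazetteer of normalized (lower-cased) location names.
    LOCATIONS = sorted(["3rd street", "new york", "new york city",
                        "paris", "people's republic of china"])

    # Hypothetical set of characters allowed inside location words.
    WORD_RE = re.compile(r"[a-z0-9'.-]+")

    def has_prefix(phrase):
        """True if some gazetteer entry starts with `phrase` (a possible multi-word match)."""
        i = bisect.bisect_left(LOCATIONS, phrase)
        return i < len(LOCATIONS) and LOCATIONS[i].startswith(phrase)

    def is_location(phrase):
        """Exact binary-search membership test."""
        i = bisect.bisect_left(LOCATIONS, phrase)
        return i < len(LOCATIONS) and LOCATIONS[i] == phrase

    def extract_locations(text):
        """Greedy longest match; on failure, resume at the word after the saved start."""
        words = WORD_RE.findall(text.lower())
        found, i = [], 0
        while i < len(words):
            phrase, best, j = "", None, i
            while j < len(words):
                phrase = words[j] if j == i else phrase + " " + words[j]
                if not has_prefix(phrase):
                    break
                if is_location(phrase):
                    best = (phrase, j)
                j += 1
            if best:
                found.append(best[0])
                i = best[1] + 1   # continue after the matched phrase
            else:
                i += 1            # revert: skip the word that started the failed attempt
        return found

    print(extract_locations("Meet me in New York City near 3rd Street"))
    # -> ['new york city', '3rd street']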

How fast are the tweets coming in? That is, are you consuming the full Twitter firehose or the results of some filtering queries? A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer. Very few NLP tools will keep up with Twitter rates, and very few handle tweets well because of all the leetspeak. The NLP side can be tuned for precision or recall, depending on your needs, which limits the number of lookups performed against the gazetteer. I recommend looking at Rosoka (also available as Rosoka Cloud through Amazon AWS) and GeoGravy.
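
As a rough illustration of how NER output and a gazetteer can be combined, and how the combination can be tilted toward precision or recall, here is a hedged sketch (spaCy stands in for the NLP tool; Rosoka and GeoGravy have their own APIs, which are not shown, and the gazetteer is a tiny hypothetical set):

    import spacy

    nlp = spacy.load("en_core_web_sm")           # stand-in NLP tool
    GAZETTEER = {"new york", "london", "cairo"}  # hypothetical gazetteer

    def tweet_locations(tweet, high_precision=True):
        """Combine NER entities with plain gazetteer string hits."""
        ents = {e.text.lower() for e in nlp(tweet).ents
                if e.label_ in ("GPE", "LOC")}
        gaz_hits = {g for g in GAZETTEER if g in tweet.lower()}
        # Intersection keeps only NER results the gazetteer confirms (precision);
        # union also keeps string hits the model missed in noisy tweets (recall).
        return ents & gaz_hits if high_precision else ents | gaz_hits

    print(tweet_locations("Just landed in New York!"))
    print(tweet_locations("omg new york is freezing rn", high_precision=False))
    # Exact output depends on the model; the second call relies on the gazetteer
    # because NER models often miss lower-cased, slang-heavy tweet text.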

Licensed under: CC-BY-SA with attribution