English word segmentation in NLP?

Question 1

Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.

I haven't gone through this entire tutorial, but this should get you started. They even give urls as a potential use case.

http://jeremykun.com/2012/01/15/word-segmentation/

Question 2

The Python wordsegment module can do this. It's an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.

Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.

Installation is easy with pip:

$ pip install wordsegment

Simply call segment to get a list of words:

>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']

As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.

>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']

I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.