Question

I'm using Python to parse URLs into words. I am having some success, but I am trying to cut down on ambiguity. For example, I am given the following URL:

"abbeycarsuk.com"

and my algorithm outputs:

['abbey','car','suk'],['abbey','cars','uk']

Clearly the second parsing is the correct one, but the first one is also technically just as correct (apparently 'suk' is a word in the dictionary that I am using).

What would help me out a lot is a wordlist that also contains the frequency/popularity of each word. I could work this into my algorithm, and then the second parsing would be chosen (since 'uk' is obviously more common than 'suk'). Does anyone know where I could find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to use it successfully.
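To make the idea concrete, here is roughly how I plan to use the frequencies once I have them. The counts below are placeholders, and `freq` would really be loaded from whatever list I end up with:

    import math

    # Illustrative counts only; in practice these come from a frequency list.
    freq = {'abbey': 12000, 'car': 890000, 'cars': 310000, 'uk': 2400000, 'suk': 40}
    total = sum(freq.values())

    def score(segmentation):
        """Sum of log-probabilities; unseen words get a tiny pseudo-count."""
        return sum(math.log(freq.get(word, 0.5) / total) for word in segmentation)

    candidates = [['abbey', 'car', 'suk'], ['abbey', 'cars', 'uk']]
    best = max(candidates, key=score)
    # With the counts above, best is ['abbey', 'cars', 'uk']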

Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself; however, if such a data set already exists, it would make my life a lot easier.
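If I do go the corpus route, I imagine the counting itself is only a few lines (the file name here is just a placeholder):

    import re
    from collections import Counter

    def build_frequency_table(path):
        """Count lowercase word occurrences in a plain-text corpus file."""
        with open(path, encoding='utf-8') as f:
            words = re.findall(r'[a-z]+', f.read().lower())
        return Counter(words)

    # freq = build_frequency_table('gutenberg_dump.txt')  # placeholder path
    # freq.most_common(10)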


Solution

There is an extensive article on this very subject written by Peter Norvig (Google's head of research), which contains worked examples in Python and is fairly easy to understand. The article, along with the data used in the sample programs (some excerpts of Google ngram data), can be found here. The complete set of Google ngrams, for several languages, can be found here (free to download if you live in the east of the US).
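The heart of that chapter is a memoized recursive segmenter driven by unigram counts. Here is a condensed sketch in that spirit (not Norvig's exact code; the count file name matches the data distributed with the article, and the unknown-word penalty is a simplification):

    import math
    from functools import lru_cache

    # Unigram counts, e.g. from the count_1w.txt file distributed with the
    # article (one "word<TAB>count" entry per line).
    COUNTS = {}
    with open('count_1w.txt', encoding='utf-8') as f:
        for line in f:
            word, count = line.split('\t')
            COUNTS[word] = int(count)
    TOTAL = sum(COUNTS.values())

    def pword(word):
        """Unigram probability, with a crude length-based penalty for unseen words."""
        if word in COUNTS:
            return COUNTS[word] / TOTAL
        return 10.0 / (TOTAL * 10 ** len(word))

    @lru_cache(maxsize=None)
    def segment(text):
        """Return the most probable segmentation of text as a tuple of words."""
        if not text:
            return ()
        candidates = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
        return max(candidates, key=lambda words: sum(math.log(pword(w)) for w in words))

    # segment('abbeycarsuk')  ->  ('abbey', 'cars', 'uk') with typical web counts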

OTHER TIPS

As you mention, "corpus" is the keyword to search for.

For example, here is a nice list of resources:

http://www-nlp.stanford.edu/links/statnlp.html

(scroll down)

http://ucrel.lancs.ac.uk/bncfreq/flists.html

This is perhaps the list you want. If needed, you could cut it down in size to improve performance.
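If you do trim it, something along these lines keeps only the top entries; the column layout is an assumption, so adjust the parsing to match the file you actually download:

    def top_n_words(path, n=50000):
        """Keep the n most frequent entries from a 'word  frequency' style list."""
        entries = []
        with open(path, encoding='utf-8') as f:
            for line in f:
                parts = line.split()
                # Assumed layout: word in the first column, a numeric frequency in the second.
                if len(parts) >= 2 and parts[1].replace('.', '', 1).isdigit():
                    entries.append((parts[0].lower(), float(parts[1])))
        entries.sort(key=lambda pair: pair[1], reverse=True)
        return dict(entries[:n])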

Here is a nice big list for you. More information is available here.

Have it search using a smaller dictionary first; a smaller dictionary will tend to contain only the more commonly used words. Then, if that fails, you could fall back to your more complete dictionary that includes words like 'suk'.

You would then be able to skip word frequency analysis entirely, although you would add some overhead by maintaining a second, smaller dictionary.

You might be able to use Will's link, posted in the comments, as the small dictionary.
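A rough sketch of that fallback, assuming you already have a `segment_with(text, dictionary)` function that returns candidate parses (or an empty list when no complete parse exists); both names are illustrative:

    def parse_domain(text, small_dict, full_dict, segment_with):
        """Try the small, common-words dictionary first; only fall back to the
        full dictionary (the one containing rarities like 'suk') if that fails."""
        candidates = segment_with(text, small_dict)
        if candidates:
            return candidates
        return segment_with(text, full_dict)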

Edit: Also, the link you provided does indeed offer a free download of a list of the top 5,000 most frequently used words.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow