Question

My problem is the following. I have a long list of URLs such as:

www.foo.com/davidbobmike1joe
www.foo.com/mikejoe2bobkarl
www.foo.com/joemikebob
www.foo.com/bobjoe

I need to compare all the entries (URLs) in that list with each other, extract the keywords in the paths of those URLs (in this case: david, joe, bob, mike, karl), and order them by frequency. I've been reading about several libraries such as NLTK. However, the problem here is that there are no spaces to tokenise each word independently. Any recommendations on how to get the job done?


Solution

Limitations

If you refuse to use a dictionary, your algorithm will require a lot of computation. Moreover, it is impossible to distinguish a keyword that occurs only once (e.g. "karl") from a meaningless sequence (e.g. "e2bo"). My solution is a best effort and will only work if your list of URLs contains keywords multiple times.

The basic idea

I assume a word is a frequently occurring sequence of at least 3 characters. This prevents the letter "o" from being the most popular word.

The basic idea is the following.

  • Count all n-letter sequences and select the ones that occur multiple times.
  • Cut all sequences that are part of a larger sequence.
  • Order them by popularity and you have a solution that comes close to solving your problem. (Left as an exercise to the reader.)

In code

sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe",
             "bobbyisawesome", "david", "bobbyjoe"]
counts = {}

def countWords(n):
    """Count all possible character sequences/words of length n occurring in all given sentences."""
    for sentence in sentences:
        countWordsSentence(sentence, n)

def countWordsSentence(sentence, n):
    """Count all possible character sequences/words of length n occurring in a sentence."""
    for i in range(len(sentence) - n + 1):
        word = sentence[i:i + n]
        counts[word] = counts.get(word, 0) + 1

def cropDictionary():
    """Remove all words that occur only once."""
    for key in list(counts.keys()):  # copy the keys: we mutate while iterating
        if counts[key] == 1:
            counts.pop(key)

def removePartials(word):
    """Remove all partial occurrences of a given word from the dictionary."""
    for i in range(3, len(word)):
        for j in range(len(word) - i + 1):
            for key in list(counts.keys()):
                if key == word[j:j + i] and counts[key] == counts[word]:
                    counts.pop(key)

def removeAllPartials():
    """Remove all partial words in the dictionary."""
    for word in list(counts.keys()):
        if word in counts:  # the word may itself have been removed as a partial
            removePartials(word)

for i in range(3, max(len(s) for s in sentences)):
    countWords(i)
    cropDictionary()
    removeAllPartials()

print(counts)

Output

>>> print(counts)
{'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}

Some challenges to the reader

  • Sort the dictionary by value before printing it, as sketched after this list. (Sort a Python dictionary by value)
  • In this example "bob" occurs six times, but two of those occurrences are partials of "bobby". Determine whether this is problematic and fix it if necessary.
  • Take capitalization into account.
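
For the first challenge, here is a minimal sketch of sorting by descending frequency (assuming the counts dictionary produced by the code above):

import operator

# Sort (word, count) pairs by count, most frequent first.
ranked = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
print(ranked)  # e.g. [('bob', 6), ('joe', 5), ('mike', 3), ...]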

OTHER TIPS

Overview

You could use the approach from this question to extract the names, passing in a list of known keywords ([david, bob, etc.]):

Is there an easy way generate a probable list of words from an unspaced sentence in python?

And then use collections.Counter to get frequencies.

The code

from Bio import trie  # Bio.trie ships with older Biopython releases
import string
from collections import Counter


def get_trie(words):
    """Build a trie mapping each known word to its length."""
    tr = trie.trie()
    for word in words:
        tr[word] = len(word)
    return tr

def get_trie_word(tr, s):
    """Greedily match the longest known word at the start of s."""
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def get_trie_words(s):
    """Yield every known word found in s, left to right."""
    names = ['david', 'bob', 'karl', 'joe', 'mike']
    tr = get_trie(names)
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            s = s[1:]  # no known word starts here; skip a character to avoid looping forever
        else:
            yield word

def main(urls):
    d = Counter()
    for url in urls:
        url = "".join(a for a in url if a in string.ascii_lowercase)
        for word in get_trie_words(url):
            d[word] += 1
    return d

if __name__ == '__main__':
    urls = [
        "davidbobmike1joe",
        "mikejoe2bobkarl",
        "joemikebob",
        "bobjoe",
    ]
    print(main(urls))

Results

Counter({'bob': 4, 'joe': 4, 'mike': 3, 'karl': 1, 'david': 1})
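
Both answers operate on bare path segments rather than the full URLs from the question. As a minimal preprocessing sketch, a hypothetical path_segment helper (assuming the keywords live in the last path component, as in the question's examples) could extract the strings to feed into main:

from urllib.parse import urlparse

def path_segment(url):
    """Return the last path component; prepend // so urlparse sees a netloc."""
    if "//" not in url:
        url = "//" + url
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]

urls = ["www.foo.com/davidbobmike1joe", "www.foo.com/bobjoe"]
print([path_segment(u) for u in urls])  # ['davidbobmike1joe', 'bobjoe']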