Limitations
If you refuse to use a dictionary you're algorithm will require a lot of computation. Above that, it is impossible to distinguish a keyword that occurs only once (e.g: "karl") from a crappy sequence (e.g: "e2bo"). My solution will be a best effort and will only work if your list of URL's contains keywords multiple times.
The basic idea
I assume a word is a sequence of characters that occur frequently of at least 3 characters. This prevents the letter "o" from being the most popular word.
The basic idea is the following.
- Count all n letter sequences and select the once that occur multiple times.
- Cut all sequences that are a part of a larger sequence.
- Order them by popularity and you have a solution that comes close to solving your problem. (Left as an exercise to the reader)
In code
import operator
sentences = ["davidbobmike1joe" , "mikejoe2bobkarl", "joemikebob", "bobjoe", "bobbyisawesome", "david", "bobbyjoe"];
dict = {}
def countWords(n):
"""Count all possible character sequences/words of length n occuring in all given sentences"""
for sentence in sentences:
countWordsSentence(sentence, n);
def countWordsSentence(sentence, n):
"""Count all possible character sequence/words of length n occuring in a sentence"""
for i in range(0,len(sentence)-n+1):
word = sentence[i:i+n]
if word not in dict:
dict[word] = 1;
else:
dict[word] = dict[word] +1;
def cropDictionary():
"""Removes all words that occur only once."""
for key in dict.keys():
if(dict[key]==1):
dict.pop(key);
def removePartials(word):
"""Removes all the partial occurences of a given word from the dictionary."""
for i in range(3,len(word)):
for j in range(0,len(word)-i+1):
for key in dict.keys():
if key==word[j:j+i] and dict[key]==dict[word]:
dict.pop(key);
def removeAllPartials():
"""Removes all partial words in the dictionary"""
for word in dict.keys():
removePartials(word);
for i in range(3,max(map(lambda x: len(x), sentences))):
countWords(i);
cropDictionary();
removeAllPartials();
print dict;
Output
>>> print dict;
{'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}
Some challenges to the reader
- Sort the dictionary by value before printing it. (Sort a Python dictionary by value)
- In this example "bob" occurs six times, 2 times it is a partial word of "bobby". Determine if this is problematic and fix it if necessary.
- Take capitalization into account.