Question

I have a huge list (200,000) of multi-word strings. I want to group these strings based on common runs of words shared among them, but I can't think of a low-computation-time algorithm for this. For example:

"AB 500"
"Bus AB 500"
"News CA"
"News CA BLAH"

My plan was:
a. Tokenize them into words.
b. Create a global array of tokens.
c. Compare the strings against the common tokens.

As you might guess, this does not help. Can you suggest an algorithm for this? I am writing this in Python.

Solution

200,000 is not that much; you can do this:

  1. Split each string into tokens, e.g. "News CA BLAH" -> ["News", "CA", "BLAH"]
  2. Create a dict entry for every contiguous run of tokens, of every length, e.g. for ["News", "CA", "BLAH"] that is ('News',), ('CA',), ('BLAH',), ('News', 'CA'), ('CA', 'BLAH') and ('News', 'CA', 'BLAH')
  3. Now just loop through the dict and report the groups

example code:

data="""AB 500
Bus AB 500
News CA
News CA BLAH"""

def getCombinations(tokens):
    # yield every contiguous run of tokens, of every length, e.g.
    # ["News", "CA"] -> ('News',), ('CA',), ('News', 'CA')
    count = len(tokens)
    for L in range(1, count + 1):
        for i in range(count - L + 1):
            yield tuple(tokens[i:i + L])

groupDict = {}
for s in data.split("\n"):
    tokens = s.split()
    for groupKey in getCombinations(tokens):
        # every string containing this run of tokens lands in the same group
        if groupKey not in groupDict:
            groupDict[groupKey] = [s]
        else:
            groupDict[groupKey].append(s)

for group, values in groupDict.items():
    if len(values) > 1:
        print(group, "->", values)

it outputs:

('AB',) -> ['AB 500', 'Bus AB 500']
('500',) -> ['AB 500', 'Bus AB 500']
('AB', '500') -> ['AB 500', 'Bus AB 500']
('News',) -> ['News CA', 'News CA BLAH']
('CA',) -> ['News CA', 'News CA BLAH']
('News', 'CA') -> ['News CA', 'News CA BLAH']
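
One caveat: a string of k tokens produces k(k+1)/2 keys, so with 200,000 entries (and long strings) groupDict can get large. If only short shared runs matter for your grouping, you could cap the run length with a hypothetical max_len parameter (my addition, not part of the answer above):

def getCombinations(tokens, max_len=3):
    # as above, but only yield runs of up to max_len tokens, bounding
    # the number of keys per string at roughly count * max_len
    count = len(tokens)
    for L in range(1, min(max_len, count) + 1):
        for i in range(count - L + 1):
            yield tuple(tokens[i:i + L])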

OTHER TIPS

Do you mean something like this?

>>> from collections import defaultdict
>>> L=["AB 500",
... "Bus AB 500",
... "News CA",
... "News CA BLAH"]
>>> d=defaultdict(list)
>>> for s in L:
...     for w in s.split():
...         d[w].append(s)
... 
>>> print(d["News"])
['News CA', 'News CA BLAH']
>>> print(d["CA"])
['News CA', 'News CA BLAH']
>>> print(d["500"])
['AB 500', 'Bus AB 500']
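
To list every multi-member group instead of querying word by word, the session could continue like this (an extension of mine, not part of the original answer):

>>> for w, grp in d.items():
...     if len(grp) > 1:
...         print(w, "->", grp)
... 
AB -> ['AB 500', 'Bus AB 500']
500 -> ['AB 500', 'Bus AB 500']
News -> ['News CA', 'News CA BLAH']
CA -> ['News CA', 'News CA BLAH']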

Unless repetition of words is an important feature for your use case, I suggest sets. I.e.:

thestrings = [
"AB 500",
"Bus AB 500",
"News CA",
"News CA BLAH",
]

# map each string to the set of its distinct words
thesets = dict((s, set(s.split())) for s in thestrings)

similarities = dict()
for s in thestrings:
    for o in thestrings:
        if s >= o: continue  # consider each unordered pair only once
        sims = len(thesets[s] & thesets[o])
        if not sims: continue  # no words in common
        similarities[s, o] = sims

# most similar pairs first
for s, o in sorted(similarities, key=similarities.get, reverse=True):
    print("%-16r %-16r %2d" % (s, o, similarities[s, o]))

Is this close to what you're looking for? It does classify the four strings you give in the way you desire, but that's a very feeble sample, of course, so I'm double-checking ;-).

What would happen if the string "AB 500 News CA" were added to your list? Would the two groups of strings have to merge? If not, how should the list of strings be split up, and why?

A very general workflow for problems like this (if I understood it correctly) goes like this:

  1. Get a list of candidate pairs via an inverted index/all-pairs similarity search/simhashing
  2. Compute some distance functions for each pair and combine them into a single weight
  3. Each weighted pair ((a, b), weight) now represents an edge in a graph, which you can cluster into the "word-match groups" via hierarchical clustering/power iteration (see the sketch below)
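
A minimal sketch of that pipeline, assuming Jaccard similarity as the single pair weight, a fixed threshold, and plain connected components standing in for a real clustering step (all of these are my choices for illustration, not prescribed by the workflow above):

from collections import defaultdict

def group_strings(strings, threshold=0.5):
    token_sets = {s: frozenset(s.split()) for s in strings}

    # 1. candidate pairs via an inverted index: only strings
    #    sharing at least one word are ever compared
    index = defaultdict(set)
    for s, toks in token_sets.items():
        for w in toks:
            index[w].add(s)
    candidates = set()
    for bucket in index.values():
        bucket = sorted(bucket)
        for i, a in enumerate(bucket):
            for b in bucket[i + 1:]:
                candidates.add((a, b))

    # 2. weight each pair; here a single Jaccard score
    edges = [(a, b) for a, b in candidates
             if len(token_sets[a] & token_sets[b])
             / len(token_sets[a] | token_sets[b]) >= threshold]

    # 3. cluster the thresholded graph; connected components via
    #    union-find as the simplest possible stand-in
    parent = {s: s for s in strings}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in strings:
        groups[find(s)].append(s)
    return list(groups.values())

print(group_strings(["AB 500", "Bus AB 500", "News CA", "News CA BLAH"]))

With the four sample strings this yields [['AB 500', 'Bus AB 500'], ['News CA', 'News CA BLAH']]; tightening the threshold or swapping in a real hierarchical clustering changes the granularity.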
Licensed under: CC-BY-SA with attribution