Question

Company 1 has this vector:

['books','video','photography','food','toothpaste','burgers'] ... ...

Company 2 has this vector:

['video','processor','photography','LCD','power supply', 'books'] ... ...

Suppose this is a frequency distribution (I could make it a tuple, but that's too much to type).
As you can see, these vectors overlap: "video" and "photography" seem "similar" between the two vectors because they sit in similar positions, and "books" is obviously a strong point for company 1. Ordering and position matter, since this is a frequency distribution.

What algorithms could you use to play around with this? What algorithms could you use that could provide valuable data for these companies, using these vectors?

I am new to text-mining and information-retrieval. Could someone guide me about those topics in relation to this question?


Solution

If position is very important, as you emphasize, then the crucial metric will be based on the difference of positions between the same items in the different vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is: how much to weigh an item that's present (say as the N-th one) in one vector and completely absent from the other. Is that a relatively minor issue -- as if the missing item were actually present right after the actual ones, for example -- or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!

For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then, you can preprocess each input vector into a dict mapping item to position (crucial optimization if you have to compare many pairs of input vectors!):

def makedict(avector):
  return dict((item, i) for i, item in enumerate(avector))

and then, to compare two such dicts:

def comparedicts(d1, d2):
  allitems = set(d1) | set(d2)      
  distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
  return sum(d * d for d in distances)

(or, abs(d) instead of the squaring in the last statement). To make missing items weigh more (make dicts, i.e. vectors, be considered further away), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.
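For instance, plugging the two vectors from the question into the makedict and comparedicts helpers above (the score itself only means something relative to other pairs you compare):

company1 = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
company2 = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

d1 = makedict(company1)
d2 = makedict(company2)

# smaller score = the two orderings are more alike
print(comparedicts(d1, d2))   # 78 with the squared-difference version above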

OTHER TIPS

I would suggest a book called Programming Collective Intelligence.
It's a very nice book on how you can retrieve information from simple data like this, and it includes code examples (in Python :)

Edit: Just replying to gbjbaanb: This is Python!

>>> a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
>>> b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']
>>> a = set(a)
>>> b = set(b)
>>> a.intersection(b)
set(['photography', 'books', 'video'])
>>> b.intersection(a)
set(['photography', 'books', 'video'])
>>> b.difference(a)
set(['LCD', 'power supply', 'processor'])
>>> a.difference(b)
set(['food', 'toothpaste', 'burgers'])

Take a look at the Hamming distance.

As mbg mentioned, the Hamming distance is a good start. It basically assigns a bit to every possible item, indicating whether that item appears in the company's list.

E.g. toothpaste is 1 for company A, but 0 for company B. You then count the bits that differ between the companies. The Jaccard coefficient is related to this.
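Here is a minimal sketch of that idea in Python, building the bit vectors over the union of both item lists (the item lists are the ones from the question; everything else is just for illustration):

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

vocabulary = sorted(set(a) | set(b))          # one bit position per possible item
bits_a = [int(item in a) for item in vocabulary]
bits_b = [int(item in b) for item in vocabulary]

hamming = sum(x != y for x, y in zip(bits_a, bits_b))      # count of differing bits
jaccard = len(set(a) & set(b)) / len(set(a) | set(b))      # shared items / total items

print(hamming, jaccard)   # 6 differing items, 3/9 shared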

Hamming distance will not, however, capture similarity between things like "video" and "photography". Obviously, a company that sells one is more likely to also sell the other than a company that sells toothpaste.

For this, you can use techniques like LSI (which is also used for dimensionality reduction) or factorial codes (e.g. neural-network approaches such as Restricted Boltzmann Machines, autoencoders, or Predictability Minimization) to get more compact representations, which you can then compare using the Euclidean distance.
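As a rough sketch of the LSI route, assuming you use scikit-learn and have one text "document" per company (the toy corpus and component count below are made up; a real application would need many more documents):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: one "document" per company, listing what it sells.
docs = [
    "books video photography food toothpaste burgers",
    "video processor photography LCD power_supply books",
]

counts = CountVectorizer().fit_transform(docs)             # term-document matrix
lsi = TruncatedSVD(n_components=2).fit_transform(counts)   # low-dimensional "topic" space

print(cosine_similarity(lsi))   # pairwise similarity between the companies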

Pick the rank of each entry (higher rank is better) and take the sum of the geometric means of the ranks of matching items.

For two vectors:

sum(sqrt(x[i] * y[i]))   # over the items that match

The sum of ranks over each vector should be the same for every vector (preferably 1); that way you can compare more than two vectors. A sketch of this idea follows below.
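A small sketch, assuming the rank of each item is derived from its position (earlier items rank higher) and the ranks of each vector are normalized to sum to 1:

import math

def ranks(items):
    # Earlier items get higher raw rank; normalize so the ranks sum to 1.
    n = len(items)
    raw = {item: n - i for i, item in enumerate(items)}
    total = sum(raw.values())
    return {item: r / total for item, r in raw.items()}

def rank_similarity(x, y):
    # Sum of geometric means of the ranks of items present in both vectors.
    rx, ry = ranks(x), ranks(y)
    return sum(math.sqrt(rx[item] * ry[item]) for item in set(rx) & set(ry))

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']
print(rank_similarity(a, b))   # 1.0 would mean identically ranked vectors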

If you apply ikkebr's method you can find how similar a is to b.

In that case just use

sum(b[x] for x in set(b) & set(a))   # assuming b maps each item to its frequency
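For instance, a small sketch assuming each company's data is a dict mapping item to frequency (the counts below are invented for illustration):

# Hypothetical frequency distributions for the two companies.
freq_a = {'books': 40, 'video': 25, 'photography': 15, 'food': 10, 'toothpaste': 6, 'burgers': 4}
freq_b = {'video': 30, 'processor': 25, 'photography': 20, 'LCD': 12, 'power supply': 8, 'books': 5}

shared = set(freq_a) & set(freq_b)                    # the set intersection from above
overlap_weight_in_b = sum(freq_b[item] for item in shared)
print(shared, overlap_weight_in_b)                    # how much of b's weight overlaps with a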

You could use the std::set_intersection algorithm (this answer assumes C++). The two vectors must be sorted first (use std::sort), then you pass in four iterators plus an output iterator, and the common elements are copied into the output collection. There are a few other set algorithms that operate similarly.
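A rough analogue in Python, assuming both lists are sorted first (this sketches the merge-style idea behind std::set_intersection, not the C++ API):

def sorted_intersection(xs, ys):
    # Walk both sorted lists in lockstep, keeping only the elements present in each.
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            i += 1
        elif ys[j] < xs[i]:
            j += 1
        else:
            out.append(xs[i])
            i += 1
            j += 1
    return out

a = sorted(['books', 'video', 'photography', 'food', 'toothpaste', 'burgers'])
b = sorted(['video', 'processor', 'photography', 'LCD', 'power supply', 'books'])
print(sorted_intersection(a, b))   # ['books', 'photography', 'video']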
