Question

I am doing simple collaborative filtering (CF). It is item-to-item CF. For example, I have a giant dict containing N items, where each key is the name of a product and each value is the list of customers who bought it:

d = {
    'item1': ['customer1', 'customer3', 'customer7'],
    'item2': ['customer3', 'customer5'],
    ...
    'itemN': ['customerX', ..., 'customerY']
}

I also have a little function that calculates the similarity between two items based on their shared customers, e.g. item1 vs. item2:

def littlefunction(name1, name2):

    # convert the customer lists to sets
    item1 = set(d[name1])
    item2 = set(d[name2])

    commonCustomer = item1.intersection(item2)
    totalCustomer = item1.union(item2)

    # Jaccard similarity: shared customers / all customers
    similarity = float(len(commonCustomer)) / len(totalCustomer)
    return similarity

To obtain the top similar items for a given item, I have to scan all N items, calculate the similarity for each, and then sort. So for all N items the complexity is O(N*N).

The running time for each item is currently about 2 minutes (N ≈ 3 million), so generating the complete N*N similarity table would take about 100,000 hours. Are there any algorithms better than this brute-force approach? Only the top few results are needed for each item.
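For reference, a minimal sketch of the brute-force scan described above (top_similar is a hypothetical helper, written against littlefunction from earlier):

import heapq

def top_similar(target, k=10):
    # O(N) similarity computations per item, O(N*N) for the full table
    scores = ((other, littlefunction(target, other))
              for other in d if other != target)
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])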


Solution

Create an inverted index that has:

customer1: [item1, item3, item8, ...]
customer2: [item7, item8, item74, ...]

Then you can:

  1. Look up an item to get the list of customers who bought it
  2. Look up each customer to get the list of items that customer bought

Your time per item should go from 2 minutes to less than 2 seconds.

It requires more memory for that second index, but you're not duplicating the data. And if memory is a problem, you can store this in a simple database and still be much faster than the N^2 algorithm that you're currently using.
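As a concrete sketch (the dict names here are assumptions, not from the original answer), building that second index is a single pass over the existing one:

# Invert item -> customers into customer -> items.
customer_items = {}
for item, buyers in d.items():
    for customer in buyers:
        customer_items.setdefault(customer, []).append(item)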

More detail

You want to create an N*N matrix that shows the similarity between any two items. Using my technique, you do the following:

Create an N*N matrix, and initialize it to 0.
for each item
  Get the list of customers who bought the item (from your item-to-customer index).
  Create an empty dictionary of related items
  for each customer in that list
    for each item that the customer bought
      update the dictionary (add new item, or increase count)
    end for
  end for
  You now have a dictionary that contains the related items,
  and how many customers bought each one. You can update the matrix row
  for the current item from that dictionary.
end for
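A minimal Python sketch of that loop body for one item, assuming the two indexes d (item -> customers) and customer_items (customer -> items) from above:

from collections import Counter

def related_items(item):
    # For every other item, count how many customers bought both.
    counts = Counter()
    for customer in d[item]:                    # customers who bought this item
        for other in customer_items[customer]:  # everything else they bought
            counts[other] += 1
    counts.pop(item, None)  # exclude the item itself
    return counts           # one (sparse) row of the N*N matrix

If you want the same Jaccard score as the question's littlefunction, the union size comes for free: len(d[item]) + len(d[other]) - counts[other].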

Other tips

This is essentially the same idea as @Jim Mischel's answer (build an explicit reverse index), but it does not build an explicit NxN matrix, and it is an implementation rather than a sketch of the full system.

For large values of NUM_CUSTOMERS (say 1,000,000) it doesn't perform very well (~20 sec for 3 similarity_for checks), but the data here is random. Presumably there would be more correlation and less randomness in real customer purchase data.

First, build some synthetic data for testing:

import random
import operator

NUM_CUSTOMERS = 10000
NUM_ITEMS = 1000
MIN_PER_CUST = 2
MAX_PER_CUST = 10
NUM_INTEREST = 10

customers = ["customer_{}".format(num) for num in xrange(1, NUM_CUSTOMERS+1)]
items = ["item_{}".format(num) for num in xrange(1, NUM_ITEMS+1)]

# Each customer buys a random selection of items.
customer_items = {
    customer: random.sample(items, random.randint(MIN_PER_CUST, MAX_PER_CUST))
    for customer in customers}

# Invert it: item -> list of customers who bought that item.
item_customers = {}
for customer, this_items in customer_items.iteritems():
    for item in this_items:
        item_customers.setdefault(item, []).append(customer)

Then define a couple of functions:

def similarity(item1, item2):
    # Jaccard similarity: |customers in common| / |customers in total|
    item1_set = set(item_customers[item1])
    item2_set = set(item_customers[item2])

    num_common = len(item1_set.intersection(item2_set))
    num_total = len(item1_set.union(item2_set))

    return float(num_common) / num_total

def similarity_for(item):
    # Candidates: every item bought by someone who also bought this item.
    to_check = {
        itm for customer in item_customers[item]
            for itm in customer_items[customer] }
    to_check.discard(item)
    # Score only those candidates, then sort best-first.
    rankings = {
        item2: similarity(item, item2)
        for item2 in to_check }
    return sorted(rankings.iteritems(), key=operator.itemgetter(1), reverse=True)

And run it:

for index, item in enumerate(sorted(random.sample(items, 3))):
    rankings = similarity_for(item)
    print "check {}; {} (of {}):".format(index, item, len(rankings))
    for item_other, ranking in rankings[:NUM_INTEREST]:
        print "  {}: {}".format(item_other, ranking)

Getting results that look like:

% python /tmp/similarity.py
check 0; item_121 (of 309):
  item_520: 0.0283018867925
  item_361: 0.027027027027
  item_536: 0.0265486725664
  item_637: 0.0238095238095
  item_515: 0.020202020202
  item_750: 0.019801980198
  item_960: 0.0192307692308
  item_25: 0.0190476190476
  item_548: 0.018691588785
  item_841: 0.018691588785
check 1; item_714 (of 298):
  item_162: 0.0285714285714
  item_491: 0.0272727272727
  item_617: 0.0265486725664
  item_949: 0.0260869565217
  item_788: 0.0192307692308
  item_446: 0.0190476190476
  item_558: 0.018691588785
  item_606: 0.0181818181818
  item_177: 0.0181818181818
  item_577: 0.018018018018
check 2; item_991 (of 352):
  item_758: 0.0298507462687
  item_204: 0.025
  item_85: 0.0247933884298
  item_769: 0.024
  item_501: 0.0232558139535
  item_860: 0.0227272727273
  item_408: 0.0225563909774
  item_480: 0.0223880597015
  item_73: 0.0220588235294
  item_593: 0.021897810219