Question

I'm currently developing an application where I want to group similar items. Items (like videos) can be created by users and also their attributes can be altered or extended later (like new tags). Instead of relying on users' preferences as most collaborative filtering mechanisms do, I want to compare item similarity based on the items' attributes (like similar length, similar colors, similar set of tags, etc.). The computation is necessary for two main purposes: Suggesting x similar items for a given item and for clustering into groups of similar items.

My application so far is follows an asynchronous design and I want to decouple this clustering component as far as possible. The creation of new items or the addition of new attributes for an existing item will be advertised by publishing events the component can then consume.

Computations can be provided best-effort and "snapshotted", which means that I'm okay with the best result possible at a given point in time, although result quality will eventually increase.

So I am now searching for appropriate algorithms to compute both similar items and clusters. At important constraint is scalability. Initially the application has to handle a few thousand items, but later million items might be possible as well. Of course, computations will then be executed on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supports some kind of incremental mode on partial changes of the data.

My initial thought of comparing each item with each other and storing the numerical similarity sounds a little bit crude. Also, it requires n*(n-1)/2 entries for storing all similarities and any change or new item will eventually cause n similarity computations.

Thanks in advance!

UPDATE tl;dr

To clarify what I want, here is my targeted scenario:

  • User generate entries (think of documents)
  • User edit entry meta data (think of tags)

And here is what my system should provide:

  • List of similar entries to a given item as recommendation
  • Clusters of similar entries

Both calculations should be based on:

  • The meta data/attributes of entries (i.e. usage of similar tags)
  • Thus, the distance of two entries using appropriate metrics
  • NOT based on user votings, preferences or actions (unlike collaborative filtering). Although users may create entries and change attributes, the computation should only take into account the items and their attributes, and not the users associated with (just like a system where only items and no users exist).

Ideally, the algorithm should support:

  • permanent changes of attributes of an entry
  • incrementally compute similar entries/clusters on changes
  • scale
  • something better than a simple distance table, if possible (because of the O(n²) space complexity)
Was it helpful?

Solution

Instead of writing from scratch take a look at mahout.apache.org. It has the clustering algorithms you are looking for as well as the recommendation algorithms. It works alongside Hadoop, so you can scale it out easily.

What this will allow you to do is determine similar documents in a cluster based on your keywords and/or description of the video.

https://cwiki.apache.org/MAHOUT/k-means-clustering.html

has a quick tutorial about clustering of documents using a Reuters dataset. It is quite similar to what you are trying to achieve. Mahout includes recommendation algorithms such as slope one, user based, item based and is incredibly easy to extend. It also has some pretty useful clustering algorithms which support dimension reduction features. This is useful for you in case your matrix is sparse (that is, a lot of tags that have very few usage stats).

Also take a look at Lucene to use its tfidf features to cluster tags and documents. Also check Solr. Both are Apache projects.

OTHER TIPS

Recommendation Algorithm would be very helpful as it lists standard algorithm for dealing with your purpose.

Updated:

I guess what you are looking at is Collaborative Quality Filtering and not only Collaborative Filtering, I have attached link to paper, hope this helps.

K-means clustering may be what you want.

N.B.:

The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results ... It works very well on some data sets, while failing miserably on others.

So you should consider how many clusters, how many tags, and what metric.

See also Stack Overflow questions/tagged/k-means.

http://taste.sourceforge.net/old.html

Taste is a flexible, fast collaborative filtering engine for Java. The engine takes users' preferences for items ("tastes") and returns estimated preferences for other items. For example, a site that sells books or CDs could easily use Taste to figure out, from past purchase data, which CDs a customer might be interested in listening to.

Taste provides a rich set of components from which you can construct a customized recommender system from a selection of algorithms. Taste is designed to be enterprise-ready; it's designed for performance, scalability and flexibility. It supports a standard EJB interface for J2EE-based applications, but Taste is not just for Java; it can be run as an external server which exposes recommendation logic to your application via web services and HTTP.

http://savannah.nongnu.org/projects/cofi/

Currently, programmers who want to use collaborative filtering have to read the literature and implement their own algorithms. More often than not, programmers probably design their own algorithms and they will generally produce suboptimal algorithms. We want to build a foundation of already tested algorithms and documented that can be used in a wide range of contexts from research to applications. The guiding principle is that the design should be thin. Cofi doesn't want to be all things for all people. So the focus is on delivering very few lines of code, and to rely on the programmer for providing the necessary glue.

Few more here

Before starting to implement, adapt or use existing library, make sure you know the domain; reading something like "Collective Intelligence in Action" is a good start.

You want an item-based collaborative filtering rather than user-based. There are a number of algorithms for this floating around on Google. Item-based solutions always scale better than user-based solutions. Item based collaborative filtering in PHP has some easy-to-follow example code and fits what you're looking for:

You have to decide what the similarity metric is based on the specifics of your product and your good sense. Is length of video important? If so it deserves high weight.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top