Building a Collaborative filtering / Recommendation System

https://stackoverflow.com/questions/1512624

19-09-2019
|

Question

I'm in the process of designing a website that is built around the concept of recommending various items to users based on their tastes. (i.e. items they've rated, items added to their favorites list, etc.) Some examples of this are Amazon, Movielens, and Netflix.

Now, my problem is, I'm not sure where to start in regards to the mathematical part of this system. I'm willing to learn the math that's required, it's just I don't know what type of math is required.

I've looked at a few of the publications over at Grouplens.org, specifically "Towards a Scalable kNN CF Algorithm: Exploring Effective Applications of Clustering." (pdf) I'm pretty good at understanding everything until page 5 "Prediction Generation"

p.s. I'm not exactly looking for an explanation of what's going on, though that might be helpful, but I'm more interested in the math I need to know. This way I can understand what's going on.

Solution

Let me explain the procedure that the authors introduced (as I understood it):

Input:

Training data: users, items, and ratings of users to these items (not necessarily each user rated all items)
Target user: a new user with some ratings of some items
Target item: an item not rated by target user that we would like to predict a rating for it.

Output:

prediction for the target item by target user

This can be repeated for a bunch of items, and then we return the N-top items (highest predicted ratings)

Procedure:
The algorithm is very similar to the naive KNN method (search all training data to find users with similar ratings to the target user, then combine their ratings to give prediction [voting]).
This simple method does not scale very well, as the number of users/items increase.

The algorithm proposed is to first cluster the training users into K groups (groups of people who rated items similarly), where K << N (N is the total number of users).
Then we scan those clusters to find which one the target user is closest to (instead of looking at all the training users).
Finally we pick l out of those and we make our prediction as an average weighted by the distance to those l clusters.

Note that the similarity measure used is the correlation coefficient, and the clustering algorithm is the bisecting K-Means algorithm. We can simply use the standard kmeans, and we can use other similarity metrics as well such as Euclidean distance or cosine distance.

The first formula on page 5 is the definition of the correlation:

corr(x,y) = (x-mean(x))(y-mean(y)) / std(x)*std(y)

The second formula is basically a weighted average:

predRating = sum_i(rating_i * corr(target,user_i)) / sum(corr(target,user_i))
               where i loops over the selected top-l clusters

Hope this clarifies things a little bit :)

OTHER TIPS

Programming Collective Intelligence is a really user-friendly introduction to the field, with lots of example code in Python. At the very least, it will help set the stage for understanding the math in the academic papers on the topic.

Algorithm of the Intelligent Web (H Marmanis, D Babenko, Manning publishing) is an introductory text on the subjet. It also covers Searching concepts but its main focus is with classification, recommendation systems and such. This should be a good primer for your project, allowing you to ask the right questions and to dig deeper where things appear more promising or practical in your situation.

The book also includes a "refresher" of relevant math topics (mainly linear algebra), but this refresher is minimal; you'll do better on the web.

A pleasant way to discover or get back into linear algebra is to follow Prof. Gilbert Strand's 18.06 lecture series available on MIT OpenCourseWare.

Linear algebra is not the only way to salvation ;-) you may find it useful to brush up on basic statistics concepts such as distribution, covariance, Bayesian inference...

You probably ought to know:

linear algebra
artificial intelligence / machine learning / statistics

Nice to have:

metric spaces
topology
EDA / robust statistics
affine algebra
functional analysis
graph theory

That said, you can go far with just common sense. If you have a list of properties you want your system to satisfy, you will be able to do a lot just by writing code that satisfies those properties.

Examples might be:

never make a "bad" recommendation
score is monotonically increasing in a few parameters
keep the door open for X,Y,Z improvement idea we have for down the line.

From the official documentation of the Abracadabra Recommender API, you start by distinguishing between:

Subjects: These are the entities that you wish to recommend to an user. A movie or an article is for instance a subject. Subjects are characterized that they have certain attributes or content that distinguish them between the various subjects.
Attributes: An attribute is a generic term for a characteristic of a subject. This can be anything and it really depends on how you define the subject. In the example where the subject is a movie, an attribute could be the genre , e.g. adventure, action, sci-fi. An attribute could be also a keyword that is present in the description of this movie, the actor name, the year a movie was published, etc. You name it!
Users: As the name implies, this is the person that wishes to receive recommendations of certain subjects. The user builds an user profile by liking attributes or subjects (and subsequently the attached attributes).
Flow There is a general flow (order in which stuff is done) that is relevant for any type of recommender system and that is also intuitively easy to understand.

The first thing we always need to do is to fill the recommender engine with subjects and their corresponding attributes. Usually this only needs to be done once, but it can also be done dynamically. For instance, if you are recommending articles, then you might want to do this every time an article is added to your website or blog.

The second step is input the preferences of an user. Together with the unique id of your user, you can train the recommender system by liking or disliking certain subjects or attributes. For instance, an user might be shown a list of movies and he/she is given the option to give each movie a rating. Alternatively, the user might build up a profile by inputting which attributes he prefers (e.g. which genres, keywords, release date, etc.). This part is really up to you to decide and the logic of your project.

Once the system has been trained (filled with subjects and user's preferences), then we can call the engine to provide us with recommendations. You can do this once, but also dynamically (thus retraining the model after each feedback you receive from the user). As the user provides more feedback, the model becomes better and the recommendations approach to the actual preferences of the user.

Note that with the Abracadabra Recommender API you only need to send HTTP calls to the API to train your model and to receive recommendations. The API can be accessed using any language, thus either from your website or app (Angular, React, Javascript...) or your server (NodeJS, Curl, Java, Python, Objective-C, Ruby, .NET...).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow