Question

I am building a simple recommendation system on Hadoop. Can you give me an opinion on what to use to build it?

I would like to use Apache Pig or Apache Mahout.

My data set contains:

book_id,name,publisher
user_id,username
book_id,user_id,rating

I have my data in CSV format.

So can you please suggest which technology to use to produce an item-based and a user-based recommendation system?

Solution

Apache Mahout will provide you with an off-the-shelf recommendation engine based on collaborative filtering algorithms.
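
For example, a minimal sketch of both a user-based and an item-based recommender using Mahout's Taste API could look like the following. The file name, user ID, and similarity/neighborhood choices are only illustrative, and note that FileDataModel expects lines in user_id,item_id,rating order, so your book_id,user_id,rating file would need its columns swapped first.

```java
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BookRecommenderSketch {

    public static void main(String[] args) throws Exception {
        // ratings.csv (illustrative name): user_id,book_id,rating per line
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based: recommend from the 10 most similar users (Pearson correlation)
        UserSimilarity userSim = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, userSim, model);
        Recommender userBased = new GenericUserBasedRecommender(model, neighborhood, userSim);

        // Item-based: similarity is computed between books instead of between users
        ItemSimilarity itemSim = new PearsonCorrelationSimilarity(model);
        Recommender itemBased = new GenericItemBasedRecommender(model, itemSim);

        long userId = 1L; // illustrative user ID
        for (RecommendedItem item : userBased.recommend(userId, 5)) {
            System.out.println("user-based: book " + item.getItemID() + " score " + item.getValue());
        }
        for (RecommendedItem item : itemBased.recommend(userId, 5)) {
            System.out.println("item-based: book " + item.getItemID() + " score " + item.getValue());
        }
    }
}
```

The Taste API above runs on a single machine against the CSV; for data that genuinely needs Hadoop, Mahout also ships MapReduce-based recommender jobs (for example the item-based RecommenderJob) that take similar user,item,rating input from HDFS.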

With Pig you will have to implement those algorithms yourself - in Pig Latin, which may be a rather complex task.

OTHER TIPS

I know it's not one of your preferred methods, but another product you can use on Hadoop to create a recommendation engine is Oryx.

Oryx was created by Sean Owen (co-author of the book Mahout in Action and a major contributor to the Mahout code base). It only has three algorithms at the moment (Alternating Least Squares, K-Means Clustering, and Random Decision Forests), but the ALS algorithm provides a fairly easy-to-use collaborative filtering engine sitting on top of the Hadoop infrastructure.

From the brief description of your dataset, it sounds like it would work perfectly. It has a model generation engine (the computational layer), and it can generate a new model based on one of three criteria:

1) Age (time between model generations)
2) Number of records added
3) Amount of data added

Once a generation of the model has been built, another Java daemon (the serving layer) serves out the recommendations (user-to-item, item-to-item, blind recommendations, etc.) via a RESTful API. When a new generation of the model is created, it will automatically pick up that generation and serve it out.
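
Querying the serving layer is then just an HTTP GET. Here is a rough sketch, assuming the daemon listens on localhost:8080 and exposes a /recommend/<user_id> endpoint; the exact paths, port, and response format depend on your Oryx version, so check its documentation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class OryxRecommendClient {

    public static void main(String[] args) throws Exception {
        // Assumed host, port, and endpoint path -- adjust to your Oryx setup
        String userId = "123"; // illustrative user ID
        URL url = new URL("http://localhost:8080/recommend/" + userId);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Print the raw response; typically a list of item_id,score lines
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```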

There are some nice features in the model generation as well, such as aging historic data, which can help get around issues like seasonality (probably not a big deal if you're talking about books, though).

The computational layer (model generation) uses HDFS to store and look up data, and uses MapReduce or YARN for job control. The serving layer is a daemon that can run on each data node; it reads the computed model data from HDFS and presents it over the API.

Licensed under: CC-BY-SA with attribution