Question

I'm implementing a movie recommendation system with real user data. I planned to use collaborative filtering. However, methods of this kind usually involve a huge matrix storing users and the movies they have rated. Since I have more than ten thousand movies and a hundred thousand users, it seems impossible for me to create such a huge sparse matrix. I wonder how everyone implements collaborative filtering with such a large amount of data? Thanks!


Solution

I would normally point you to distributed computing frameworks, but I think this is still a scale that you can easily handle on one machine.

Apache Mahout contains the Taste collaborative filtering library, which was designed to scale on one machine. A model of -- what, 10M data points? -- should fit in memory with a healthy heap size. Look at things like GenericItemBasedRecommender and FileDataModel.
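For reference, here is a minimal sketch of wiring those two classes together. The file name, user ID, and choice of LogLikelihoodSimilarity are illustrative assumptions, not from the original answer; the input is the plain "userID,itemID,rating" CSV format that FileDataModel reads.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MovieRecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,movieID,rating" line per rating (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Item-item similarity; log-likelihood is a reasonable choice for sparse rating data
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);

    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // Top 10 movie recommendations for user 42 (illustrative user ID)
    List<RecommendedItem> recs = recommender.recommend(42L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}
```

Note that FileDataModel loads the ratings into compact in-memory structures keyed by user and item, not a dense users-by-movies matrix, which is why a data set of this size fits comfortably in a normal heap.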

(Mahout also has distributed implementations based on Hadoop, but I don't think you need this yet.)

I'm the author of that library, but have since moved on to commercialize large-scale recommenders as Myrrix. Myrrix includes a stand-alone, single-machine version that is free and open source, and it will also easily handle this amount of data on one machine. For example, your data set is smaller than the one used in this example. Myrrix also has a distributed implementation.

There are other fast distributed implementations beyond the above, like GraphLab. Other non-distributed frameworks are also probably fast enough, like MyMediaLite.

I would suggest just using one of these, or, if you really are wondering "how" it happens, look into the source code and study the data representation.

OTHER TIPS

I didn't use the matrix form to store my data. Instead, I used C++ and built structs such as User, Rating, and Item, which contain the variables and arrays I need. This may increase the complexity of the algorithm, but it saves memory very effectively, because you only store the ratings that actually exist.
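A minimal sketch of that idea, with illustrative class names (the original answer uses equivalent C++ structs; this is shown in Java only to match the earlier example): each user keeps just the ratings they actually made, so memory grows with the number of ratings rather than with users × movies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; the point is the per-user list instead of a full matrix.
class Rating {
  final int movieId;
  final float value;
  Rating(int movieId, float value) { this.movieId = movieId; this.value = value; }
}

class User {
  final int userId;
  final List<Rating> ratings = new ArrayList<>();  // only the movies this user rated
  User(int userId) { this.userId = userId; }
}

class RatingStore {
  private final Map<Integer, User> users = new HashMap<>();

  void addRating(int userId, int movieId, float value) {
    users.computeIfAbsent(userId, User::new).ratings.add(new Rating(movieId, value));
  }
  // Memory is O(number of ratings), not O(users * movies).
}
```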

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow