Question

I'm implementing a movie recommendation system with real user data. I planned to use collaborative filtering. However, methods of this kind usually involve a huge matrix storing users and the movies they have rated. Since I have more than ten thousand movies and a hundred thousand users, it seems impossible for me to create such a huge sparse matrix. I wonder how everyone implements collaborative filtering with such a large amount of data? Thanks!


Solution

I would normally point you to distributed computing frameworks, but I think this is still a scale that you can easily handle on one machine.

Apache Mahout contains the Taste collaborative filtering library, which was designed to scale on one machine. A model of -- what, 10M data points? -- should fit in memory with a healthy heap size. Look at things like GenericItemBasedRecommender and FileDataModel.
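For reference, here is a minimal sketch of wiring those two classes together. The file name, user ID, and choice of LogLikelihoodSimilarity are illustrative assumptions, not from the original answer; the input is the plain "userID,itemID,rating" CSV format that FileDataModel reads.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MovieRecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,movieID,rating" line per rating (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Item-item similarity; log-likelihood is a reasonable choice for sparse rating data
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);

    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    // Top 10 movie recommendations for user 42 (illustrative user ID)
    List<RecommendedItem> recs = recommender.recommend(42L, 10);
    for (RecommendedItem rec : recs) {
      System.out.println(rec.getItemID() + " : " + rec.getValue());
    }
  }
}
```

Note that FileDataModel loads the ratings into compact in-memory structures keyed by user and item, not a dense users-by-movies matrix, which is why a data set of this size fits comfortably in a normal heap.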

(Mahout also has distributed implementations based on Hadoop, but I don't think you need this yet.)

I'm the author of that library, but have since moved on to commercialize large-scale recommenders as Myrrix. Myrrix includes a stand-alone, single-machine version that is free and open source, and it will also easily handle this amount of data on one machine. For example, your data set is smaller than the one used in this example. Myrrix also has a distributed implementation.

There are other fast distributed implementations beyond the above, like GraphLab. Other non-distributed frameworks are also probably fast enough, like MyMediaLite.

I would suggest just using one of these, or, if you really are wondering "how" it happens, look into the source code and study the data representation.

OTHER TIPS

I didn't use the matrix form to store my data. Instead, I used C++ and built structs such as User, Rating, and Item, which contain the variables and arrays I need. This may increase the complexity of the algorithm, but it saves memory very effectively, because you only store the ratings that actually exist.
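A minimal sketch of that idea, with illustrative class names (the original answer uses equivalent C++ structs; this is shown in Java only to match the earlier example): each user keeps just the ratings they actually made, so memory grows with the number of ratings rather than with users × movies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names; the point is the per-user list instead of a full matrix.
class Rating {
  final int movieId;
  final float value;
  Rating(int movieId, float value) { this.movieId = movieId; this.value = value; }
}

class User {
  final int userId;
  final List<Rating> ratings = new ArrayList<>();  // only the movies this user rated
  User(int userId) { this.userId = userId; }
}

class RatingStore {
  private final Map<Integer, User> users = new HashMap<>();

  void addRating(int userId, int movieId, float value) {
    users.computeIfAbsent(userId, User::new).ratings.add(new Rating(movieId, value));
  }
  // Memory is O(number of ratings), not O(users * movies).
}
```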

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow