Question

Okay, so I want to implement a collaborative filtering algorithm in Java, similar to Netflix's or StumbleUpon's recommendation systems. However, I'm not sure whether I should do all the computations (Pearson correlation, prediction computation, etc.) in the database, or load all the necessary data and run the algorithm in Java.

I think the main drawback of doing it in Java is that I have to load all the data into memory; on the other hand, I think doing it in the database will lead to very complex, error-prone queries.

What other advantages or disadvantages does each approach have?

The algorithm I'm implementing can be found here.


Solution

While I haven't read all the details of the algorithm, I would lean toward implementing the actual algorithm in code, for several reasons. First, you can likely leverage existing, well-tested implementations of these algorithms (or at least partial ones). As you mentioned, adding this logic to the database can be complex and more difficult to test. Also, if you ever change your storage engine or format, logic written against the database is tightly coupled to it, making it difficult to reuse.
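To make the "implement it in code" option concrete, here is a minimal, self-contained sketch of the Pearson correlation between two users' rating vectors. It assumes both arrays cover the same items in the same order; the class and method names are illustrative, not taken from the linked algorithm.

```java
public class Pearson {
    // Pearson correlation of two equal-length rating vectors.
    // Returns a value in [-1, 1]; 0 if either vector has no variance.
    public static double correlation(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0;
        for (int i = 0; i < n; i++) { sumX += x[i]; sumY += y[i]; }
        double meanX = sumX / n, meanY = sumY / n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov  += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0 || varY == 0) return 0; // undefined; treat as no correlation
        return cov / Math.sqrt(varX * varY);
    }
}
```

In practice you would probably reach for an existing library implementation rather than maintaining this by hand, which is exactly the "well-tested code" advantage mentioned above.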

If you do the algorithm in Java, you will have to read the data out of the database, which could mean holding large amounts of data in memory. You'll need to make sure this doesn't become a limiting factor: do you need to read ALL of the data at once (in which case RAM eventually becomes the limitation), or can you chunk the data and parallelize the operations? If you can parallelize parts of the algorithm, writing the code in Java (or whatever language you choose) will make it easier to split the data; you might even consider a Map/Reduce framework if the problem fits that model (again, I haven't read through the algorithm details).
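As a rough illustration of that parallelization point, here is a hypothetical sketch that scores many candidate users against a target user with Java's parallel streams instead of a serial loop. The `similarity` method is a placeholder stand-in (a dot product here) for whichever correlation you actually use, and all names are assumptions for the example.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelScorer {

    // Placeholder similarity: a dot product stands in for Pearson correlation.
    static double similarity(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Scores every candidate user against the target in parallel.
    // parallelStream() splits the work across the common fork/join pool,
    // since each user's similarity can be computed independently.
    public static Map<String, Double> scoreAll(double[] target,
                                               Map<String, double[]> others) {
        return others.entrySet().parallelStream()
                .collect(Collectors.toConcurrentMap(
                        Map.Entry::getKey,
                        e -> similarity(target, e.getValue())));
    }
}
```

The same independence that makes this trivially parallel on one machine is what would let the work map onto a Map/Reduce framework if the data outgrows a single JVM.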

In general, I would try to keep business logic out of the database.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow