Question

Good day,

This is my use case: Each user has a wishlist of items and an inventory of items they are offering. The number of items is fixed, while the number of users is unbounded.

My goal is to give the logged-in user a recommendation: a list of users whose inventories match his wishlist, ranked by an algorithm. The caveat is that I need to sort the results so that the user whose offerings most completely cover the wishlist comes up at the top, in descending order of coverage. I also need to present the results in a paginated way, so I'm hoping the query can finish in under 3 seconds on commodity virtual-server specs.

Now onto my data. For simplicity's sake, let's say I limit each user to 35 unique items on his wishlist and 250 unique items in his inventory. For my test data I entered 50k users, each with random wishlist/inventory counts up to those limits. I mapped this with a join in MySQL and got around 7 million relationships out of this test data. Out of curiosity, I queried the database by joining the wishlist and inventory tables on the ID of a user who had 35 items in his wishlist. Even with the most optimized query pattern and indexes on all columns involved, it took an otherwise idle Rackspace virtual server (2 GB RAM, 1 vCPU) 21 seconds to finish the query. To rule out the hardware as the bottleneck, I also ran the query on my home computer, which is much faster and has more RAM than a commodity server; it took 8 seconds, still well short of my under-3-seconds goal.

Just to make sure I had tried everything before deciding to use a graph database, I ran the same test on MongoDB, where the only way I could express my matching algorithm was through MapReduce. It resulted in a 9-second query on the remote server and 3 seconds on my home computer. That is still not feasible for my use case, because MapReduce is very taxing on the server; imagine 500 users running the query at the same time.

Now onto the algorithm I am talking about:

  1. Take all the items on the user's wishlist and get a list of the users who offer them.
  2. For each of those users, collect their items that match the wishlist; if they offer more of an item than is asked for, count only the quantity wished for.
  3. Aggregate these counts into a final percentage of the wishlist matched.
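Outside any particular database, the three steps above can be sketched as plain Python. This is only an illustration of the matching logic; the dictionary shapes (`wishlist`, `offers`) are hypothetical, not my actual schema:

```python
# Hypothetical in-memory sketch of the three matching steps.
# wishlist: item_id -> quantity wished for by the logged-in user
# offers:   uid -> {item_id -> quantity offered}

def match_users(wishlist, offers):
    results = {}
    for uid, inventory in offers.items():
        matched_offer = 0  # sum of capped offer quantities
        matched_wish = 0   # sum of wishlist quantities for matched items
        for item_id, wish_qty in wishlist.items():
            offer_qty = inventory.get(item_id, 0)
            if offer_qty > 0:
                # Step 2: if they offer more than asked, count only the wished qty
                matched_offer += min(offer_qty, wish_qty)
                matched_wish += wish_qty
        if matched_wish > 0:
            # Step 3: aggregate into a fraction of the wishlist matched
            results[uid] = matched_offer / matched_wish
    # Rank users by coverage, best match first
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Example run (mirrors the sample data shown below):
wishlist = {1: 2, 2: 5, 3: 1}            # item_id -> qty wished
offers = {2: {1: 1, 3: 7}, 3: {2: 2}}    # uid -> {item_id -> qty offered}
ranked = match_users(wishlist, offers)
```

In a real system the two dictionaries would of course be replaced by queries against the store, but the capping with `min()` and the final sort are the essence of the algorithm.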

Let's have some sample data:

# users
------------
uid | name
------------
1   | Ramon
2   | Mark
3   | Ralph
------------

# wishlist
--------------------------
pkid | uid | item_id | qty
--------------------------
1    | 1   | 1       | 2
2    | 1   | 2       | 5
3    | 1   | 3       | 1
--------------------------

# offers
--------------------------
pkid | uid | item_id | qty
--------------------------
1    | 2   | 1       | 1
2    | 3   | 2       | 2
3    | 2   | 3       | 7
--------------------------

Which led me to design the graph this way:

[graph diagram: user nodes linked to item nodes by "wishes" and "offers" relationships carrying quantities]

So, starting from Ramon's node, traverse the graph to find the other users who offer items on his wishlist. The following should be the preliminary results, prior to aggregation:

uid | item_id | wishlist_qty | offer_qty
----------------------------------------
2   | 1       | 2            | 1
2   | 3       | 1            | 1  # this should be 7 but we only need 1
3   | 2       | 5            | 2
----------------------------------------

With the data above, we can now compute how much of the wishlist each user covers by doing sum(offer_qty) / sum(wishlist_qty) over that user's matched rows, and then rank the users by this value in descending order, which gives us something like this:

uid | percentage
----------------
2   | 0.67
3   | 0.4
----------------
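To double-check the arithmetic, here is that aggregation applied directly to the preliminary rows above (a quick sketch, not production code):

```python
# Preliminary rows from the traversal: (uid, item_id, wishlist_qty, offer_qty)
rows = [
    (2, 1, 2, 1),
    (2, 3, 1, 1),  # offer of 7 capped at the wished quantity of 1
    (3, 2, 5, 2),
]

totals = {}  # uid -> [sum(offer_qty), sum(wishlist_qty)]
for uid, _item, wish_qty, offer_qty in rows:
    offered, wished = totals.setdefault(uid, [0, 0])
    totals[uid] = [offered + offer_qty, wished + wish_qty]

ranking = sorted(
    ((uid, round(offered / wished, 2)) for uid, (offered, wished) in totals.items()),
    key=lambda kv: kv[1],
    reverse=True,
)
# ranking == [(2, 0.67), (3, 0.4)]
```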

There you have it; that's the recommendation algorithm I want to implement. I am new to graph databases, so I need a nudge in the right direction: is this achievable, and will it perform well for the environment and number of users I intend? If you have other suggestions, whether to use another kind of database such as a column store or to alter my data model, feel free to make them, but please include how I can make it work for my scenario and intended environment.

I hope I have described my problem completely. Thanks in advance for your answers.

Ramon

Solution

Taking your question to be whether a graph database would perform better, and whether it would perform well enough, the answers are, respectively, definitely yes and probably yes. It will certainly perform better than what you've tried so far, and if you model your data well (you already seem to have that down), it will perform within your requirements. I would recommend Neo4j; it is an excellent choice for a recommendation engine like yours. I took a stab at representing your model in the Neo4j Console; feel free to play around with it. You won't get any benchmarks out of it, but it will give you a feel for what working with it would be like.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow