Question

Basically I have a filesystem full of files like this:

{"excellent":[1,2],"good":[1,2,3,4,5],"okay":[6],"bad":[7,8,9],"horrible":[9]}

"excellent" will always be a subset of "good," "horrible" will always be a subset of "bad," and good, bad, and okay will always be unique. Each set, however, can be arbitrarily long, from 0 elements to ... arbitrary. I can't assume the length of two sets will be equal, hence why Jaccard seems to apply to the individual sets.

What I need to do is compute:

  1. Top 25 documents most similar to this one.
  2. Bottom 10 documents least similar to this one.
  3. A similarity value between 0 and 1 for this document against every other document in the set.

Essentially the output should be another json document like this:

{"similar":[[1,0.987],[2,0.876],[3,0.765]...],"differs":[[4,0.012],[5,0.123],[6,0.234]...],"totalSimilarity":0.456}

I've looked at Jaccard indices, and that looks good for a simple array. This is a set of five arrays, though, and it matters a great deal whether two documents are similar on each corresponding set (for example, on both their good sets and their excellent sets).

Can I just "average" the five Jaccard Indicies into one master index number? Or does that loose too much information?

Am I trying too hard? Would something like a Minhash of the json strings work? My first thought is that it might, but then I'm concerned that something based on string hashing would consider "5" being in "horrible" as "similar" to "5" being in "good" and that's completely backwards. Additionally, I'm afraid this would be hampered by the fact that some users may have a hundred entries under "good" while others simply have five, and a string based calculation might choke on that.
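
For what it's worth, the specific collision worried about here ("5" in "horrible" matching "5" in "good") can be avoided by tagging every element with its category before any hashing or set comparison. A small sketch of the idea; the tagging scheme is my own assumption, and the same tagged sets could later be fed to a MinHash implementation rather than compared directly:

```python
def tagged_items(doc):
    """Flatten a document into strings like 'good:5' so the same item
    in different categories never compares as equal."""
    return {f"{category}:{item}" for category, items in doc.items() for item in items}

d1 = {"good": [5], "horrible": []}
d2 = {"good": [],  "horrible": [5]}
print(tagged_items(d1) & tagged_items(d2))   # empty set: no false match
```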

And to be honest, while I want the value for #3 above, I really haven't a clue as to how to calculate this. I'd like to know how similar a document is to the entire corpus.

Yes, this is similar to a recommender system algorithm. I've read the documentation from EasyRec to Mahout and either they don't seem to do exactly what I need or the maths start going way over my head. I'm a PHP developer, not a theoretical physicist. Systems like EasyRec and Mahout, by default, don't seem to "understand" the fact that these are five separate sets that ALL need to align to be considered "similar..." or they require some serious programming effort in their frameworks that leaves me a bit dizzy.

However, interestingly enough (at least to me), most recommender systems work on the entire dataset; I'm specifically interested in clustering users based on their preferences and making recommendations based on that very small set. I'm honestly not too worried about saying "dunno, Buck! You're too unique" so I'm not too concerned about cold start or anything like that.

So I'm looking for advice: either an algorithm, confirmation that the average-over-Jaccard approach would work, or at least some documentation to read to get a better grasp of it all. I recognize that recommender systems are some serious computer science, and that there are solid systems out there that already do it. My problem is that many of them seem to be way overthinking the problem, or require somebody to have a BA in Statistics to get them to produce sane output. (Which is a bit unfair... it's not the universe's fault that a problem is hard.)

Finally, I'm more interested in finding an algorithm to use than a library or server to do it (unless there really is a simple drop in that I'm simply not understanding), as I'd like to see how various programming languages would handle the algorithm in a bit of a shootout.

I find that most recommendation engines/rating systems fail because their datasets are too broad. Trying to average a score over the whole universe creates mediocrity (see: Metacritic, Rotten Tomatoes). Basically I want to be able to say, transparently, to a specific user:

Users most similar to you (link, link, link) really like this, and users least similar to you (link, link, link) really hate this, so you may be interested.

Users most similar to you (link, link, link) agree that this is good, so you might be interested.

Users least similar to you (link, link, link) agree that this is horrible, so you might be interested.

Users least similar to you (link, link, link) agree that this is good, so either you'll hate it, or we'll have to go back to the drawing board.

I want to spell out to the user EXACTLY how I came to think they'd like X, as many recommendations on websites are black boxes, and I find they can be extremely bad with zero explanation. (Beer websites recommending three IPAs after I consistently rate every IPA I drink a "1" simply because there are a lot of IPAs and the world seems to like IPAs.)

That's also why I want the "totalSimilarity". If your "totalSimilarity" is low then the recommender will be flailing around, and I'd like to display that as an excuse: "You're very unique, so these are really just flailing guesses." "You're not very unique, which is actually good, because I'm pretty sure this data is useful to you personally."

Solution

Here is a first sketch of a solution:

1) Remove redundancies from your data:

  • You say that "excellent" will always be a subset of "good" and "horrible" will always be a subset of "bad", so your data is redundant. Remove the excellent entries from the good entries, and the horrible entries from the bad entries.
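
A minimal sketch of this de-duplication step, assuming each document has been parsed into a Python dict shaped like the example in the question:

```python
def remove_redundancy(doc):
    """Drop 'excellent' items from 'good' and 'horrible' items from 'bad',
    since the former are subsets of the latter."""
    doc = {k: set(v) for k, v in doc.items()}
    doc["good"] -= doc.get("excellent", set())
    doc["bad"]  -= doc.get("horrible", set())
    return doc

doc = {"excellent": [1, 2], "good": [1, 2, 3, 4, 5], "okay": [6], "bad": [7, 8, 9], "horrible": [9]}
print(remove_redundancy(doc))   # good becomes {3, 4, 5}, bad becomes {7, 8}
```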

2) Use numerical weights instead of words. You may want to use the following mapping: excellent -> 2, good -> 1, okay -> 0, bad -> -1, horrible -> -2

  • This way you end up with a user-item matrix, with one row per user and one column per item. The numerical entries represent the preference of a user for a given item. The resulting matrix will be sparse and high dimensional, so you will need to apply a dimensionality reduction technique such as principal component analysis or singular value decomposition (see the sketch below).
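
A rough sketch of this step, assuming numpy, items identified by small integer ids, and a plain truncated SVD; the number of retained dimensions k is a tuning knob, not a prescribed value:

```python
import numpy as np

# mapping from step 2; note that "okay" -> 0 makes an okay item look the same
# as an unrated one in the matrix, which may or may not be what you want
WEIGHTS = {"excellent": 2, "good": 1, "okay": 0, "bad": -1, "horrible": -2}

def build_matrix(docs, n_items):
    """docs: list of de-duplicated documents; items assumed to be ids 0..n_items-1."""
    m = np.zeros((len(docs), n_items))
    for row, doc in enumerate(docs):
        for category, items in doc.items():
            for item in items:
                m[row, item] = WEIGHTS[category]
    return m

def reduce_dims(matrix, k=10):
    """Project users into a k-dimensional space via a truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]   # user coordinates in the reduced space
```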

3) Once you have reduced the dimensionality of your problem, you can compute the similarity between users (and, if you want, between users and items) using the dot product within the reduced space.
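
And a sketch of step 3 for the user-to-user case. A raw dot product grows with vector length, so this version normalizes it to a cosine similarity and rescales from -1..1 to the 0..1 range the question asks for; both of those choices are my own assumptions, not part of the answer above.

```python
import numpy as np

def similarities(reduced, user_row):
    """Cosine similarity between one user and every user in the reduced space,
    rescaled from -1..1 to 0..1."""
    norms = np.linalg.norm(reduced, axis=1)
    norms[norms == 0] = 1.0            # guard against all-zero rows
    unit = reduced / norms[:, None]
    cos = unit @ unit[user_row]
    return (cos + 1) / 2

# usage sketch, reusing build_matrix/reduce_dims from the previous snippet:
# reduced = reduce_dims(build_matrix(docs, n_items))
# scores = {i: s for i, s in enumerate(similarities(reduced, 0)) if i != 0}
# feed `scores` to something like build_report() above to get the output JSON
```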

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow