First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and to reason about, so I would try to omit it as much as possible. This means that it's highly preferable to keep direct relation between users and attributes. So, basically your users should be represented as vectors, where each variable (vector element) represents single attribute.
If you choose such representation, make sure all attributes make sense and have appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algos) will treat them incorrectly (e.g. multiply thriller, represented as 2, with comedy, represented as 5, which makes no sense actually).
It's ok to use distance between attributes when applicable. Though I can hardly come up with example in your settings.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
And if the model performs bad, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make it worse? Or do some attributes should really have larger importance than others? Depending on these questions, you may want to:
- Run feature selection to avoid noisy variables.
- Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social group. Say, 1 for "comedy" variable becomes 0.8 for "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok.
- Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself, instead run linear regression to find them out.
Let me describe last point in a bit more detail. Instead of simple cosine similarity, which looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[2]*x[2]*y[2]
Standard way to find such weights is to use some kind of regression (linear one is the most popular). Normally, you collect dataset (X, y)
where X
is a matrix with your data vectors on rows (e.g. details of house being sold) and y
is some kind of "correct answer" (e.g. actual price that the house was sold for). However, in you case there's no correct answer to user vectors. In fact, you can define correct answer to their similarity only. So why not? Just make each row of X
be a combination of 2 user vectors, and corresponding element of y
- similarity between them (you should assign it yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
HTH