I am working on a tag recommendation system that takes an object's metadata strings (e.g. text descriptions) and splits them into 1-, 2- and 3-grams.
The data for this system is kept in 3 tables:
- The "object" table (e.g. what is being described),
- The "token" table, filled with all 1-, 2- and 3-grams found (examples below), and
- The "mapping" table, which maintains the associations between objects and tokens, along with a frequency count for each association.
I can therefore build a result set with a LEFT JOIN that looks something like this:
SELECT mapping.object_id, mapping.token_id, mapping.freq, token.token_size, token.token
FROM mapping
LEFT JOIN token ON (mapping.token_id = token.id)
WHERE mapping.object_id = 1;
 object_id | token_id | freq | token_size | token
-----------+----------+------+------------+--------------
         1 |        1 |    1 |          2 | 'a big'
         1 |        2 |    1 |          1 | 'a'
         1 |        3 |    1 |          1 | 'big'
         1 |        4 |    2 |          3 | 'a big slice'
         1 |        5 |    1 |          1 | 'slice'
         1 |        6 |    3 |          2 | 'big slice'
Now I'd like to get the relative probability of each term within the context of a single object ID, so that I can sort the terms by probability and see which are most probable (e.g. ORDER BY rel_prob DESC LIMIT 25).
For each row, I'm envisioning an additional column that gives freq divided by the sum of all freq values for that row's token_size. In the case of 'a big', for instance, that would be 1/(1+3) = 0.25. For 'a', that's 1/(1+1+1) = 0.333, etc.
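To make the intended calculation concrete, here is a minimal sketch of the arithmetic in plain Python, using the example rows above. The function name relative_probabilities is just a hypothetical helper for illustration, and grouping by (object_id, token_size) is my reading of "within the context of a single object ID". What I'm after is the SQL equivalent of this.

```python
from collections import defaultdict

# Rows copied from the example result set:
# (object_id, token_id, freq, token_size, token)
rows = [
    (1, 1, 1, 2, "a big"),
    (1, 2, 1, 1, "a"),
    (1, 3, 1, 1, "big"),
    (1, 4, 2, 3, "a big slice"),
    (1, 5, 1, 1, "slice"),
    (1, 6, 3, 2, "big slice"),
]


def relative_probabilities(rows):
    """Return (token, token_size, rel_prob) tuples, highest rel_prob first.

    Hypothetical helper: rel_prob = freq / sum of freq over all rows that
    share the same object_id and token_size (an assumed grouping).
    """
    # First pass: total frequency per (object_id, token_size) group.
    totals = defaultdict(int)
    for object_id, _token_id, freq, token_size, _token in rows:
        totals[(object_id, token_size)] += freq

    # Second pass: divide each row's freq by its group total.
    scored = [
        (token, token_size, freq / totals[(object_id, token_size)])
        for object_id, _token_id, freq, token_size, token in rows
    ]
    # Equivalent of ORDER BY rel_prob DESC.
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Equivalent of LIMIT 25.
for token, size, prob in relative_probabilities(rows)[:25]:
    print(f"{token!r:15} size={size} rel_prob={prob:.3f}")
```

With the sample data, 'a big' comes out at 0.25 and each of the three 1-grams at 1/3, matching the hand calculation.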
I can't, for the life of me, figure out how to do this. Any help is greatly appreciated!