Looking for a faster algorithm to count tags/keywords/labels on a document database for a dynamic tagcloud
Question
Current state
- .NET 4.0 Application (WPF)
- Database: SQLCE
- Tables (simplified): Documents, Tags, DocumentsTags [n:n]
- roughly 2000 documents and 600 tags (tags can be assigned to multiple documents)
- tags = keywords = labels
Case
The user has a large document database, which he can filter with a tag cloud. Each tag displays a name (the tag name itself) and a number, which is the total count of documents carrying that tag. If the user selects a tag, only the documents with the selected tag are shown. The dynamic tag cloud should then show only the tags available on the filtered documents, each with an updated count.
Problem
It is slow. After each tag selection, we have to evaluate all the documents again to count the tags. We currently do it recursively, checking for each document which tags it has. We are looking for another solution (caching, a better algorithm, your idea?).
Similarities
Stack Overflow and del.icio.us also have tag clouds. Check them out yourself. How do they do it? I know stored procedures would be a solution, but according to our database developer they are not available on SQLCE.
Solution
You can use two inverted indexes, where each tag is a key in both.
The first inverted index is a map: Tag -> list of Tags
[all the tags that co-occur with the key].
The second one is a map: Tag -> list of Docs
[all the documents that carry the key tag].
Calculating the relevant set of docs after some tags have been selected is simply an intersection of the Docs lists of the selected tags, which can be done efficiently.
Finding the modified tag cloud is likewise an intersection on the inverted index: only tags that co-occur with every selected tag can still appear.
Note that the inverted indexes can be built offline, and building them is a classic example of map-reduce usage.
This thread discusses how to efficiently find the intersection in an inverted index.
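The two indexes and the intersection step described above can be sketched as follows. This is only a minimal in-memory illustration in Python, not the asker's .NET code: the function names, the `doc_tags` input shape, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict

def build_indexes(doc_tags):
    """Build both inverted indexes from a {doc_id: set_of_tags} mapping (assumed shape)."""
    tag_to_docs = defaultdict(set)  # tag -> documents carrying that tag
    tag_to_tags = defaultdict(set)  # tag -> tags co-occurring with it (itself included)
    for doc_id, tags in doc_tags.items():
        for tag in tags:
            tag_to_docs[tag].add(doc_id)
            tag_to_tags[tag].update(tags)
    return tag_to_docs, tag_to_tags

def filter_and_count(selected, all_docs, tag_to_docs, tag_to_tags):
    """Return (matching documents, {tag: count}) for the selected tags."""
    docs = set(all_docs)             # start from every document
    candidates = set(tag_to_docs)    # start from every tag
    for tag in selected:
        docs &= tag_to_docs.get(tag, set())        # docs carrying all selected tags
        candidates &= tag_to_tags.get(tag, set())  # tags co-occurring with all of them
    counts = {}
    for tag in candidates:
        n = len(tag_to_docs[tag] & docs)  # count only within the filtered documents
        if n:
            counts[tag] = n
    return docs, counts

# Hypothetical sample data: four documents, three tags.
doc_tags = {
    1: {"sql", "wpf"},
    2: {"sql", "cache"},
    3: {"wpf"},
    4: {"sql", "wpf", "cache"},
}
tag_to_docs, tag_to_tags = build_indexes(doc_tags)

docs_none, cloud_none = filter_and_count([], doc_tags, tag_to_docs, tag_to_tags)
docs_sql, cloud_sql = filter_and_count(["sql"], doc_tags, tag_to_docs, tag_to_tags)
docs_both, cloud_both = filter_and_count(["sql", "wpf"], doc_tags, tag_to_docs, tag_to_tags)
```

With roughly 2000 documents and 600 tags, both indexes fit comfortably in memory, so they could be built once at startup and updated whenever a tag is assigned or removed, rather than being recomputed after every click.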
Other tips
You should do your second-stage search in a single query, something like:
SELECT
    tags.id AS tagid,
    tags.name AS tagname,
    count(*) AS tagcount
FROM
    tags
    INNER JOIN DocumentsTags AS tda ON tda.tagid = tags.id
    INNER JOIN DocumentsTags AS tdb ON tda.documentid = tdb.documentid
WHERE
    tdb.tagid = <selected tag id>
-- every non-aggregated column in the SELECT list must appear in GROUP BY
GROUP BY
    tags.id,
    tags.name
Edit
After your comment, this is what you should use for the first-stage query (i.e. no tag selected yet, all documents in the list):
SELECT
    tags.id AS tagid,
    tags.name AS tagname,
    count(*) AS tagcount
FROM
    tags
    INNER JOIN DocumentsTags AS tda ON tda.tagid = tags.id
-- every non-aggregated column in the SELECT list must appear in GROUP BY
GROUP BY
    tags.id,
    tags.name
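To try both queries without a SQL CE installation, here is a small sketch that runs them against an in-memory SQLite database from Python. SQLite is only a stand-in here and its dialect differs from SQL CE in places. The column types, the composite primary key, the sample rows, the `tag_cloud` helper, and the added `ORDER BY` (for a stable result order) are all assumptions for illustration.

```python
import sqlite3

# SQLite in memory, used only as a stand-in for SQL CE (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE DocumentsTags (documentid INTEGER, tagid INTEGER,
                                PRIMARY KEY (documentid, tagid));
    INSERT INTO Tags VALUES (1, 'sql'), (2, 'wpf'), (3, 'cache');
    INSERT INTO DocumentsTags VALUES
        (1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (4, 1), (4, 2), (4, 3);
""")

# First-stage query: no tag selected, count every tag over all documents.
ALL_TAGS_SQL = """
    SELECT tags.id, tags.name, count(*)
    FROM Tags AS tags
    INNER JOIN DocumentsTags AS tda ON tda.tagid = tags.id
    GROUP BY tags.id, tags.name
    ORDER BY tags.id
"""

# Second-stage query: count tags only on documents carrying the selected tag.
FILTERED_SQL = """
    SELECT tags.id, tags.name, count(*)
    FROM Tags AS tags
    INNER JOIN DocumentsTags AS tda ON tda.tagid = tags.id
    INNER JOIN DocumentsTags AS tdb ON tda.documentid = tdb.documentid
    WHERE tdb.tagid = ?
    GROUP BY tags.id, tags.name
    ORDER BY tags.id
"""

def tag_cloud(connection, selected_tag_id=None):
    """Return [(tagid, tagname, tagcount), ...] for the current filter (hypothetical helper)."""
    if selected_tag_id is None:
        return connection.execute(ALL_TAGS_SQL).fetchall()
    return connection.execute(FILTERED_SQL, (selected_tag_id,)).fetchall()

cloud_all = tag_cloud(conn)
cloud_for_sql = tag_cloud(conn, 1)  # filter on the 'sql' tag
```

Note that the filtered result still contains the selected tag itself, with a count equal to the number of filtered documents. The count also relies on each (documentid, tagid) pair being stored only once in DocumentsTags.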