Looking for a faster algorithm to count tags/keywords/labels on a document database for a dynamic tagcloud

https://stackoverflow.com/questions/9292682

29-04-2021

Question

Current state

.NET 4.0 Application (WPF)
Database: SQLCE
Tables (simplified): Documents, Tags, DocumentsTags [n:n]
roughly 2000 documents and 600 tags (tags can be assigned to multiple documents)
tags = keywords = labels

Case

The user has a big document database, which he can filter with a tag cloud. Each tag displays a name (the tag name itself) and a number, which is the total count of documents carrying that tag. If the user selects a tag, only the documents with the selected tag are shown. The dynamic tag cloud should then show only the tags available on the filtered documents, each with an updated count.

Problem

It is slow. After each tag selection, we have to evaluate all the documents again to count the tags. We currently do it recursively, checking on each document which tags it has. We are looking for another solution (caching, better algorithm, your idea?).

Similarities

Stack Overflow and del.icio.us also have tag clouds. Check them out yourself. How do they do it? I know stored procedures would be a solution, but according to our database developer they are not available on SQLCE.

Solution

You can use two inverted indexes, where each tag will be a key in both.

One inverted index is a map: Tag -> list of Tags (all the tags that co-occur with the key).
The second one is a map: Tag -> list of Docs (all the documents that carry the tag).

Calculating the relevant set of docs after some tags have been selected is simply an intersection of the selected tags' document lists in the inverted index, which can be done efficiently.
Finding the updated tag cloud is likewise an intersection on the inverted index.
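As a rough illustration of the intersection idea, here is a minimal in-memory sketch in Python. It is not the original poster's code: the function names (`build_index`, `filter_docs`, `tag_cloud`) and the sample data in `DOC_TAGS` are made up for the example, and for brevity it uses only the Tag -> Docs index and derives the cloud counts from it. The Tag -> Tags index described above could be added to narrow the candidate tags before counting.

```python
def build_index(doc_tags):
    """Build the Tag -> set of document ids inverted index."""
    tag_to_docs = {}
    for doc_id, tags in doc_tags.items():
        for tag in tags:
            tag_to_docs.setdefault(tag, set()).add(doc_id)
    return tag_to_docs


def filter_docs(tag_to_docs, selected, all_docs):
    """Return the documents that carry every selected tag."""
    if not selected:
        return set(all_docs)
    # Intersect the smallest posting sets first to keep intermediates small.
    postings = sorted((tag_to_docs.get(t, set()) for t in selected), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


def tag_cloud(tag_to_docs, docs):
    """Count, per tag, how many of the filtered documents carry it."""
    cloud = {}
    for tag, posting in tag_to_docs.items():
        count = len(posting & docs)
        if count:
            cloud[tag] = count
    return cloud


# Hypothetical sample data: document id -> tags on that document.
DOC_TAGS = {
    1: {"wpf", "sql"},
    2: {"wpf"},
    3: {"sql", "cache"},
    4: {"wpf", "sql", "cache"},
}

index = build_index(DOC_TAGS)
docs = filter_docs(index, ["wpf", "sql"], DOC_TAGS)
cloud = tag_cloud(index, docs)
```

With roughly 2000 documents and 600 tags, both indexes fit comfortably in memory, so they could be built once at startup and updated whenever a tag assignment changes.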

Note that the inverted index can be created off-line, and creating it is a classic example of map-reduce usage.

This thread discusses how to efficiently find intersections in an inverted index.

OTHER TIPS

You should do your second stage search in a single query, something like

SELECT
  tags.id AS tagid,
  tags.name AS tagname,
  count(*) AS tagcount
FROM
  tags
  INNER JOIN DocumentsTags AS tda on tda.tagid=tags.id
  INNER JOIN DocumentsTags AS tdb on tda.documentid=tdb.documentid
WHERE
  tdb.tagid=<selected tag id>
GROUP BY
  tags.id, tags.name
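To see what the self-join above returns, here is a small sketch that runs it from Python against an in-memory SQLite database. SQLite stands in for SQLCE purely for illustration, so dialect details may differ. The sample rows and the `cloud_for` helper are invented for the example, and the column names `documentid` and `tagid` are taken from the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tags (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE DocumentsTags (documentid INTEGER, tagid INTEGER);
INSERT INTO Tags VALUES (1, 'wpf'), (2, 'sql'), (3, 'cache');
-- Hypothetical assignments: doc 1 has wpf+sql, doc 2 has wpf,
-- doc 3 has sql+cache, doc 4 has all three tags.
INSERT INTO DocumentsTags VALUES
  (1, 1), (1, 2), (2, 1), (3, 2), (3, 3), (4, 1), (4, 2), (4, 3);
""")

# tdb selects the documents carrying the chosen tag;
# tda lists every tag found on those same documents.
QUERY = """
SELECT tags.id AS tagid, tags.name AS tagname, COUNT(*) AS tagcount
FROM tags
  INNER JOIN DocumentsTags AS tda ON tda.tagid = tags.id
  INNER JOIN DocumentsTags AS tdb ON tda.documentid = tdb.documentid
WHERE tdb.tagid = ?
GROUP BY tags.id, tags.name
ORDER BY tags.id
"""


def cloud_for(selected_tag_id):
    """Return (tagid, tagname, tagcount) rows for the filtered tag cloud."""
    return conn.execute(QUERY, (selected_tag_id,)).fetchall()


rows = cloud_for(1)  # tag cloud after selecting 'wpf'
```

The selected tag id is passed as a query parameter rather than concatenated into the SQL string. The join assumes each (documentid, tagid) pair appears at most once in DocumentsTags, since duplicate rows would inflate the counts.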

Edit

After your comment, this is what you should use for the first-stage query (i.e. no tag selected yet, all documents in the list):

SELECT
  tags.id AS tagid,
  tags.name AS tagname,
  count(*) AS tagcount
FROM
  tags
  INNER JOIN DocumentsTags AS tda on tda.tagid=tags.id
GROUP BY
  tags.id, tags.name

Licensed under: CC-BY-SA with attribution
