What algorithms count frequencies of common elements in a collection of sets?

https://stackoverflow.com/questions/376019

22-08-2019
|

Question

I would like information on algorithms that can help identify commonality and differences between sets of overlapping data.

Using stackoverflow's tag system as an example:

Let's say this question has been given 5 tags. Let's say there are 1000 other questions that have at least one of these tags. Of these 1000 questions, how many of these questions have tags in common that my original post does not have?

Another more simple way of describing this is an auto-suggest tagging system :

"You tagged your question with [5 tags I selected]. Other similiar questions were tagged with [list of tags that might be of interest]. where [list of tags that might be of interest] are frequently occuring tags that aren't in my orginal list.

Code examples in c# if possible :)

Solution

I don't know of any specific algorithms or data structures, but I could suggest a basic way of handling this:

Assumption: each entry has five unique tags.

Collect all entries containing any one of the five tags (no duplicates).
For each entry in the list, use an associative array (hashtable) for each tag, incrementing the value.
For each entry in the array, append the tag name into that array's entry index.

In (sloppy) pseudo code, use two loops (if possible):

for each entry
    if any tag in original_tags
        tag_list[tag]++
end

for next in tag_list
    tag_count[tag_list[next]] += next
end

This should produce a sparse array of concatenated tag names (ok, I didn't include a separator, but hey it's pseudo code :-). Keep the highest number, then iterate backward for the best suggestions.

(Cache to optimize, but watch for updates)

Paul.

OTHER TIPS

Look into Wager-Hamming distance. This is the Hamming distance defined on strings as the number of edit operations it takes to transform one string into another.

You could also potentially use the partial order of equivalence classes and set inclusion: when questions A and B have the exact same set of tags up to reordering, they're equal, set union, set difference, and set intersection then define a partial order for < and > comparisons.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow