Question

Suppose I have many tagged entities (e.g. blog posts with tags) to store in a SQL database. For example:

post1: work
post2: work, programming, java, work
post3: work, programming, sql
post4: vacation, photo
post5: vacation
post6: photo

Suppose I also have a list of tags:

work, vacation

Now I'd like to get a sample of posts of size 2, i.e. two posts with tags from the list. For example:

sample1: post1 and post2
sample2: post1 and post4
sample3: post2 and post5

In addition, I'd like the sample to contain all tags in the list. Note that sample1 does not meet this requirement, since the tags of its posts do not include the tag vacation from the list.

I would also like all tags from the list to occur equally often in the sample. Let's consider 2 samples of size 4:

sample1: post1, post2, post3, post6
sample2: post1, post3, post4, post5

Note that sample1 does not meet this requirement, since tag work occurs 3 times in it while vacation does not occur at all.

My question is: how should I design a relational database and a SQL query to retrieve samples of a given size?


Solution

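If you want to get all posts that have every tag in a comma-delimited list (MySQL syntax; the list must not contain spaces after the commas, because find_in_set compares whole entries). The having clause compares the number of distinct matching tags with the number of entries in the list, i.e. the number of commas plus one: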

select postid
from post_tags
where find_in_set(tagid, @LIST) > 0
group by postid
having count(distinct tagid) = 1 + length(@LIST) - length(replace(@LIST, ',', ''));
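
For trying the idea out without a MySQL server, here is a minimal sketch in Python with SQLite. The post_tags layout (one row per post/tag pair) is an assumption, since the answer does not show the schema, and posts_with_all_tags is a hypothetical helper name. SQLite has no find_in_set, so the sketch swaps in a parameterized IN list and passes the tag count as a parameter instead of counting commas.

```python
import sqlite3

# Assumed schema: one row per (post, tag) pair; the primary key rules out
# duplicate tags on a post (the question lists "work" twice for post2).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table post_tags (postid text, tagid text, primary key (postid, tagid))"
)
conn.executemany(
    "insert into post_tags values (?, ?)",
    [
        ("post1", "work"),
        ("post2", "work"), ("post2", "programming"), ("post2", "java"),
        ("post3", "work"), ("post3", "programming"), ("post3", "sql"),
        ("post4", "vacation"), ("post4", "photo"),
        ("post5", "vacation"),
        ("post6", "photo"),
    ],
)


def posts_with_all_tags(tags):
    """Return the ids of posts that carry every tag in `tags`, sorted by id."""
    wanted = sorted(set(tags))
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        "select postid from post_tags "
        f"where tagid in ({placeholders}) "
        "group by postid "
        "having count(distinct tagid) = ? "
        "order by postid"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


print(posts_with_all_tags(["work", "programming"]))  # ['post2', 'post3']
print(posts_with_all_tags(["work", "vacation"]))     # []
```

Since every returned post carries every listed tag, any sample drawn from the result has equal tag counts. With the question's data and the list work, vacation, however, no single post has both tags, so the result is empty.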

If you want just a random "sample" of them (here, 5 posts):

select postid
from (select postid
      from post_tags
      where find_in_set(tagid, @LIST) > 0
      group by postid
      having count(distinct tagid) = 1 + length(@LIST) - length(replace(@LIST, ',', ''))
     ) t
order by rand()
limit 5
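
To check whether a candidate sample meets the two requirements from the question (every listed tag is present, and all listed tags occur equally often), something like the following Python sketch may help. The POSTS dictionary just restates the question's data, and is_balanced_sample is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Tags per post, copied from the question (sets drop the repeated "work" on post2).
POSTS = {
    "post1": {"work"},
    "post2": {"work", "programming", "java"},
    "post3": {"work", "programming", "sql"},
    "post4": {"vacation", "photo"},
    "post5": {"vacation"},
    "post6": {"photo"},
}


def is_balanced_sample(sample, wanted):
    """True if every wanted tag occurs in the sample and all occur equally often."""
    if not wanted:
        return True  # nothing to cover, so the requirements hold trivially
    counts = Counter(
        tag for post in sample for tag in POSTS[post] if tag in wanted
    )
    occurrences = [counts[tag] for tag in wanted]
    return min(occurrences) > 0 and len(set(occurrences)) == 1


tags = ["work", "vacation"]
print(is_balanced_sample(["post1", "post2"], tags))                    # False
print(is_balanced_sample(["post1", "post4"], tags))                    # True
print(is_balanced_sample(["post1", "post2", "post3", "post6"], tags))  # False
print(is_balanced_sample(["post1", "post3", "post4", "post5"], tags))  # True
```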
Licensed under: CC-BY-SA with attribution