Question

I've heard of a few ways to implement tagging; using a mapping table between TagID and ItemID (makes sense to me, but does it scale?), adding a fixed number of possible TagID columns to ItemID (seems like a bad idea), Keeping tags in a text column that's comma separated (sounds crazy but could work). I've even heard someone recommend a sparse matrix, but then how do the tag names grow gracefully?

Am I missing a best practice for tags?

Was it helpful?

Solution

Three tables (one for storing all items, one for all tags, and one for the relation between the two), properly indexed, with foreign keys set running on a proper database, should work well and scale properly.

Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID

OTHER TIPS

Normally I would agree with Yaakov Ellis but in this special case there is another viable solution:

Use two tables:

Table: Item
Columns: ItemID, Title, Content
Indexes: ItemID

Table: Tag
Columns: ItemID, Title
Indexes: ItemId, Title

This has some major advantages:

First it makes development much simpler: in the three-table solution for insert and update of item you have to lookup the Tag table to see if there are already entries. Then you have to join them with new ones. This is no trivial task.

Then it makes queries simpler (and perhaps faster). There are three major database queries which you will do: Output all Tags for one Item, draw a Tag-Cloud and select all items for one Tag Title.

All Tags for one Item:

3-Table:

SELECT Tag.Title 
  FROM Tag 
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 WHERE ItemTag.ItemID = :id

2-Table:

SELECT Tag.Title
FROM Tag
WHERE Tag.ItemID = :id

Tag-Cloud:

3-Table:

SELECT Tag.Title, count(*)
  FROM Tag
  JOIN ItemTag ON Tag.TagID = ItemTag.TagID
 GROUP BY Tag.Title

2-Table:

SELECT Tag.Title, count(*)
  FROM Tag
 GROUP BY Tag.Title

Items for one Tag:

3-Table:

SELECT Item.*
  FROM Item
  JOIN ItemTag ON Item.ItemID = ItemTag.ItemID
  JOIN Tag ON ItemTag.TagID = Tag.TagID
 WHERE Tag.Title = :title

2-Table:

SELECT Item.*
  FROM Item
  JOIN Tag ON Item.ItemID = Tag.ItemID
 WHERE Tag.Title = :title

But there are some drawbacks, too: It could take more space in the database (which could lead to more disk operations which is slower) and it's not normalized which could lead to inconsistencies.

The size argument is not that strong because the very nature of tags is that they are normally pretty small so the size increase is not a large one. One could argue that the query for the tag title is much faster in a small table which contains each tag only once and this certainly is true. But taking in regard the savings for not having to join and the fact that you can build a good index on them could easily compensate for this. This of course depends heavily on the size of the database you are using.

The inconsistency argument is a little moot too. Tags are free text fields and there is no expected operation like 'rename all tags "foo" to "bar"'.

So tldr: I would go for the two-table solution. (In fact I'm going to. I found this article to see if there are valid arguments against it.)

If you are using a database that supports map-reduce, like couchdb, storing tags in a plain text field or list field is indeed the best way. Example:

tagcloud: {
  map: function(doc){ 
    for(tag in doc.tags){ 
      emit(doc.tags[tag],1) 
    }
  }
  reduce: function(keys,values){
    return values.length
  }
}

Running this with group=true will group the results by tag name, and even return a count of the number of times that tag was encountered. It's very similar to counting the occurrences of a word in text.

Use a single formatted text column[1] for storing the tags and use a capable full text search engine to index this. Else you will run into scaling problems when trying to implement boolean queries.

If you need details about the tags you have, you can either keep track of it in a incrementally maintained table or run a batch job to extract the information.

[1] Some RDBMS even provide a native array type which might be even better suited for storage by not needing a parsing step, but might cause problems with the full text search.

I've always kept the tags in a separate table and then had a mapping table. Of course I've never done anything on a really large scale either.

Having a "tags" table and a map table makes it pretty trivial to generate tag clouds & such since you can easily put together SQL to get a list of tags with counts of how often each tag is used.

I would suggest following design : Item Table: Itemid, taglist1, taglist2
this will be fast and make easy saving and retrieving the data at item level.

In parallel build another table: Tags tag do not make tag unique identifier and if you run out of space in 2nd column which contains lets say 100 items create another row.

Now while searching for items for a tag it will be super fast.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top