Question

Everywhere I look, I see that whenever a site implements a tag system, it converts the tag names to lowercase. Even here on StackOverflow.

I was wondering why that is. Other than preventing duplication, I can't think of a reason to use lowercase. I believe it hurts the practical side of tags: people are used to reading "IBM", not "ibm", and "C#", not "c#". It takes the user a bit longer to understand what the tag means. So I'm wondering whether I should allow capitals in my tag system, or whether lowercase is a convention and I've got it all wrong.

I want to hear your opinion.

Solution

Ask an engineer the reason why something is a certain way, and they'll go to great lengths to figure it out. ;)

In this case, I'd be inclined to explain the prevalence of lowercase by a combination of laziness (programmers not willing to consider the points you bring up) and imitation (once you see it done a certain way on site S, you tend to reimplement it for site S' with similar assumptions).

It certainly seems feasible to store tags in such a way that case doesn't matter (for purposes of sorting, querying and so on) but display the tags with the capitalization originally intended.
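As a rough illustration of that idea, here is a minimal sketch in Python. The `TagRegistry` class and its method names are hypothetical, and "first spelling entered wins" is just one possible policy, not a recommendation:

```python
from typing import Optional


class TagRegistry:
    """Keys tags by a case-folded form but remembers a display spelling."""

    def __init__(self) -> None:
        # case-folded key -> spelling shown to users
        self._display = {}

    def add(self, tag: str) -> str:
        """Register a tag and return the spelling that will be displayed."""
        spelling = tag.strip()
        key = spelling.casefold()
        # Assumption: the first spelling entered becomes the display form;
        # later case variants map onto the same entry.
        return self._display.setdefault(key, spelling)

    def display(self, tag: str) -> Optional[str]:
        """Look up the display spelling for any case variant of a tag."""
        return self._display.get(tag.strip().casefold())

    def __len__(self) -> int:
        return len(self._display)


registry = TagRegistry()
registry.add("IBM")
registry.add("ibm")  # same tag as "IBM", no duplicate is created
registry.add("C#")
```

Querying and deduplication go through the folded key, while the user only ever sees the stored spelling.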

OTHER TIPS

As you already noticed, it prevents duplication. People are not consistent in their capitalization. Just look at the tags here and notice that people can't decide whether it's "objective-c", "objc" or "objectivec". Throw in "Objective-C", "Objective-c" and so on, and you'd have a real mess.

Note I'm not saying it would be impossible to deal with capitals, just difficult. For example, how do you know the correct capitalization? Just accept the first one entered as correct? Rely on moderators to clean up?

Different cases should always be considered equivalent for tags.

This is another reason to store your tags normalized. The single normalized row contains the accepted case, and tags are linked using a many-to-many link table. Comparison against the tag table is done case-insensitively, so there will never be duplicates.
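A minimal sketch of such a layout, using SQLite through Python's `sqlite3` module. The table names (`tag`, `post_tag`) and the `tag_post` helper are made up for the example, and note that SQLite's built-in `NOCASE` collation only folds ASCII letters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per tag; the stored spelling is the accepted case.
    CREATE TABLE tag (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE COLLATE NOCASE
    );
    -- Many-to-many link between posts and tags.
    CREATE TABLE post_tag (
        post_id INTEGER NOT NULL,
        tag_id  INTEGER NOT NULL REFERENCES tag(id),
        PRIMARY KEY (post_id, tag_id)
    );
""")


def tag_post(post_id, name):
    """Attach a tag to a post, reusing an existing row for any case variant."""
    # The UNIQUE ... COLLATE NOCASE constraint makes this a no-op when a
    # case variant of the tag already exists.
    conn.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
    # The comparison uses the column's NOCASE collation.
    (tag_id,) = conn.execute(
        "SELECT id FROM tag WHERE name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO post_tag (post_id, tag_id) VALUES (?, ?)",
        (post_id, tag_id),
    )
    return tag_id


first = tag_post(1, "Objective-C")
second = tag_post(2, "objective-c")  # resolves to the same tag row
```

Here both posts end up linked to one tag row whose stored name keeps the spelling that was entered first.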

(I am not giving advice for any particular site or system in this answer; each specific system may have its own considerations.)

I guess the reason is to prevent duplication and to ease sorting or identification (it's easier if you do not need to consider multiple options). And possibly to maintain some consistency, as many web user interfaces are geared towards people who sometimes bother to capitalize correctly and sometimes do not.

But then, those are problems anyway, because all too often there is more than one way to refer to something. If your tags are ever used as symbols in some sort of script, configuration, or code (e.g. mail filters, settings files, command lines), it's good to have a simple convention for specifying them, and if all symbols are of similar significance, allowing or distinguishing between different case variations, delimiters, etc. can be problematic. As a Unix user, I try to keep file names simple, short, lowercase, and free of special characters, and more so when they are (for example) mailbox names or source files, as they are likely to have to be typed and specified in many contexts where doing otherwise would be inconvenient.

On the other hand, when using a sophisticated graphical or web-based interface that allows easy selection from a list, completion of typed entries, suggestion of closest matches, etc., it makes sense to allow some sort of mapping. Give each tag a short, simple, lowercase identifying name, but also allow it a "long" or "human" name, which will be shown where it makes sense. Tags can be uniquely identified and specified by their short name, but read more conveniently by their long name.
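A small sketch of that short-name/long-name mapping, in Python. The `Tag` class, the sample tags, and the `resolve` and `complete` helpers are all hypothetical names chosen for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    short_name: str  # simple lowercase identifier, unique per tag
    long_name: str   # "human" name shown in the interface


# Sample data, keyed by the short name.
TAGS = {
    t.short_name: t
    for t in [
        Tag("ibm", "IBM"),
        Tag("c#", "C#"),
        Tag("objective-c", "Objective-C"),
    ]
}


def resolve(user_input):
    """Find a tag from whatever case the user typed, or None if unknown."""
    return TAGS.get(user_input.strip().lower())


def complete(prefix):
    """Suggest long names for tags whose short name starts with the prefix."""
    p = prefix.strip().lower()
    return sorted(
        t.long_name for t in TAGS.values() if t.short_name.startswith(p)
    )
```

Scripts and URLs would only ever use `short_name`, while lists and completion menus show `long_name`.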

This is similar to how usernames work in many systems. I wouldn't choose a mixed-case username, and I would rather have usernames treated case-insensitively (so I would just use the case that makes sense on the system I am on, which is lowercase on Unix but uppercase on some older systems). Most systems also store other information about users, such as their long or full name, which is nicer to read, and many user interfaces (e.g. Windows XP, Mac OS, and I guess also some newer Unix desktop interfaces like GNOME and KDE) therefore display it in desktop login choosers, messages, etc.

In the case of tags for community systems on the web, I guess the solution to the duplication problem is some level of moderation of tags, even if just by the community itself, and the ability to rename and merge tags (unlike usernames in most cases) or edit their long names, in case something was mistagged.

I'd like to see tags being representative of what they categorise. In this respect, tags should follow the exact same form as the thing they are describing.

From a technical point of view I see where the problems may arise; however, I don't see that as a reason not to fully investigate a solution.

I work in digital publishing and I can see the benefit of following correct usage. On the flip side, you'd be hard pushed to see full lowercase being used in a magazine, book or newspaper (unless it was a stylistic choice).

http://en.wikipedia.org/wiki/List_of_case-sensitive_English_words

That said, the beauty of the English lexicon is its ability to adapt, modify and evolve.

That sounds like a valid point to me. I'm sure they could come up with some simple parsing to capitalize each word (separated by dashes), but how would you know that it's supposed to be IBM instead of Ibm? I think someone would have to manually change the tag lookup table to accomplish this.
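To make that concrete, here is a rough Python sketch of the "capitalize each dash-separated word" parsing, with a manually maintained lookup table for the exceptions. The `OVERRIDES` table and `display_name` function are hypothetical:

```python
# Hand-maintained exceptions; someone would have to curate this table.
OVERRIDES = {
    "ibm": "IBM",
    "c#": "C#",
}


def display_name(tag):
    """Turn a stored lowercase tag into a display form."""
    key = tag.lower()
    if key in OVERRIDES:
        return OVERRIDES[key]
    # Default rule: capitalize each word between dashes.
    return "-".join(word.capitalize() for word in key.split("-"))


print(display_name("objective-c"))  # Objective-C
print(display_name("ibm"))          # IBM (only because of the override)
```

Without the override entry, the default rule would produce "Ibm", which is exactly the problem described above.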

I agree that in principle this could be done in a more sophisticated manner. For example, you could implement a similarity metric that could recognize all of these as being likely synonyms:

  • IBM
  • ibm
  • I B M
  • I. B. M.
  • I.B.M.
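One very simple way to sketch this in Python is to strip whitespace and dots before comparing, and fall back to a generic string-similarity ratio from the standard library's `difflib`. The function names and the 0.9 threshold are arbitrary assumptions for illustration, not a tuned metric:

```python
import re
from difflib import SequenceMatcher


def canonical(tag):
    """Drop whitespace and dots, then fold case."""
    return re.sub(r"[\s.]+", "", tag).casefold()


def likely_synonyms(a, b, threshold=0.9):
    """Guess whether two tags refer to the same thing.

    The threshold is an arbitrary assumption and would need tuning.
    """
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return True
    return SequenceMatcher(None, ca, cb).ratio() >= threshold


variants = ["IBM", "ibm", "I B M", "I. B. M.", "I.B.M."]
# Every variant in the list collapses to the same canonical key.
keys = {canonical(v) for v in variants}
```

The exact-match step handles the list above; the fuzzy fallback is where the guesswork starts.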

However, there's a tradeoff between the increased runtime (not to mention development effort) and the increase in utility.

It's also been my general experience that as heuristics become more complex, their failure modes become more mysterious and bizarre. At least the convert-alphabetics-to-standard-case technique is easy for humans to understand and do in their heads when they have questions.

When typing, you would have to turn on caps lock to make everything upper-case. People are lazy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow