Question

I'm using MS SQL Server 2005.

What's the best schema for a Wiki-like system? where users edit/revise a submission and the system keeps track of these submissions.

Lets say we're doing a simple wiki-based system. Will keep track of each revision plus views and latest activity of each revision. In other screens, the system will list "Latest Submissions" and "Most Viewed", plus search by title.

My current schema (and I know its bad) is using a single table. When I need to see the "Latest Submissions" I sort by "LatestActivity", group by "DocumentTitle", then take first N records. I assume a lot of grouping (especially grouping on nvarchar) is bad news. For listing the most viewed I also do the same: sort by views, group by name, take first N records. Most of the time, I will also be doing a "WHERE DocumentName LIKE '%QUERY-HERE%'".

My current schema is "Version 1", see below: alt text http://www.anaimi.com/junk/schemaquestion.png

I assume this is not acceptable. So i'm trying to come up with another/more-performant design. How does Version 2 sound to you? In version two I get the advantage of grouping on WikiHeadId which is a number - i'm assuming grouping over a number is better than nvarchar.

Or the extreme case which is version 3, where I will do no grouping, but has several disadvantages such as duplicating values, maintaining these values in code, etc.

Or is there a better/known schema for such systems?

Thanks.

(moved from ServerFault - i think its a development question more than an IT question)

Was it helpful?

Solution

Firstly (and out of curiosity) how does the current schema indicate what the current version is? Do you just have multiple 'WikiDocument' entries with the same DocumentTitle?

I'm also not clear on why you need a 'LastActivity' at a Version level. I don't see how 'LastActivity' fits with the concept of a 'Version' -- in most wikis, the 'versions' are write-once: if you modify a version, then you're creating a new version, so the concept of a last-updated type value on the version is meaningless -- it's really just 'datecreated'.

Really, the 'natural' schema for your design is #2. Personally, I'm a bit of a fan of the old DB axiom 'normalize until it hurts, then denormalize until it works'. #2 is a cleaner, nicer design (simple, with no duplication), and if you have no urgent reason to denormalize to version 3, I wouldn't bother.

Ultimately, it comes down to this: are you worrying about 'more performant' design because you've observed performance problems, or because you hypothetically might have some? There's no real reason #2 shouldn't perform well. Grouping isn't necessarily bad news in SQL Server -- in fact, if there's an appropriate covering index for the query, it can perform extremely well because it can just navigate to a particular level in the index to find the grouped values, then use the remaining columns of the index to use to MIN/MAX/whatever. Grouping by NVARCHAR isn't particularly bad -- if it's not observed to be a problem, don't fret about it, though (non-binary) collations can make it a little tricky -- but in version 2, where you need to GROUP BY you can do it by WikiHeadId, right?

One thing that may make life easier, if you do a lot of operations on the current version (as I assume you would), to add an FK back from the head table to the body table, indicating the current version. If you want to view the current versions with the highest number of hits, with #2 as it stands now it might be:

SELECT TOP ...
FROM WikiHead
INNER JOIN 
  (SELECT WikiHeadId, MAX(WikiBodyVersion) /* or LastUpdated? */ AS Latest 
   FROM WikiBody GROUP BY WikiHeadId) AS LatestVersions
INNER JOIN WikiBody ON 
  (Latest.WikiHeadId = WikiBody.WikiHeadId)
  AND (WikiBody.WikiBodyVersion = LatestVersions.Latest)
ORDER BY 
  Views DESC

or alternatively

...
INNER JOIN WikiBody ON 
  (WikiHead.WikiHeadId = WikiBody.WikiHeadId)
  AND (WikiBody.WikiBodyVersion = 
    (SELECT MAX(WikiBodyVersion) FROM WikiBody WHERE WikiBody.WikiHeadId = WikiHead.WikiHeadId)
...

both of which are icky. If the WikiHead keeps a pointer to the current version, it's just

...    
INNER JOIN WikiBody ON 
  (WikiHead.WikiHeadId = WikiBody.WikiHeadId)
  AND (WikiHead.Latest = WikiBody.WikiBodyVersion)
...

or whatever, which may be a useful denormalization just because it makes your life easier, not for performance.

OTHER TIPS

Check this out.

It's the database schema for mediawiki, what wikipedia is based on.

It looks pretty well documented and would be an interesting read for you.

From this page.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top