tsql query and index question

https://stackoverflow.com/questions/3322605

28-09-2020
|

Question

I have a table that stores photo information with id as Primary key:

id(PK), title, album_id, posted_by, published, filename, tags, ratings, date_posted

This table will hold infor of 100+ Million photos and I need to run this query like these frequently:

1) get all photos (just id,filename,title columns) of a given album

select id, filename, title from photos where album_id = @AlbumId and published = 1

2) get all published photos of a given user but exclude photos of currently viewing album

select id, filename, title from photos where posted_by='bob' and album_id <>10 and published = 1

I want to avoid index and table scanning. I need to use seek(say 100%) as much as possible.

Can this be done? What type of index and on which columns would help me achieve this?

Thanks

Solution

In reality, you'll only be able to find this out yourself by measuring performance before you tweak, then tweak, and measure again and again.

But based on your query, you should consider (or at least try this first) a non-clustered index like this:

CREATE NONCLUSTERED INDEX IX01_Photos
  ON dbo.Photos(album_id, published, posted_by)
  INCLUDE(id, filename, title)

Reasoning:

both your most frequent queries have WHERE clauses using album_id and published - so use these two columns first in your index
your second query also includes posted_by in the WHERE clause - put that into that same index as the third column
in order to avoid expensive bookmark lookups into the actual data table, you can include the id, filename, title columns in the index

With all these things in place, you should be seeing mostly index seeks on that new non-clustered index to satisfy your queries. But again: lots of other factors also come into play which you probably haven't mentioned in your question and possibly not even thought about yourself - but this approach should give you a good starting point no less.

OTHER TIPS

You didn't mention if there's a need to use the date_posted or the id as filter criteria in the query, so it might be best to use a CLUSTERED index on a non-chronological column (I'm assuming that the current CLUSTERED index is the PK. Right?).

I would create a CLUSTERED index on the album_id.

If you can't change the CLUSTERED index or there are many other queries that benefit from the existing clustered index, then I support the answer from @marc_s (and will vote accordingly.)

I would suggest a clustered index on album_id and a secondary index on posted_by, if the former is the one that will get hit most. Invert them if posted_by is hit most. Depending on how many photos there are for each album_id or posted_by, it may be quite feasible to filter on published in the calling code (in other words, don't add it as a restriction in the query, rather filter client-side). If not, you'll have to add that published constraint into the query, but the primary restriction of album_id should mean that only a small scan on published is incurred. But as stated, it may be easier just to filter on published client-side.

Primary key on Id. Make it non-clustered. I'd guess this won't be used much (particularly if all lookups are by album or poster).

Clustered index on AlbumId. Seems like it'd be used in most queries.

Non-clustered index on Posted_By. With AlbumId the clustered index, it will appear at the leaf-level of this index, and so act pretty much like an INCLUDEd column. Depending on usage it might be better to have this as the clustered index... but as a varchar(20), it'd take up more disk space, and performance would be worse than AlbumId (assuming AlbumId is an int).

You cannot have Published as a column in the index, as you cannot index on bit columns. Nor would you want to--with only two possible values across 100M+ rows, SQL would probably never use it for optimizing queries.

I'd recommend normalizing Posted_By (move it to its own table, give it its own surrogate key, and use that as a foreign key in this table). This would significantly reduce storage space in your main table, increase overall performance, and allow you to flip the clustered index to that column if need be. (Also, if "Bob" posts to the table, and then "Bob" from across town also posts, how do you tell Bob from Bob?)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow