Question

Now the proposed table structures are:

data_table
->impressions
->clicks
->ctr

OR

data_table_1
->ctr

data_table_2
->impressions
->clicks

What queries are executed? There are about 500 updates per second for the impressions, about 1 update per second for the clicks, and about 500 updates per second for the ctr.

Now my application sorts the data by the ctr. The ctr is the click-through rate, which is calculated as ctr = clicks / impressions. I have realised that the ctr only needs to be recalculated when there is a click update: between clicks the impressions for all articles keep increasing, which decreases their ctrs in the same proportion, so unless there is a click the ctr does not need to be updated.

Currently the update query is like "UPDATE data_table SET impressions = impressions + 1, ctr = clicks / impressions WHERE something = something".

This means that although two fields are updated at once, only one query is executed.

Now the bottleneck is that these 500 updates per second are slowing down the selects on the same table. There are about 20 selects per second. So I thought of separating the tables: the updates happen on one table and the selects happen on another. The table that contains the impressions is updated very frequently, so moving the impression updates onto their own table really speeds things up. This means that the selects on data_table_1 will be faster too, and the ctr only needs to be updated when someone makes a click.
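
To make that concrete, here is a minimal sketch of the split structure in Python. It uses sqlite3 only so the example is self-contained (the production database is presumably MySQL), and the article_id key column is an assumption, since the question only shows "WHERE something = something":

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE data_table_1 (article_id INTEGER PRIMARY KEY, ctr REAL);
        CREATE TABLE data_table_2 (article_id INTEGER PRIMARY KEY,
                                   impressions INTEGER, clicks INTEGER);
    """)

    def record_impression(article_id):
        # ~500/s: touches only the write-heavy table; ctr stays stale on purpose
        conn.execute("UPDATE data_table_2 SET impressions = impressions + 1 "
                     "WHERE article_id = ?", (article_id,))

    def record_click(article_id):
        # ~1/s: only now is the denormalized ctr refreshed in the read table
        conn.execute("UPDATE data_table_2 SET clicks = clicks + 1 "
                     "WHERE article_id = ?", (article_id,))
        conn.execute("UPDATE data_table_1 SET ctr = "
                     "(SELECT CAST(clicks AS REAL) / impressions "
                     " FROM data_table_2 WHERE article_id = ?) "
                     "WHERE article_id = ?", (article_id, article_id))

    def top_articles(limit=10):
        # ~20/s: selects never touch the hot data_table_2
        return conn.execute("SELECT article_id, ctr FROM data_table_1 "
                            "ORDER BY ctr DESC LIMIT ?", (limit,)).fetchall()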

So, I just wanted to know whether I should use the new table structure or not. What are your suggestions? Pros and cons of my proposal!

Solution

Maybe this is not a direct answer to your question, but I think it is important to note.

I think you should consider using NoSQL databases like Redis, MemcacheDB, MongoDB, or CouchDB. Relational DBMSs are not well suited for this kind of use. For example, every time you update a column (UPDATE data_table SET impressions = impressions + 1) the cached query results for that table are invalidated, and the DB has to hit the disk.
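
For illustration, a rough sketch of how this could look in Redis, where counters live in memory and a sorted set keyed by ctr gives you the "sort by ctr" ordering directly. It assumes the redis-py client and a running server; the key names are made up for the example:

    import redis  # assumes the redis-py package and a running Redis server

    r = redis.Redis()

    def record_impression(article_id):
        # in-memory counter: 500 increments/s with no disk hit per update
        r.hincrby(f"article:{article_id}", "impressions", 1)

    def record_click(article_id):
        key = f"article:{article_id}"
        clicks = r.hincrby(key, "clicks", 1)
        impressions = int(r.hget(key, "impressions") or 1)
        # a sorted set keyed by ctr makes "sort by ctr" a single range query
        r.zadd("articles_by_ctr", {str(article_id): clicks / impressions})

    def top_articles(limit=10):
        return r.zrevrange("articles_by_ctr", 0, limit - 1, withscores=True)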

Another thing you can consider is using Memcache and bulk-writing that data to disk after some period of time.

For example, if you can afford to lose some impressions (remember that Memcache does not persist data), you can do the impressions++ in Memcache and update the data in the DB every 5 minutes. It would decrease your load significantly.
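
A rough sketch of that pattern, assuming the pymemcache client (any Memcache client with an incr operation would do) and the same hypothetical article_id key:

    from pymemcache.client.base import Client  # assumes pymemcache + memcached

    mc = Client(("localhost", 11211))

    def record_impression(article_id):
        key = f"impressions:{article_id}"
        if mc.incr(key, 1) is None:   # incr returns None if the key is missing
            mc.add(key, 1)            # first impression since the last flush

    def flush_to_db(article_ids, db_conn):
        # run every ~5 minutes; counts gathered since the last flush are lost
        # if memcached restarts -- the trade-off described above
        for article_id in article_ids:
            key = f"impressions:{article_id}"
            count = int(mc.get(key) or 0)
            if count:
                db_conn.execute(
                    "UPDATE data_table SET impressions = impressions + ? "
                    "WHERE article_id = ?", (count, article_id))
                mc.set(key, 0)  # sketch ignores the get/set race window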

I hope it helps you.

EDIT:

Storing the CTR is a good idea; it's called "denormalization", and it may work well in your application since it's a frequently required value.

Other tips

First of all, I assume the table is well indexed, so the something = something predicate quickly locates the corresponding row, right?

Further assuming that your bottleneck is disk throughput because of the high update rate, what about not storing the ctr value at all, since it can easily be calculated on the fly? As you seem to be limited by your updates, writing only one field instead of two should roughly halve the impact of getting the data to disk. In such a scenario, where the CPU is probably relatively idle, calculating clicks / impressions for every result should be a non-issue. Your table-splitting approach will pay off with considerable benefits (again assuming disk is the limiting factor, which can easily be checked by looking at CPU utilization) if, and only if, the tables are on two different disks.
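
A minimal sketch of that idea: drop the stored ctr column entirely and compute it at read time. Again sqlite3 keeps the example self-contained, and article_id is an assumed key:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_table (article_id INTEGER PRIMARY KEY, "
                 "impressions INTEGER, clicks INTEGER)")

    def top_articles(limit=10):
        # the division runs at read time: ~20 selects/s of cheap CPU work,
        # in exchange for halving the write volume of the 500 updates/s
        return conn.execute(
            "SELECT article_id, CAST(clicks AS REAL) / impressions AS ctr "
            "FROM data_table WHERE impressions > 0 "
            "ORDER BY ctr DESC LIMIT ?", (limit,)).fetchall()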

If the CPU turns out to be the limiting factor, then it is probably because the something = something predicate is expensive to evaluate, in which case simplifying that predicate should be the main concern, not splitting the tables.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow