Question

I am trying to build a database that will contain a large number of records, each with a lot of columns (fields), maybe around 200-300 fields in total across all tables. In a few years I expect to have about 40,000,000 to 60,000,000 records.

I plan to normalize the database, so I will have a lot of tables (about 30-40) and lots of joins in my queries. The database will be strictly US-related, meaning that every query will be scoped to one of the 50 states (a query won't be allowed to search, insert, etc. across multiple states, just one).

What can I do to have better performance?

Someone came up with the idea of giving each state its own set of tables, meaning I would have 50 copies of the 30-40 data tables (roughly 1,500-2,000 tables)! Should I even consider this type of approach?

The next idea was to use partitioning based on the US 50 states. How about this?

Any other way?

Solution

The best optimization is determined by the queries you run, not by your tables' structure.

If you want to use partitioning, this can be a great optimization, if the partitioning scheme supports the queries you need to optimize. For instance, you could partition by US state, and that would help queries against data for a specific state. MySQL supports "partition pruning" so that the query would only run against the specific partition -- but only if your query mentions a specific value for the column you used as the partition key.
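As a rough illustration of partitioning by state, here is a minimal sketch. The columns `id` and `date` and the three partitions shown are assumptions made up for the example, not part of the original question. A real table would need one partition for every state it stores, because a LIST-partitioned table rejects rows whose value matches no partition.

```sql
-- Hypothetical table layout; only MyTable and state come from the discussion.
CREATE TABLE MyTable (
  id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  state  CHAR(2) NOT NULL,
  `date` DATE NOT NULL,
  -- MySQL requires every unique key, including the primary key,
  -- to contain the partitioning column.
  PRIMARY KEY (id, state)
)
PARTITION BY LIST COLUMNS (state) (
  PARTITION p_ny VALUES IN ('NY'),
  PARTITION p_ca VALUES IN ('CA'),
  PARTITION p_tx VALUES IN ('TX')
  -- add one partition per remaining state
);
```

Note the side effect on the primary key: because `state` has to be part of it, `id` alone is no longer enforced as unique by that key.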

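You can always check whether partition pruning is effective by using EXPLAIN PARTITIONS (in MySQL 5.7 and later, plain EXPLAIN already includes a partitions column, and the PARTITIONS keyword was removed in MySQL 8.0):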

EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE state = 'NY';

That should report that the query uses a single partition.
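To make the "mentions a specific value" condition concrete, here is a small sketch of query shapes that can and cannot be pruned. It assumes MyTable is partitioned on `state`; the `id` column is hypothetical. On MySQL 5.6 and earlier, write EXPLAIN PARTITIONS instead of EXPLAIN to see the partitions column.

```sql
-- Compares the partition column to a constant: eligible for pruning.
EXPLAIN
SELECT COUNT(*) FROM MyTable WHERE state = 'NY';

-- Wraps the partition column in a function: the optimizer cannot map
-- this to a partition, so expect every partition to be listed.
EXPLAIN
SELECT COUNT(*) FROM MyTable WHERE LOWER(state) = 'ny';

-- No condition on state at all: nothing to prune on.
EXPLAIN
SELECT COUNT(*) FROM MyTable WHERE id = 12345;
```

If the application always knows the state, adding `state = ?` to every query, even ones that look up a row by another key, keeps them eligible for pruning.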

If you instead need to run queries by date, for example, then partitioning by state wouldn't help; MySQL would have to run the query against all 50 partitions.

EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE date > '2013-05-01';

That would list all partitions. There's a bit of overhead to query all partitions, so if this is your typical query, you should probably use range partitioning by date.
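For comparison, range partitioning by date might look like the sketch below. The table name `MyTableByDate`, the `id` column, and the yearly boundaries are assumptions chosen for illustration; pick boundaries that match how your queries slice the data.

```sql
-- Hypothetical date-partitioned variant of the table.
CREATE TABLE MyTableByDate (
  id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  state  CHAR(2) NOT NULL,
  `date` DATE NOT NULL,
  -- The partitioning column must again be part of the primary key.
  PRIMARY KEY (id, `date`)
)
PARTITION BY RANGE COLUMNS (`date`) (
  PARTITION p2012 VALUES LESS THAN ('2013-01-01'),
  PARTITION p2013 VALUES LESS THAN ('2014-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);

-- Rows newer than 2013-05-01 can only live in p2013 or pmax,
-- so this query is a candidate for pruning away p2012.
EXPLAIN
SELECT COUNT(*) FROM MyTableByDate WHERE `date` > '2013-05-01';
```

The trade-off runs the other way here: a query filtered only by `state` would now have to visit every date partition.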

So choose your partition key with the queries in mind.

Any other optimization technique follows a similar pattern -- it helps some queries, possibly to the disadvantage of other queries. So be sure you know which queries you need to optimize for, before you decide on the optimization method.


Re your comment:

Certainly there are many databases that have 40 million rows or more, but have good performance. They use different methods, including (in no particular order):

My point above is that you can't choose the best optimization method until you know the queries you need to optimize. Furthermore, the best choice may be different for different queries, and may even change over time as data or traffic grows. Optimization is a continual process, because you won't know where your bottlenecks are until after you see how your data grows and what query traffic your database receives.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow