Question

I'm building a predictive dialer where speed is critical. To dial a number, I pull customer information from tables and build call files for a PBX to act on.

Currently I have a table for each area code and we dial one area code at a time, but we are switching to a model where we dial based on territories, each of which spans multiple zip codes. Some area codes exist in multiple zip codes. Each table has new numbers added monthly and is scrubbed by comparing against a do-not-call list of several million numbers.

So my question is, how should I most efficiently organize this data?

One large table seems counterproductive, as we're talking millions of records of scrubbed data.

My current line of reasoning is to maintain the area code tables for importing and scrubbing, then copy the scrubbed records over to territory tables, built by searching the area code tables for the zip codes in each territory.

I currently index the tables by an auto-incremented INT primary key, a unique phone number, and a status which tracks numbers that have already been called or are on the do-not-call list. When building the call file, I mark the record as queued, then mark it according to how the call went once it has completed, so for each call there is one search and two updates.

The search looks in the area-code table for a certain status. The updates occur based on the record ID.
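
Roughly, that cycle looks like this in SQL (a minimal sketch; the table name, statuses, and IDs are illustrative, not my actual schema):

    -- Search: pull a batch of numbers with the desired status for the call file.
    SELECT id, phone_number
    FROM area_code_212
    WHERE status = 'fresh'
    LIMIT 100;

    -- First update: mark the selected records as queued before handing
    -- them to the PBX.
    UPDATE area_code_212
    SET status = 'queued'
    WHERE id IN (1001, 1002, 1003);

    -- Second update: once the call completes, record the outcome by record ID.
    UPDATE area_code_212
    SET status = 'answered'  -- or 'no_answer', 'do_not_call', etc.
    WHERE id = 1001;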

The meat of the question is this: Would it be quicker to organize by zip codes and search by status, or keep them organized by area code and search by status AND zip code? Or would the better bet be to create a new table each time we set up a territory built from the area code tables?

Forgive me if this seems like a dumb question, I've been teaching myself SQL as I've been building this, and the nuances of database design and performance are a bit beyond my skill set.

The total size of the tables is at 2 million rows and growing.

Solution

The meat of the question is this: Would it be quicker to organize by zip codes and search by status, or keep them organized by area code and search by status AND zip code? Or would the better bet be to create a new table each time we set up a territory built from the area code tables?

Answer: don't do any of these unless you really know what you're doing. Instead, create one table to hold all rows of this entity, using column values to distinguish between the various zip codes and territories. Possibly create zip code and territory tables, and add foreign keys referencing them.
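
As a rough sketch of that shape (all names here are illustrative, assuming MySQL/InnoDB and that each zip code belongs to exactly one territory):

    CREATE TABLE territories (
        id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE zip_codes (
        zip          CHAR(5) PRIMARY KEY,
        territory_id INT UNSIGNED NOT NULL,
        FOREIGN KEY (territory_id) REFERENCES territories (id)
    ) ENGINE=InnoDB;

    -- One table for every phone number; zip and area code are columns,
    -- not separate tables.
    CREATE TABLE phone_numbers (
        id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        phone_number CHAR(10) NOT NULL UNIQUE,
        area_code    CHAR(3)  NOT NULL,
        zip          CHAR(5)  NOT NULL,
        status       ENUM('fresh','queued','answered','no_answer','do_not_call')
                     NOT NULL DEFAULT 'fresh',
        FOREIGN KEY (zip) REFERENCES zip_codes (zip)
    ) ENGINE=InnoDB;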

Creating separate tables based on attribute values is not a typical solution and will introduce many additional difficulties (for example, if you organize tables by zip code, how do you search by territory across all zip codes?).

The more common solution, and one which databases excel at, is to use indexes. Using multiple indexes, a database can provide fast access to a table for searches on multiple different columns.
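
For example, with the single-table sketch above, a couple of composite indexes can serve both dialing patterns (again illustrative, and worth verifying against your real queries):

    -- Serves "numbers with status X in a given zip" directly, and
    -- territory dialing via a join through zip_codes.
    CREATE INDEX idx_zip_status  ON phone_numbers (zip, status);

    -- Serves the old one-area-code-at-a-time pattern.
    CREATE INDEX idx_area_status ON phone_numbers (area_code, status);

    -- Dialing a territory is then a join rather than a separate table:
    SELECT p.id, p.phone_number
    FROM phone_numbers AS p
    JOIN zip_codes AS z ON z.zip = p.zip
    WHERE z.territory_id = 3
      AND p.status = 'fresh'
    LIMIT 100;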

So the basic strategy I would recommend:

  1. create a logical data model
  2. implement the physical data model
  3. analyze the performance
    • explain <query> is very handy (see the example after this list)
    • if it's not good enough, look at adding more indexes, improving use of existing indexes (read up on clustered and covering indexes), or selective denormalization
    • what is the balance between selects and inserts? Indexes can slow down inserts
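
As an example of step 3, running EXPLAIN on the territory query sketched earlier shows whether MySQL uses the intended index or falls back to a full scan:

    EXPLAIN
    SELECT p.id, p.phone_number
    FROM phone_numbers AS p
    JOIN zip_codes AS z ON z.zip = p.zip
    WHERE z.territory_id = 3
      AND p.status = 'fresh';
    -- In the output, check that "key" shows idx_zip_status and that the
    -- "rows" estimate is small; "type: ALL" would mean a full table scan.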

It's also important to note that two million rows is not a huge amount for MySQL (although of course, this depends on load). The bottom line is that optimization is a very tricky subject whose answer depends on your specific situation.

Other suggestions

If you want speed, normalizing the data is not what you want; performance will degrade as the data grows.

Performance in this case will be tied to the speed of your hard disks. An SSD might boost performance a lot, but SSDs are more expensive and you will have space issues.

A trade-off can be to use rotational disks and keep the data denormalized, indexing the fields you use for searching.

A more clever strategy is to use integer codes for values that repeat across the data set, and resolve the real values of zip codes, cities and so on from memcached (zip codes, country names, and cities are data that do not change), but this approach adds new dependencies to the problem.
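
A sketch of that idea (illustrative names): the big table stores only a small integer code, and a lookup table, whose contents the application can also cache in memcached, maps codes back to real values:

    -- Small, effectively immutable lookup table; the application layer
    -- can mirror it in memcached.
    CREATE TABLE cities (
        id   SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL UNIQUE
    );

    -- The big table repeats only the 2-byte code, not the city name,
    -- across millions of rows.
    CREATE TABLE records (
        id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        phone_number CHAR(10) NOT NULL,
        city_id      SMALLINT UNSIGNED NOT NULL,
        FOREIGN KEY (city_id) REFERENCES cities (id)
    );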

I have a table with 250 million rows, tagged with country, city, postal code, and ISP. I keep the main data on SSDs and the geographical data in memcached; when I need to search, a logic layer does the lookup and translates the values into the codes stored in the database.

TaoNonnanes, there is no need to create a territory table every time from the area code tables.

Just make one territory table with a foreign key to the area code table, add indexes on the territory and area code tables, and try to normalize your entire database to at least 3NF. I don't know what your current level of normalization is.
