Question

I want to know what specific problems, solutions, and best practices arise when working with huge databases.

By "huge" I mean databases with tables containing millions of rows, and/or databases holding petabytes of data.

Platform-oriented answers will be great too.


Solution

Some ideas

  • Learn the details of how the specific database engine works

  • How to optimize queries (hints, execution plans)

  • How to tune the database (not only indexes, but physical storage and representation, OS integration).

  • Query "tricks", like temporary tables to store intermediate results that can be reused

  • How to evaluate the necessity of denormalization for performance improvement

  • How to use profiling tools for the database, to identify the bottlenecks.
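The temporary-table trick above can be sketched with Python's built-in sqlite3 module. The `orders` table and its columns are hypothetical, purely for illustration: an expensive aggregate is materialized once into a temp table and then reused by several follow-up queries instead of being recomputed each time.

```python
import sqlite3

# Hypothetical schema for illustration: an "orders" table whose
# expensive aggregate we want to reuse across several queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 20.0), (2, 5.0)])

# Materialize the expensive intermediate result once...
conn.execute("""
    CREATE TEMP TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# ...then reuse it in follow-up queries without recomputing the aggregate.
top = conn.execute(
    "SELECT customer_id FROM customer_totals ORDER BY total DESC LIMIT 1"
).fetchone()
avg = conn.execute("SELECT AVG(total) FROM customer_totals").fetchone()
print(top[0], avg[0])  # 1 17.5
```

On a real engine the same idea applies with `CREATE TEMPORARY TABLE` (or table variables / CTEs, depending on platform); the win grows with the cost of the intermediate computation.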

OTHER TIPS

A couple of pieces of advice from a production DBA (my experience is MS SQL, but these should apply to other platforms):

  • Maintenance becomes a significant problem (nightly backups, DBCCs, weekly reindex/optimization jobs, etc). It's very easy to start exceeding a reasonable nightly or weekend maintenance window. This isn't just a technical issue; it's also a business issue ("what do you mean, it'll take 4 hours to restore the database from the last good backup?")

  • Developers need to understand that they may need to work differently. "You mean I can't just DELETE (500m rows) FROM MassiveTable and expect it to work?"
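The usual fix for that DELETE problem is to work in small batches so each transaction (and the log it generates) stays bounded, committing between batches. A minimal sketch using sqlite3, with a hypothetical `MassiveTable` and an `expired` flag standing in for the real delete predicate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MassiveTable (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany("INSERT INTO MassiveTable (expired) VALUES (?)",
                 [(i % 2,) for i in range(10_000)])
conn.commit()

BATCH = 1000  # many small, log-friendly transactions instead of one huge one
deleted = 0
while True:
    cur = conn.execute(
        "DELETE FROM MassiveTable WHERE rowid IN "
        "(SELECT rowid FROM MassiveTable WHERE expired = 1 LIMIT ?)",
        (BATCH,))
    conn.commit()          # release locks / let the log truncate between batches
    deleted += cur.rowcount
    if cur.rowcount < BATCH:
        break

print(deleted)  # 5000
```

On SQL Server the equivalent is typically `DELETE TOP (n) ...` in a loop; the batch size is a trade-off between total runtime and how long each transaction holds locks.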

I'm sure I'll think of more...

My first advice would be to hire someone who knows what they are doing and not rely on SO, otherwise you could be in for some extremely expensive mistakes. My second would be to choose the right platform hardware and software. The details will depend very much on requirements.

I highly recommend reading this presentation on SQL antipatterns: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back

The presentation helped me a lot in finding solutions to situations that seemed to have no way out.

Any RDBMS can suffer from poor performance if it gets very large, especially when complex join conditions are in use. Database schemas need to be designed to scale for large amounts of traffic, too. Most systems are pretty good at handling loads, but you can also run into issues when you have one database that needs to be distributed across multiple machines.

A lot of new tools are popping up to deal with database scalability. One of the most promising is Memcached, which keeps frequently accessed data in memory, allowing much faster access and helping with synchronization between multiple database servers. There are also NoSQL solutions, which augment traditional SQL systems with architectures that do not enforce schemas.
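The way Memcached typically sits in front of a database is the "cache-aside" pattern. Here is a minimal sketch in which a plain dict stands in for the Memcached client (the get/set semantics are the same; in production you would use a real client such as pymemcache), and the `users` table is hypothetical:

```python
import sqlite3

# A plain dict stands in for a Memcached client in this sketch.
cache = {}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                          # 1. try the cache first
        return cache[key]
    row = conn.execute("SELECT name FROM users WHERE id = ?",
                       (user_id,)).fetchone()  # 2. fall through to the database
    if row is None:
        return None
    cache[key] = row[0]                        # 3. populate the cache for next time
    return row[0]

print(get_user(1))  # hits the database
print(get_user(1))  # served from cache
```

A real deployment also needs an invalidation strategy (expiry times, or explicit cache deletes on write), which is where most of the difficulty lives.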

Some examples of NoSQL technologies are Cassandra, CouchDB, Google BigTable, MongoDB. Some people swear that these systems will become crucial in managing "the coming data explosion".

There are two aspects of a database that are more important than size, as far as design and management goes.

The first is complexity. How many user tables are there? How many columns in those tables? A database with several hundred user tables in the schema and over a thousand columns in those tables is very complex. A database with a half a dozen tables is not very complex, even if it contains petabytes of data.

The second is scope of data sharing. If a database is built to share data among six or more applications, developed by separate programming teams, you should design and manage it very differently than you would a database that's embedded in a single application.

Most of the database questions asked in SO pertain to single application databases.

Here are a few things to learn, in addition to what's already been mentioned.

Learn the difference between table partition and table decomposition. Some people decompose tables into multiple tables all with the same columns, when partitioning would serve them better.
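The difference is visible in the queries you end up writing. A sketch with sqlite3 (which has no native partitioning, so the single table below stands in for an engine-managed partitioned table; the `sales` schema is hypothetical): decomposition forces every cross-month query to UNION ALL the pieces by hand, while a partitioned table keeps queries written against one logical table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Decomposition: one physical table per month, identical columns.
# Every cross-month query must stitch the pieces together manually.
for m in ("2024_01", "2024_02"):
    conn.execute(f"CREATE TABLE sales_{m} (amount REAL)")
conn.execute("INSERT INTO sales_2024_01 VALUES (100.0)")
conn.execute("INSERT INTO sales_2024_02 VALUES (200.0)")
total_decomposed = conn.execute(
    "SELECT SUM(amount) FROM (SELECT amount FROM sales_2024_01 "
    "UNION ALL SELECT amount FROM sales_2024_02)").fetchone()[0]

# Partitioning (sketch): one logical table with a partition-key column;
# the engine routes rows to partitions, and queries stay simple.
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2024-01", 100.0), ("2024-02", 200.0)])
total_partitioned = conn.execute(
    "SELECT SUM(amount) FROM sales").fetchone()[0]

print(total_decomposed, total_partitioned)  # 300.0 300.0
```

With real partitioning (e.g. PostgreSQL declarative partitioning or SQL Server partitioned tables), the engine can also prune partitions automatically when a query filters on the partition key, which decomposed tables cannot do for you.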

Learn the real difference between the graph model of data and the relational model of data. Some people design databases as if foreign keys were essentially the same as pointers. What they end up with is a system that captures all the slowness of a relational system and all the unmanageability of a graph system.

(Note: the graph model is often called the hierarchical or network model.)

Designing a real relational database is much more subtle, and much more worthwhile, than designing a database that pretends to be modeled relationally but is really graph modeled.
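One concrete symptom of pointer-style thinking is fetching related rows one at a time instead of asking the database a set-based question. A sketch of both styles against a hypothetical `employees` table (names and schema are made up for illustration); they return the same answer, but the pointer-chasing version issues one query per related row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, "
             "manager_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, None, "ceo"), (2, 1, "ana"), (3, 1, "bob")])

# Pointer style: treat the foreign key like a pointer and chase it
# row by row -- N extra round trips to the database.
def reports_pointer_style(manager_id):
    names = []
    for (emp_id,) in conn.execute(
            "SELECT id FROM employees WHERE manager_id = ?", (manager_id,)):
        row = conn.execute("SELECT name FROM employees WHERE id = ?",
                           (emp_id,)).fetchone()
        names.append(row[0])
    return names

# Relational style: one set-based query does the same work.
def reports_set_style(manager_id):
    return [r[0] for r in conn.execute(
        "SELECT name FROM employees WHERE manager_id = ?", (manager_id,))]

print(reports_pointer_style(1))  # ['ana', 'bob']
print(reports_set_style(1))      # ['ana', 'bob']
```

On a table with millions of rows the pointer-chasing pattern (often hidden inside ORM lazy loading) is exactly the "all the slowness of a relational system" failure mode described above.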

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow