Question

I'm working with an SQL database right now, and this has always made me curious, but Google searches don't turn up much: Why the strict data types?

I understand why you'd have a few different data types; for example, differentiating between binary and plain text data is important. Rather than storing the 1s and 0s of binary data as plaintext characters, I now understand that it's more efficient to store binary data in its own format.

But what I don't understand is what the benefit is of having so many different data types:

  • Why mediumtext, longtext, and text?
  • Why decimal, float, and int?
  • etc.

What is the benefit of telling the database "There'll only be 256 bytes of plain text data in entries to this column." or "This column can have text entries of up to 16,777,215 bytes"?

Is it a performance benefit? If so, why does knowing the size of the entry beforehand help performance? Or is it something else altogether?

Solution

SQL is a statically-typed language. This means you have to know what type a variable (or field, in this case) is before you can use it. This is the opposite of dynamically-typed languages, where that is not necessarily the case.

At its core, SQL is designed to define data (DDL) and access data (DML) in a relational database engine. Static typing presents several benefits over dynamic typing to this type of system.

  • Indexes, used for quickly accessing specific records, work really well when the size is fixed. Consider a query that utilizes an index, possibly with multiple fields: if the data types and sizes are known ahead of time, I can very quickly compare my predicate (WHERE clause or JOIN criteria) against values in the index and find the desired records faster.

  • Consider two integer values. In a dynamic type system, they may be of variable size (think Java BigInteger, or Python's built-in arbitrary-precision integers). If I want to compare the integers, I need to know their bit length first. This is an aspect of integer comparison that is largely hidden by modern languages, but is very real at the CPU level. If the sizes are fixed and known ahead of time, an entire step is removed from the process (see the sketch after this list). Again, databases are supposed to be able to process zillions of transactions as quickly as possible. Speed is king.

  • SQL was designed back in the 1970s. In those early days of computing, memory was at a premium. Limiting data helped keep storage requirements in check. If an integer never grows past one byte, why allocate more storage for it? That is wasted space in an era of limited memory. Even in modern times, those extra wasted bytes can add up and kill the performance of a CPU's cache. Remember, these are database engines that may be servicing hundreds of transactions per second, not just your little development environment.

  • Along the lines of limited storage, it is helpful to be able to fit a single record in a single page in memory. Once you go over one page, there are more page misses and more slow memory access. Newer engines have optimizations to make this less of an issue, but it is still there. By sizing data appropriately, you can mitigate this risk.

  • More so in modern times, SQL is used to plug into other languages via an ORM, ODBC, or some other layer. Some of these languages require strong, static types. It is best to conform to the stricter requirements, as dynamically typed languages can deal with static types more easily than the other way around.

  • SQL supports static typing because database engines need it for performance, as shown above.
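
As a rough sketch of the comparison point above (illustrative C, not any engine's actual code): a fixed-width key compares in one step, while a variable-length key drags a length check into every comparison.

    #include <stdint.h>
    #include <string.h>

    /* Fixed-width key: one comparison, size known ahead of time. */
    static int cmp_fixed(int64_t a, int64_t b) {
        return (a > b) - (a < b);
    }

    /* Variable-length key: the lengths must be consulted first, and
       the byte-by-byte comparison cost depends on the data. */
    struct varkey {
        uint32_t len;
        const unsigned char *bytes;
    };

    static int cmp_var(struct varkey a, struct varkey b) {
        uint32_t n = a.len < b.len ? a.len : b.len;
        int c = memcmp(a.bytes, b.bytes, n);
        if (c != 0) return c;
        return (a.len > b.len) - (a.len < b.len);
    }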

It is interesting to note that there are implementations of SQL that are not strongly typed. SQLite is probably the most popular example of such a relational database engine. Then again, it is designed as an embedded database serving a single application on a single system, so the performance concerns are not as pronounced as in, say, an enterprise Oracle database servicing millions of requests per minute.

OTHER TIPS

First: plain text is binary too (it is stored as actual on/off bits, not as the UTF-8 or ASCII characters "0" and "1").

That said, some of the reasons are:

  • Business/design constraints: allowing the number 7626355112 in the HEIGHT column of the PERSON table would be wrong. Allowing "Howya" in the DATE column of an INVOICE would be wrong.
  • Less error-prone code: you don't have to write code to make sure the data retrieved from a date column is really a date. If column types were dynamic, you would have to make a lot of type checks when reading them (see the sketch after this list).
  • Computing efficiency: if a column is of type INTEGER and you SUM() it, the RDBMS doesn't have to apply floating-point arithmetic.
  • Storage efficiency: stating that a column is VARCHAR(10) lets the RDBMS allocate space more precisely.
  • Referential integrity and uniqueness: the primary key (or foreign keys) of a table shouldn't allow floats, since floating-point equality is tricky, so you must declare them with a non-float type, such as character or integer.
  • There exist RDBMSs with dynamic (not strict) column types; SQLite is one. It uses the concept of "type affinity" while still allowing you to insert virtually anything into any column without complaining. There are trade-offs that will not be discussed here.
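
To make the "less error-prone code" point concrete, here is a minimal C sketch assuming a hypothetical dynamically typed cell: every read has to branch on a type tag before the value can be used, a check a statically typed INTEGER column never needs.

    #include <stdio.h>

    /* Hypothetical dynamically typed cell: a tag travels with the value. */
    enum tag { TAG_INT, TAG_TEXT };

    struct cell {
        enum tag tag;
        union { long i; const char *s; } as;
    };

    long read_quantity(struct cell c) {
        if (c.tag != TAG_INT) {          /* runtime check on every read */
            fprintf(stderr, "not an integer\n");
            return 0;
        }
        return c.as.i;
    }

    /* With a static INTEGER column, the engine hands back a long directly:
       no tag, no branch, no error path. */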

It is so that the underlying code that the database is written in can allocate and use fixed-size records: if it knows that a specific field can contain 0 to 256 characters of text, then it can allocate a block of 256 bytes to store it in.

This makes things much faster: you are not having to allocate additional storage as the user types, and since a given field always starts x bytes into the record, a search or select on that field knows to always look x bytes into each record, and so on.
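
As a rough C analogy (field names are illustrative), a fixed-size record can be declared once, and every record then occupies exactly the same number of bytes at a predictable location:

    #include <stdint.h>

    /* Every record has the same size, so slot n lives at
       n * sizeof(struct record), and name always starts at the
       same offset within the slot. */
    struct record {
        int32_t id;
        char    name[256];   /* fixed block for up to 256 characters */
    };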

When the columns of a database are given defined types, the types are usually defined themselves to have a certain size in bits. As a result:

1) when the database engine is traversing the rows in a table, it doesn't have to do any fancy parsing to determine where each record ends; it can just know that each row consists of, say, 32 bytes, and so to get the next record it's sufficient to add 32 bytes to the current record's location.

2) when looking up a field within a row, it is possible to know an exact offset for that field again without parsing anything, so column lookups are a simple arithmetic operation rather than a potentially costly data processing one.
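
A minimal sketch of that arithmetic in C, assuming a table stored as a flat array of 32-byte rows with the field of interest at byte offset 8 (both numbers are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    enum { ROW_SIZE = 32, FIELD_OFFSET = 8 };

    /* Locating a field in row row_index is pure arithmetic;
       nothing needs to be parsed along the way. */
    static const uint8_t *field_ptr(const uint8_t *table, size_t row_index) {
        return table + row_index * ROW_SIZE + FIELD_OFFSET;
    }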

You asked why DBMSs have static data types.

  1. Speed of lookup. The whole point of a DBMS is to store far more data than you could possibly load into a program. Think "all the credit card slips generated in the world in the last ten years". In order to search such data efficiently, fixed length data types are helpful. This is especially true for structured data like date stamps and account numbers. If you know what you're dealing with ahead of time, it's easier to load into efficient indexes.

  2. Integrity and constraints. It's easier to keep data clean if it has fixed data types.

  3. History. RDBMSs got started when computers had but a few megabytes of RAM, and terabyte-scale storage was enormously expensive. Saving a dozen bytes in each row of a table could save thousands of dollars and hours of time under those circumstances.

  4. The curse of the customer base. RDBMSs today are very complex, highly optimized software packages, and they have been in use for decades accumulating data. They're mature. They work. An RDBMS crash resulting in large-scale data loss is vanishingly rare these days. Switching to something with a more flexible data typing system isn't worth the cost or risk to most organizations.

Analogy: it may be blindingly obvious that urban subway systems would work better (quieter, faster, more power-efficient) on a narrower rail gauge. But how are you going to change all the rails in the New York City subway system to realize those improvements? You aren't, so you optimize what you have.

In general, the more detail you tell the database about what you're storing, the more it can try to optimize various performance metrics related to that data, such as how much space to allocate on disc or how much memory to allocate when retrieving it.

Why mediumtext, longtext, and text?

Not sure which database you're using, so I will have to guess: those type names suggest MySQL, where all three have upper limits (TEXT holds up to 65,535 bytes, MEDIUMTEXT up to 16,777,215 bytes, and LONGTEXT up to 4,294,967,295 bytes). Using datatypes for text that have upper limits tells the database how much storage space it will need for each record. It's also possible that some databases have different ways of storing large (possibly unlimited) text vs. small fixed-length text (this may vary by database, check your manual to see about yours).

Why decimal, float, and int?

Different levels of precision require different amounts of storage, and not every use requires the highest degree of precision. For example, see here: https://docs.oracle.com/cd/B28359_01/server.111/b28286/sql_elements001.htm#SQLRF50950

Oracle has quite a number of different numeric types with different storage requirements and different capabilities in terms of the level of precision and the size of number that can be represented.
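
A small C illustration of the trade-off those numeric types expose: binary floating point cannot represent most decimal fractions exactly, while integer arithmetic on the smallest unit (cents, in this sketch) stays exact.

    #include <stdio.h>

    int main(void) {
        /* FLOAT-style arithmetic: 0.10 has no exact binary
           representation, so error accumulates. */
        double f = 0.0;
        for (int i = 0; i < 1000; i++) f += 0.10;
        printf("float sum:   %.15f\n", f);   /* slightly off from 100 */

        /* DECIMAL/INT-style arithmetic: count exact cents instead. */
        long long cents = 0;
        for (int i = 0; i < 1000; i++) cents += 10;
        printf("integer sum: %lld.%02lld\n",
               cents / 100, cents % 100);    /* exactly 100.00 */
        return 0;
    }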

To some extent, it's historical.

Once upon a time, tabular data was stored in files composed of fixed-length records in turn composed of pre-defined fields such that a given field was always of the same type and in the same place in each and every record. This made processing efficient and limited the complexity of coding.

Add some indexes to such a file and you have the beginnings of a relational database.

As relational databases evolved, they began to introduce more data types and storage options, including variable-length text or binary fields. But, this introduced variable-length records, and broke the ability to consistently locate records via calculation or fields via a fixed offset. No matter, machines are much more powerful today than they were back then.

Sometimes it's useful to set a specific size for a field to help enforce some bit of business logic - say 10 digits for a North American phone number. Much of the time it is just a bit of computing legacy.

If a database uses fixed-sized records, any record in the database will continue to fit, in the same location, even if its contents are changed. By contrast, if a database tries to store records using exactly the amount of storage needed for their fields, changing Emma Smith's name to Emma Johnson may cause her record to be too big to fit in its present location. If the record is moved to someplace with enough room, any index that keeps track of where it is would need to be updated to reflect the new location.

There are a variety of ways to reduce the cost associated with such updates. For example, if the system maintains a list of record numbers and data locations, that list will be the only thing that needs to be updated if a record moves. Unfortunately, such approaches still have significant cost: keeping a mapping between record numbers and locations means every retrieval requires an extra step to look up the location for a given record number. Using fixed-sized records may seem inefficient, but it makes things a lot simpler.
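
A minimal C sketch of that indirection, using a hypothetical slot table: moving a record touches one table entry, at the price of an extra lookup on every read.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical record-number -> byte-offset table. */
    struct slot_table {
        size_t offsets[1024];   /* offsets[record_no] = record location */
    };

    /* Every read pays one extra step: the offset lookup. */
    static const uint8_t *fetch(const struct slot_table *t,
                                const uint8_t *heap, size_t record_no) {
        return heap + t->offsets[record_no];
    }

    /* Moving a record only updates the table; anything that refers
       to the record by number stays valid. */
    static void move_record(struct slot_table *t, size_t record_no,
                            size_t new_offset) {
        t->offsets[record_no] = new_offset;
    }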

For a lot of what you do as a web developer, there is no need to understand what's happening "under the hood". There are times, however, when it helps.

What is the benefit of telling the database "There'll only be 256 bytes of plain text data in entries to this column." or "This column can have text entries of up to 16,777,215 bytes"?

As you suspect, the reason is to do with efficiency. The abstractions leak. A query like SELECT author FROM books can run quite quickly when the size of all fields in the table are known.

As Joel says,

How does a relational database implement SELECT author FROM books? In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes, and every field is always at a fixed offset from the beginning of the row. So, for example, if each record in the books table is 100 bytes long, and the author field is at offset 23, then there are authors stored at byte 23, 123, 223, 323, etc. What is the code to move to the next record in the result of this query? Basically, it’s this:

pointer += 100;

One CPU instruction. Faaaaaaaaaast.
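
Fleshed out slightly (a C sketch with the quote's illustrative numbers, not any engine's actual code), the whole scan really is that one addition per row:

    #include <stdio.h>

    enum { ROW_SIZE = 100, AUTHOR_OFFSET = 23, AUTHOR_LEN = 30 };

    /* SELECT author FROM books over fixed-size rows. */
    static void select_authors(const char *rows, size_t n_rows) {
        const char *pointer = rows;
        for (size_t i = 0; i < n_rows; i++) {
            printf("%.*s\n", AUTHOR_LEN, pointer + AUTHOR_OFFSET);
            pointer += ROW_SIZE;   /* the one instruction from the quote */
        }
    }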

A lot of the time, you're working far enough away from the nitty-gritty underpinnings that you don't need to care about them. As a PHP-based web dev, do you care about how many CPU instructions your code uses? Most of the time, no, not really. But sometimes it's useful to know, for two reasons: it can explain decisions made by your libraries, and sometimes you do need to care about speed in your own code.

Licensed under: CC-BY-SA with attribution