Question

Yesterday I was talking with a "hobby" programmer (I myself am a professional programmer). We came across some of his work, and he said he always queries all columns in his database (even in production code).

I tried to convince him not to do so, but haven't been successful yet. In my opinion a programmer should only query what is actually needed, for the sake of "prettiness", efficiency, and reduced traffic. Am I mistaken in my view?


Solution

Think about what you're getting back, and how you bind those columns to variables in your code.

Now think what happens when someone updates the table schema to add (or remove) a column, even one you're not directly using.

Using select * is fine when you're typing queries by hand, but not when you're writing queries in code.

Other suggestions

Schema Changes

  • Fetch by order: if the code fetches data by column number, a change in the schema will shift those column numbers. This will mess up the application and bad things will happen.
  • Fetch by name: if the code fetches a column by name, such as foo, and another table in the query gains a column also named foo, the ambiguity can make it hard to get the right foo column back.

Either way, a schema change can cause problems with the extraction of the data.

Further, consider what happens if a column that was being used is removed from the table. The select * from ... still works, but the code errors out when trying to pull the data out of the result set. If the column is specified in the query, the query itself will error out instead, giving a clear indication of what and where the problem is.

Data overhead

Some columns can have a significant amount of data associated with them, and selecting * pulls all of it. Yep, there's that varchar(4096) on the 1000 rows you've selected, giving you up to an extra 4 megabytes of data that you don't need but that is sent across the wire anyway.

Related to the schema change, that varchar might not have existed when you first wrote the query, but now it's there.
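
A minimal sketch of the difference, with a hypothetical orders table (the table and column names are illustrative, not from the question):

create table orders (
    order_id    int,
    customer_id int,
    order_date  date,
    total       decimal(10,2),
    notes       varchar(4096)   -- large column the listing code never displays
);

-- pulls notes across the wire for every matching row
select * from orders where customer_id = 42;

-- only the columns the code actually binds
select order_id, order_date, total from orders where customer_id = 42;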

Failure to convey intent

When you select * and get back 20 columns but only need 2 of them, you are not conveying the intent of the code. Looking at a query that does a select *, one doesn't know which parts of it matter. Could I rewrite the query to use a faster plan by leaving some columns out? I don't know, because the intent of what the query returns isn't clear.


Let's look at some SQL fiddles that explore those schema changes a bit more.

First, the initial database: http://sqlfiddle.com/#!2/a67dd/1

DDL:

create table one (oneid int, data int, twoid int);
create table two (twoid int, other int);

insert into one values (1, 42, 2);
insert into two values (2, 43);

SQL:

select * from one join two on (one.twoid = two.twoid);

And the columns you get back are oneid=1, data=42, twoid=2, and other=43.

Now, what happens if I add a column to table one? http://sqlfiddle.com/#!2/cd0b0/1

alter table one add column other text;

update one set other = 'foo';

And my results from the same query as before are oneid=1, data=42, twoid=2, and other=foo.

A change in one of the tables disrupts the values returned by select *, and suddenly your binding of 'other' to an int throws an error and you don't know why.

If instead your SQL statement was

select 
    one.oneid, one.data, two.twoid, two.other
from one join two on (one.twoid = two.twoid);

The change to table one would not have disrupted your data. That query runs the same before the change and after the change.
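
The removed-column case mentioned earlier can be sketched the same way. This isn't one of the original fiddles, just an illustration using the same two tables:

alter table one drop column data;

-- still runs: the data column is silently gone, and the code only breaks
-- later, when it tries to read that column out of the result set
select * from one join two on (one.twoid = two.twoid);

-- fails immediately, because one.data no longer exists, pointing straight
-- at what broke and where
select
    one.oneid, one.data, two.twoid, two.other
from one join two on (one.twoid = two.twoid);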


Indexing

When you do a select * you are pulling every column from all the tables in the query for the rows that match the conditions, even from tables you really don't care about. While this means more data is transferred, there's another performance issue lurking further down the stack.

Indexes. (related on SO: How to use index in select statement?)

If you are pulling back lots of columns, the query optimizer may disregard an index, because it would still need to fetch all those columns anyway, and using the index and then fetching every column could take longer than simply doing a complete table scan.

If you are just selecting the, say, last name of a user (which you do a lot and so have an index on it), the database can do an index only scan (postgres wiki index only scan, mysql full table scan vs full index scan, Index-Only Scan: Avoiding Table Access).
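
As a rough sketch of that last-name case (a hypothetical users table; in Postgres the plan would show up as an Index Only Scan, other engines name it differently):

create table users (userid int primary key, first_name varchar(100), last_name varchar(100), bio text);
create index users_last_name_idx on users (last_name);

-- can often be answered from the index alone
explain select last_name from users where last_name = 'Smith';

-- has to visit the table rows to produce every column
explain select * from users where last_name = 'Smith';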

There is quite a bit of optimization around reading only from indexes when possible. The information on each index page can also be pulled in faster because there is less of it: you're not dragging in all those other columns that a select * would need. It is possible for an index-only scan to return results on the order of 100x faster (source: Select * is bad).

This isn't to say that a full index scan is great; it's still a full scan, but it's better than a full table scan. Once you start chasing down all the ways that select * hurts performance, you keep finding new ones.


Another concern: if it's a JOIN query and you're retrieving query results into an associative array (as could be the case in PHP), it's bug-prone.

The thing is that

  1. if table foo has columns id and name
  2. if table bar has columns id and address,
  3. and in your code you are using SELECT * FROM foo JOIN bar ON foo.id = bar.id

guess what happens when someone adds a column called name to the bar table.

The code will suddenly stop working properly, because the name column now appears in the results twice, and if you're storing the results in an associative array, the data from the second name (bar.name) will overwrite the first (foo.name)!

It's quite a nasty bug because it's very non-obvious. It can take a while to figure out, and there's no way the person adding another column to the table could have anticipated such an undesirable side effect.

(True story).

So, don't use *, be in control of what columns you are retrieving and use aliases where appropriate.
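
A minimal sketch of that fix, using the foo/bar layout from the list above (the alias names are just illustrative):

-- explicit list: a name column added to bar later cannot silently overwrite foo.name
select foo.id, foo.name, bar.address
from foo join bar on foo.id = bar.id;

-- and if both name columns really are needed, alias them apart
select foo.id, foo.name as foo_name, bar.name as bar_name, bar.address
from foo join bar on foo.id = bar.id;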

Querying every column might be perfectly legitimate in many cases.

Always querying every column isn't.

It's more work for your database engine, which has to go off and rummage around its internal metadata to work out which columns it needs to deal with before it can get on with the real business of actually getting the data and sending it back to you. OK, it's not the biggest overhead in the world, but system catalogs can be an appreciable bottleneck.

It's more work for your network, because you're pulling back any number of fields when you might only want one or two of them. If somebody [else] goes and adds a couple of dozen extra fields, all of which contain big chunks of text, your throughput suddenly goes through the floor, for no readily apparent reason. This is made worse if your "where" clause isn't particularly good and you're pulling back lots of rows as well: that's potentially a lot of data tromping its way across the network to you (i.e. it's going to be slow).

It's more work for your application, having to pull back and store all of this extra data that it quite probably doesn't care about.

You run the risk of columns changing their order. OK, you shouldn't have to worry about this (and you won't if you select only the columns you need), but if you grab them all at once and somebody [else] decides to rearrange the column order within the table, that carefully crafted CSV export you give to accounts down the hall suddenly goes all to pot, again for no readily apparent reason.

BTW, I've said "someone [else]" a couple of times, above. Remember that databases are inherently multi-user; you may not have the control over them that you think you do.

The short answer is: it depends on what database they use. Relational databases are optimized for extracting the data you need in a fast, reliable and atomic way. On large datasets and complex queries this is much faster, and probably safer, than SELECTing * and doing the equivalent of joins on the 'code' side. Key-value stores might not have such functionality implemented, or might not be mature enough to use in production.

That said, you can still populate whatever data structure you're using with SELECT * and work out the rest in code, but you'll run into performance bottlenecks if you want to scale.

The closest comparison is sorting data: you can use quicksort or bubble sort and the result will be correct, but it won't be optimal, and you'll definitely have issues when you introduce concurrency and need to sort atomically.

Of course, it's cheaper to add RAM and CPUs than to invest in a programmer who can write SQL queries and has even a vague understanding of what a JOIN is.

IMO, it's about being explicit vs. implicit. When I write code, I want it to work because I made it work, not just because all the parts happen to be there. If you query every column and your code works, you'll tend to move on. Later, if something changes and your code no longer works, it's a royal pain to debug lots of queries and functions looking for a value that should be there when the only reference to it is *.

Also, in an N-tiered approach, it's still best to isolate database schema disruptions to the data tier. If your data tier is passing * to the business logic, and most likely on to the presentation tier, you are expanding your debugging scope exponentially.

Because if the table gets new columns, then you get all of those even when you don't need them. With varchars this can become a lot of extra data that needs to travel from the DB.

Some databases may also store the non-fixed-length data separately to speed up access to the fixed-length parts; using select * defeats the purpose of that optimization.

Apart from the overhead, which you want to avoid in the first place, I would say that as a programmer you shouldn't depend on the column order defined by the database administrator. You select each column explicitly even if you need them all.

I don't see any reason why you shouldn't use * for the purpose it's built for: retrieving all the columns from a table. I see three cases:

  1. A column is added in the database and you want it in the code too. a) With *, the code will fail with a proper message. b) Without *, the code will work, but won't do what you expect, which is pretty bad.

  2. A column is added in the database and you do not want it in the code. a) With *, the code will fail; this means * no longer applies, since its semantics mean "retrieve all". b) Without *, the code will work.

  3. A column is removed. The code will fail either way.

Now the most common case is case 1 (since you used *, which means all, you most probably want all); without *, you can have code that works fine but doesn't do what is expected, which is much, much worse than code that fails with a proper error message.

I'm not taking into consideration code that retrieves column data by column index, which is error-prone in my opinion. It makes much more sense to retrieve it by column name.

Think of it this way: suppose you query all columns from a table that has just a few small string or numeric fields, totalling 100 KB of data. Bad practice, but it will perform. Now add a single field that holds, say, an image or a 10 MB Word document. Your fast query immediately and mysteriously starts performing poorly, just because a field was added to the table. You may not need that huge data element, but because you've done select * from the table, you get it anyway.
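
As a sketch of that scenario (the documents table and column names are hypothetical, and the blob type is MySQL-flavoured):

create table documents (doc_id int, title varchar(200), created date);

-- fast: a handful of small fields per row
select * from documents;

alter table documents add column contents longblob;   -- the 10 MB attachments arrive

-- the same query now drags every attachment across the wire
select * from documents;

-- unaffected by the new column
select doc_id, title, created from documents;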
