Why Isn’t SQL More Refactorable? [closed]

https://softwareengineering.stackexchange.com/questions/391228

24-02-2021
Question

Everyone knows that new developers write long functions. As you progress, you get better at breaking your code into smaller pieces and experience teaches you the value of doing so.

Enter SQL. Yes, the SQL way of thinking about code is different from the procedural way of thinking about code, but this principle seems just as applicable.

Let’s say I have a query that takes the form:

select * from subQuery1 inner join subQuery2 left join subQuery3 left join subQuery4

Using some IDs or dates etc.

Those subqueries are complex themselves and may contain subqueries of their own. In no other programming context would I think that the logic for complex subqueries 1-4 belongs in line with my parent query that joins them all. It seems so straightforward that those subqueries should be defined as views, just like they would be functions if I were writing procedural code.
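A minimal sketch of that idea, wrapped in Python's built-in sqlite3 module so it can be run as-is. The tables, columns, and view names (customers, orders, order_totals, big_spenders) are invented for illustration and do not come from any real schema.

```python
import sqlite3

# Hypothetical schema, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 5.0);

    -- Each former subquery gets a name, the way a function would.
    CREATE VIEW order_totals AS
        SELECT customer_id,
               SUM(total) AS lifetime_total,
               COUNT(*)   AS order_count
        FROM orders
        GROUP BY customer_id;

    -- A view may build on another view, like one function calling another.
    CREATE VIEW big_spenders AS
        SELECT customer_id FROM order_totals WHERE lifetime_total >= 50;
""")

# The parent query now only joins named pieces together.
rows = conn.execute("""
    SELECT c.name, t.lifetime_total, t.order_count
    FROM customers AS c
    INNER JOIN order_totals AS t ON t.customer_id = c.id
    INNER JOIN big_spenders AS b ON b.customer_id = c.id
    ORDER BY c.name
""").fetchall()
```

The parent query stays short because the aggregation logic lives behind the view names, which is the property the function analogy is after.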

So why isn’t that common practice? Why do people so often write these long monolithic SQL queries? Why doesn’t SQL encourage extensive view usage just like procedural programming encourages extensive function usage? (In many enterprise environments, creating views isn’t even something that’s easily done. There are requests and approvals required. Imagine if other types of programmers had to submit a request each time they created a function!)

I’ve thought of three possible answers:

1. This is already common and I’m working with inexperienced people
2. Experienced programmers don’t write complex SQL because they prefer to solve hard data processing problems with procedural code
3. Something else

Solution

I think the main problem is that not all databases support Common Table Expressions.

My employer uses DB/2 for a great many things. The latest versions of it support CTEs, such that I'm able to do things like:

with custs as (
    select acct# as accountNumber, cfname as firstName, clname as lastName
    from wrdCsts
    where -- various criteria
)
, accounts as (
    select acct# as accountNumber, crBal as currentBalance
    from crzyAcctTbl
)
select firstName, lastName, currentBalance
from custs
inner join accounts on custs.accountNumber = accounts.accountNumber

The result is that we can have heavily abbreviated table / field names and I'm essentially creating temp views, with more legible names, which I can then use. Sure, the query gets longer. But the result is that I can write something which is pretty clearly separated (using CTEs the way you'd use functions to get DRY) and end up with code that's quite legible. And because I'm able to break out my subqueries, and have one subquery reference another, it's not all "inline." I have, on occasion, written one CTE, then had four other CTEs all reference it, then had the main query union the results of those last four.
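The pattern in the last sentence (one base CTE, several CTEs built on it, and a main query that unions them) might look roughly like the sketch below. It runs against SQLite through Python's sqlite3 module rather than DB/2, and the accounts table and the CTE names are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table standing in for the real account data.
    CREATE TABLE accounts (account_number INTEGER PRIMARY KEY, balance REAL);
    INSERT INTO accounts VALUES (1, -20.0), (2, 0.0), (3, 150.0), (4, 9000.0);
""")

query = """
WITH base_accounts AS (          -- written once, referenced three times
    SELECT account_number, balance FROM accounts
),
overdrawn AS (
    SELECT account_number, 'overdrawn' AS status
    FROM base_accounts WHERE balance < 0
),
zero_balance AS (
    SELECT account_number, 'zero' AS status
    FROM base_accounts WHERE balance = 0
),
large_balance AS (
    SELECT account_number, 'large' AS status
    FROM base_accounts WHERE balance >= 1000
)
SELECT * FROM overdrawn
UNION ALL SELECT * FROM zero_balance
UNION ALL SELECT * FROM large_balance
ORDER BY account_number
"""

rows = conn.execute(query).fetchall()
```

Account 3 matches none of the three filters, so it does not appear in the result.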

This can be done with:

DB/2
PostgreSQL
Oracle
MS SQL Server
MySQL (latest version; still kinda new)
probably others

But it goes a LONG way toward making the code cleaner, more legible, more DRY.

I've developed a "standard library" of CTEs that I can plug in to various queries, getting me off to a flying start on my new query. Some of them are starting to be embraced by other devs in my organization, too.

In time, it may make sense to turn some of these into views, such that this "standard library" is available without needing to copy / paste. But my CTEs end up getting tweaked, ever so slightly, for various needs, so I've not yet had a single CTE get used SO WIDELY, without mods, that it would be worth creating a view.

It would seem that part of your gripe is "why don't I know about CTEs?" or "why doesn't my DB support CTEs?"

As for updates ... yeah, you can use CTEs but, in my experience, you have to use them inside the set clause AND in the where clause. It would be nice if you could define one or more ahead of the whole update statement and then just have the "main query" parts in the set / where clauses but it doesn't work that way. And there's no avoiding obscure table / field names on the table you're updating.

You can use CTEs for deletes. It may take multiple CTEs to determine the PK / FK values for records you want to delete from that table. Again, you can't avoid obscure table / field names on the table you're modifying.

Insomuch as you can do a select into an insert, you can use CTEs for inserts. As always, you may be dealing with obscure table / field names on the table you're modifying.
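As a rough illustration of the delete and insert cases, here is a sketch using SQLite via Python's sqlite3 module. SQLite happens to accept a leading WITH clause on INSERT and DELETE statements; other engines may place the CTE differently, so treat the exact syntax as an assumption about SQLite only. The tables accts and closed_accts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables with terse names, as in the examples above.
    CREATE TABLE accts (acct_no INTEGER PRIMARY KEY, cr_bal REAL);
    CREATE TABLE closed_accts (acct_no INTEGER PRIMARY KEY, cr_bal REAL);
    INSERT INTO accts VALUES (1, 0.0), (2, 35.5), (3, 0.0);
""")

# INSERT ... SELECT fed by a CTE that gives the columns readable names.
conn.execute("""
    WITH zero_balance AS (
        SELECT acct_no AS account_number, cr_bal AS current_balance
        FROM accts
        WHERE cr_bal = 0
    )
    INSERT INTO closed_accts (acct_no, cr_bal)
    SELECT account_number, current_balance FROM zero_balance
""")

# DELETE driven by a CTE that works out which keys to remove.
# The target table still has to be addressed by its real column names.
conn.execute("""
    WITH zero_balance AS (
        SELECT acct_no AS account_number FROM accts WHERE cr_bal = 0
    )
    DELETE FROM accts
    WHERE acct_no IN (SELECT account_number FROM zero_balance)
""")

remaining = conn.execute("SELECT acct_no FROM accts ORDER BY acct_no").fetchall()
archived = conn.execute("SELECT acct_no FROM closed_accts ORDER BY acct_no").fetchall()
```

Note that the CTE has to be repeated in each statement, since a CTE only lives for the statement it is attached to. That repetition is the copy / paste cost mentioned earlier.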

SQL does NOT let you create the equivalent of a domain object, wrapping a table, with getters / setters. For that, you will need to use an ORM of some kind, along with a more procedural / OO programming language. I've written things of this nature in Java / Hibernate.

Other tips

Locking down the creation of database views is often done by organizations paranoid of performance problems in the database. This is an organizational culture issue, rather than a technical issue with SQL.

Beyond that, large monolithic SQL queries are written many times, because the use case is so specific that very little of the SQL code can be truly reused in other queries. If a complex query is needed, it is usually for a much different use case. Copying the SQL from another query is often a starting point, but due to the other sub queries and JOINs in the new query, you end up modifying the copied SQL just enough to break any sort of abstraction that a "function" in another language would be used for. Which brings me to the most important reason why SQL is hard to refactor.

SQL only deals with concrete data structures, not abstract behavior (or an abstraction in any sense of the word). Since SQL is written around concrete ideas, there is nothing to abstract away into a reusable module. Database views can help with this, but not to the same level as a "function" in another language. A database view isn't so much an abstraction as it is a query. Well, actually, a database view is a query. It's essentially used like a table, but executed like a sub query, so again, you are dealing with something concrete, not abstract.

It is with abstractions that code becomes easier to refactor, because an abstraction hides implementation details from the consumer of that abstraction. Straight SQL provides no such separation, although procedural extensions to SQL like PL/SQL for Oracle or Transact-SQL for SQL Server start to blur the lines a little.

The thing that I think you might be missing from your question / point of view is that SQL executes operations on sets (using set operations etc.).

When you operate on that level you, naturally, give up some control to the engine. You can still force some procedural-style code using cursors, but as experience shows, 99 times out of 100 you shouldn't be doing so.

Refactoring SQL is possible, but it doesn't use the same refactoring principles we're used to in application-level code. Instead you optimize how you use the SQL engine itself.

This can be done in various ways. If you're using Microsoft SQL Server you can use SSMS to provide you with an approximate execution plan and you can use that to see which steps you can do to tune your code.

In the case of splitting code out into smaller modules, as @greg-burghardt mentioned, SQL is generally a purpose-built piece of code and, as a result, it does the one thing you need it to do and nothing else. It adheres to the S in SOLID: it has only one reason to be changed, and that's when you need that query to do something else. The rest of the acronym (OLID) doesn't apply here (AFAIK there's no dependency injection, interfaces or dependencies as such in SQL). Depending on the flavor of SQL you're using, you might be able to extend certain queries by wrapping them in a stored procedure / table function or using them as sub-queries, so I'd say the open-closed principle would still apply, in a way. But I digress.

I think you need to shift your paradigm in terms of how you're viewing SQL code. Due to its set-based nature, it can't provide a lot of the features application-level languages can (generics etc.). SQL was never designed to be anything like that; it's a language to query sets of data, and each set is unique in its own way.

That being said, there are ways in which you can make your code look nicer, if readability is a high priority within the organization. Storing bits of frequently used SQL blocks (common data sets that you use) into stored procedures / table value functions and then querying and storing them in temporary tables / table variables, followed by using those to join up the pieces together into the one massive transaction that you'd otherwise write is an option. IMHO it's not worth doing something like that with SQL.
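For readers who do want to try that approach, a minimal sketch follows. It uses a SQLite temporary table through Python's sqlite3 module as a stand-in for the temporary tables / table variables mentioned above; the employees table and dept_avg name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data set.
    CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary REAL);
    INSERT INTO employees VALUES
        (1, 'eng', 100.0),
        (2, 'eng', 140.0),
        (3, 'ops', 90.0);

    -- Step 1: store a frequently used data set once, under a name.
    CREATE TEMP TABLE dept_avg AS
        SELECT dept, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept;
""")

# Step 2: the final query joins against the stored piece
# instead of embedding the aggregation inline.
above_avg = conn.execute("""
    SELECT e.id
    FROM employees AS e
    JOIN dept_avg AS d ON d.dept = e.dept
    WHERE e.salary > d.avg_salary
    ORDER BY e.id
""").fetchall()
```

The trade-off is that the intermediate result is materialised, so the engine can no longer optimise the two steps as a single query.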

As a language it's designed to be easily readable and understandable by anyone, even non-programmers. As such, unless you're doing something very clever, there's no need to refactor SQL code into smaller bite-size pieces. I've, personally, written massive SQL queries whilst working on a data warehouse ETL / Reporting solution and everything was still very clear in terms of what was going on. Anything that might have looked a bit weird to anyone else would get a brief set of comments alongside it to explain it.

I hope this helps.

I'm going to focus on the "subqueries" in your example.

Why are they used so often? Because they use the natural way of thinking of a person: I have this set of data, and want to do an action on a subset of it and join that with a subset of other data. 9 out of 10 times that I see a subquery, it's used wrong. My running joke about subqueries is: people who are afraid of joins use subqueries.

If you see such subqueries it's also often a sign of non-optimal database design.

The more normalized your database is, the more joins you get; the more your database looks like a big Excel sheet, the more subselects you get.

Refactoring in SQL is often with a different goal: get more performance, better query times, "avoiding table scans". Those may even make the code less readable but are very valuable.

So why do you see so many huge monolithic non-refactored queries?

SQL, in many ways, is not a programming language.
Bad database design.
People not really fluent in SQL.
No power over the database (for instance not being allowed to use views)
Different goals with refactoring.

(For me, the more experienced I get with SQL, the smaller my queries get. SQL has ways for people of all skill levels to get their jobs done no matter what.)

Segregation of duties

In the SQL spirit, the database is a shared asset that contains the company's data, and protecting it is of vital importance. Enter the DBA as guardian of the temple.

Creating a new view in the database is understood to serve a lasting purpose and to be shared by a community of users. In the DBA's view, this is acceptable only if the view is justified by the structure of the data. Every change to a view then carries risks for all its current users, even those who are not using the application but have discovered the view. Finally, creating new objects requires managing authorisations and, in the case of a view, keeping them consistent with the authorisations of the underlying tables.

All this explains why DBAs don't like adding views that are just for the code of some individual application.

SQL design

If you decompose one of your nice complex queries, you might find that the subqueries often need a parameter that depends on another subquery.

So transforming subqueries into views is not necessarily as simple as stated. You must isolate the variable parameters, and design your view so that the parameters can be added as selection criteria on the view.

Unfortunately, in doing so, you sometimes force the query to access more data, and less efficiently, than a tailored query would.
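A small sketch of what "isolating the parameter" can mean in practice, using SQLite through Python's sqlite3 module. The sales table and the daily_region_sales view are hypothetical. Because a plain view cannot take arguments, the view below has to expose sale_date as a column so that callers can filter on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data.
    CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', '2021-01-05', 10.0),
        ('north', '2021-02-10', 20.0),
        ('south', '2021-01-20', 5.0);

    -- The date range is the variable parameter, so the view keeps
    -- sale_date in its output (and in its GROUP BY) for callers to filter on.
    CREATE VIEW daily_region_sales AS
        SELECT region, sale_date, SUM(amount) AS total
        FROM sales
        GROUP BY region, sale_date;
""")

date_range = ("2021-01-01", "2021-02-01")

# Tailored query: filters first, then aggregates exactly what it needs.
tailored = conn.execute("""
    SELECT region, SUM(amount)
    FROM sales
    WHERE sale_date >= ? AND sale_date < ?
    GROUP BY region
    ORDER BY region
""", date_range).fetchall()

# View-based query: the parameter becomes a selection criterion on the view,
# and the per-day totals have to be aggregated a second time.
via_view = conn.execute("""
    SELECT region, SUM(total)
    FROM daily_region_sales
    WHERE sale_date >= ? AND sale_date < ?
    GROUP BY region
    ORDER BY region
""", date_range).fetchall()
```

Both queries return the same rows, but the view version groups at a finer grain than the caller needs. Whether the engine can push the date filter down into the view depends on the optimizer, which is the efficiency concern raised above.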

Proprietary extensions

You could hope to do some refactoring by transferring some responsibilities to procedural extensions of SQL, like PL/SQL or T-SQL. However, these are vendor-dependent and create a technological dependency. In addition, these extensions execute on the database server, creating more processing load on a resource that is much more difficult to scale than an application server.

But what's the problem in the end?

Finally, are segregation of duties and the SQL design, with its strengths and limitations, a real problem? In the end, these databases have proved able to handle very critical data successfully and reliably, including in mission-critical environments.

So in order to achieve a successful refactoring:

Consider better communication: try to understand your DBA's constraints. If you prove to a DBA that a new view is justified by the data structures, that it is not a throw-away workaround, and that it doesn't have a security impact, he/she will certainly agree to let it be created, because then it would be a shared interest.
Clean your own house first: nothing forces you to generate a lot of SQL in a lot of places. Refactor your application code to isolate the SQL accesses, and create classes or functions that provide reusable subqueries, if these are frequently used.
Improve team awareness: make sure that your application is not performing tasks that could be performed more efficiently by the DBMS engine. As you rightly pointed out, the procedural approach and the data-oriented approach are not equally mastered by different members of the team. It depends on their background. But in order to optimize the system as a whole, your team needs to understand it as a whole. So create awareness, so that less experienced players do not reinvent the wheel and do share their DB thoughts with more experienced members.
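The "clean your own house" suggestion could be sketched as follows: application functions that hand out a shared SQL fragment, so the subquery is written once in code rather than once per query. This is only one possible shape. The function names, the orders table, and the :since parameter are all assumptions made for the example, and SQLite via Python's sqlite3 module stands in for the real database.

```python
import sqlite3


def recent_orders_cte() -> str:
    """Reusable CTE fragment; expects a named parameter :since."""
    return (
        "recent_orders AS ("
        " SELECT customer_id, total FROM orders WHERE order_date >= :since"
        ")"
    )


def spend_per_customer_sql() -> str:
    # First consumer of the shared fragment.
    return (
        f"WITH {recent_orders_cte()} "
        "SELECT customer_id, SUM(total) FROM recent_orders "
        "GROUP BY customer_id ORDER BY customer_id"
    )


def order_count_sql() -> str:
    # Second consumer; the filter logic is not duplicated here.
    return f"WITH {recent_orders_cte()} SELECT COUNT(*) FROM recent_orders"


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data.
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, total REAL
    );
    INSERT INTO orders VALUES
        (1, 1, '2020-12-31', 10.0),
        (2, 1, '2021-01-15', 20.0),
        (3, 2, '2021-02-01', 7.5);
""")

params = {"since": "2021-01-01"}
spend = conn.execute(spend_per_customer_sql(), params).fetchall()
count = conn.execute(order_count_sql(), params).fetchone()[0]
```

Only fixed SQL text is composed here; the user-supplied value travels as a bound parameter, so the composition does not open an injection path. No new database objects are needed, which sidesteps the approval process for views.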

Re points 1 & 3: Views aren't the only way. There are also temporary tables, marts, table variables, aggregated columns, CTEs, functions, stored procedures and possibly other constructs depending on the RDBMS.

DBAs (and I'm speaking as someone who has been both DBA and developer) tend to view the world in a pretty binary way so are often against things like views and functions due to the perceived performance penalty.

Latterly, the need for complex joins has reduced with the recognition that denormalised tables, despite being sub-optimal from an NF point of view, are highly performant.

There is also the trend for doing queries client side with technologies like LINQ which you raise in point 2.

While I agree that SQL can be challenging to modularise, great strides have been made although there will always be a dichotomy between client side code and SQL - although 4GL has blurred the lines somewhat.

I guess it really depends on how far your DBAs/architects/tech leads are willing to cede in this regard. If they refuse to allow anything but vanilla SQL with lots of joins, huge queries could result. If you're stuck with this, don't bang your head on a brick wall, escalate it. There are generally better ways of doing things with a bit of compromise - especially if you can prove the benefits.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange