Question

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?

It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?

[Edit/Note] While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.

[Edit/Note 2] Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.


Solution

Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of how complex the arithmetic is).

The "control flow" in a SQL query is best related to "and" and "or" operators in query.

The "computational complexity" is best related to operators such as SUM or implicit JOINS.

Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
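As a very rough illustration of that idea, here is a minimal sketch in Python, assuming a crude regex tokenizer rather than a real SQL parser; the keyword list, the example query, and the choice to treat AND/OR/WHEN as the decision points are assumptions made purely for the example.

    import re

    def cyclomatic_sql(query: str) -> int:
        # Rough cyclomatic-style measure: 1 + number of decision points.
        # Decision points are approximated by AND/OR in predicates and the
        # WHEN branches of CASE expressions.  This is a sketch, not a parser:
        # it will also count AND/OR inside string literals or comments.
        tokens = re.findall(r"[A-Za-z_]+", query.upper())
        return 1 + sum(tokens.count(kw) for kw in ("AND", "OR", "WHEN"))

    def halstead_counts(query: str) -> dict:
        # Very rough Halstead-style counts: "operators" are SQL keywords and
        # arithmetic/comparison symbols, "operands" are everything else.
        keywords = {"SELECT", "FROM", "WHERE", "JOIN", "ON", "GROUP", "BY",
                    "HAVING", "ORDER", "SUM", "COUNT", "AVG", "MIN", "MAX",
                    "AND", "OR", "NOT", "CASE", "WHEN", "THEN", "ELSE", "END"}
        words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[+\-*/=<>]+", query.upper())
        operators = [w for w in words if w in keywords or not w[0].isalpha()]
        operands = [w for w in words if w not in operators]
        return {"N1": len(operators), "n1": len(set(operators)),  # total / distinct operators
                "N2": len(operands), "n2": len(set(operands))}    # total / distinct operands

    query = """SELECT c.name, SUM(o.total)
               FROM customers c JOIN orders o ON o.customer_id = c.id
               WHERE o.status = 'PAID' AND (o.total > 100 OR c.vip = 1)
               GROUP BY c.name"""
    print(cyclomatic_sql(query))   # 1 + one AND + one OR = 3
    print(halstead_counts(query))

With counts like these you can plug the numbers into the usual Cyclomatic or Halstead formulas; the point is only that the mapping from SQL syntax to "decisions" and "operators/operands" is a choice you make once, up front.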

What the SQL optimizer does to queries is, I think, absolutely irrelevant. The purpose of complexity measures is to characterize how hard it is for a person to understand the query, not how efficiently it can be evaluated.

Similarly, what the DDL says, or whether views are involved or not, shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of the machinery inside a used abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are those "invoked" abstractions.

Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long as they reflect some truth about complexity and you can compare them relative to one another. That way you can identify which SQL fragments are the most complex, sort them all, and focus your testing attention on the most complicated ones.

Other tips

I'm not sure that retrieving the query plans will answer the question: the query plans hide part of the complexity of the computation performed on the data before it is returned (or used in a filter), and they require a reasonably representative database to be relevant. In fact, complexity and length of execution are somewhat opposed; something like "Good, Fast, Cheap - Pick any two".

Ultimately, isn't it about the chances of making a mistake, or of not understanding the code you've written?

Something like the following (a rough code sketch of this scoring appears after the list):

  • number of tables, multiplied by the sum of:
    • 1
    • +1 per join expression (+1 per outer join?)
    • +1 per predicate after WHERE or HAVING
    • +1 per GROUP BY expression
    • +1 per UNION or INTERSECT
    • +1 per function call
    • +1 per CASE expression
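A rough sketch of that scoring in Python, using naive regex counting rather than a parser; the keyword lists, the regexes, and the way predicates are approximated (the WHERE/HAVING keyword itself plus each AND/OR) are assumptions made for the illustration.

    import re

    def score_query(sql: str) -> int:
        # Heuristic score: tables * (1 + joins + predicates + group-bys + ...).
        # All counts are regex-based approximations; a real implementation
        # would walk a parse tree instead.
        s = sql.upper()
        def count(pattern):
            return len(re.findall(pattern, s))

        tables     = max(1, count(r"\bFROM\b") + count(r"\bJOIN\b"))  # crude table count
        joins      = count(r"\bJOIN\b") + count(r"\b(LEFT|RIGHT|FULL)\s+(OUTER\s+)?JOIN\b")
        predicates = count(r"\b(WHERE|HAVING)\b") + count(r"\b(AND|OR)\b")
        group_bys  = count(r"\bGROUP\s+BY\b")
        set_ops    = count(r"\b(UNION|INTERSECT)\b")
        functions  = count(r"\b(SUM|COUNT|AVG|MIN|MAX|COALESCE|CAST)\s*\(")
        cases      = count(r"\bCASE\b")
        return tables * (1 + joins + predicates + group_bys + set_ops + functions + cases)

    print(score_query("SELECT * FROM t"))   # 1 * (1) = 1
    print(score_query(
        "SELECT a, SUM(b) FROM t JOIN u ON t.id = u.id "
        "WHERE a > 1 AND b < 2 GROUP BY a"))
    # 2 tables * (1 + 1 join + 2 predicate terms + 1 group by + 1 function) = 12

The absolute numbers mean little on their own; they only need to order queries sensibly relative to one another.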

Please feel free to try my script, which gives an overview of the stored procedure size, the number of object dependencies, and the number of parameters:

Calculate TSQL Stored Procedure Complexity

SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement; it will be a crude description of the steps the engine will use to execute your query.
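For example, here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN from Python; the exact statement and output format differ between engines (EXPLAIN in MySQL/PostgreSQL, EXPLAIN PLAN in Oracle, the graphical plan in SQL Server), and the schema and query here are made up for the example.

    import sqlite3

    # In-memory database with a made-up schema, just to have something to explain.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    """)

    query = """
        SELECT c.name, SUM(o.total)
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """

    # EXPLAIN QUERY PLAN returns one row per step the engine plans to take;
    # the number of steps is itself a (very crude) complexity indicator.
    plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    for row in plan:
        print(row)
    print("plan steps:", len(plan))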

Well, I don't know of any tool that does such a thing, but it seems to me that what would make a query more complicated would be measured by:

  • the number of joins
  • the number of where conditions
  • the number of functions
  • the number of subqueries
  • the number of casts to different datatypes
  • the number of case statements
  • the number of loops or cursors
  • the number of steps in a transaction

However, while it is true that the more complex queries might appear to be the ones with the most possible defects, I find that the simple ones are also very likely to contain defects, as they are more likely to be written by someone who doesn't understand the data model, and thus they may appear to work correctly but in fact return the wrong data. So I'm not sure such a metric would tell you much.

In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.

Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).

Not brilliant but a reasonable proxy for complexity in the absence of anything else I think.
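A sketch of that approach in Python, assuming the third-party sqlparse package (pip install sqlparse) is acceptable as the "consistent formatter"; any pretty-printer you trust would do.

    import sqlparse  # pip install sqlparse

    def loc_complexity(sql: str) -> dict:
        # Normalise the formatting first, then use line count and byte size
        # as crude proxies for complexity.
        formatted = sqlparse.format(sql, reindent=True, keyword_case="upper")
        lines = [ln for ln in formatted.splitlines() if ln.strip()]
        return {
            "lines": len(lines),
            "bytes": len(formatted.encode("utf-8")),  # fixed encoding so sizes compare
        }

    print(loc_complexity(
        "select a, b from t join u on t.id = u.id where a > 1 and b < 2 group by a"
    ))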

In programming languages we have several methods to compute the time complexity or space complexity.

Similarly, we could do the same with SQL: in a procedure you can count the number of lines and loops much as you would in a programming language. Unlike a program, though, where the cost usually depends only on the input, in SQL the cost depends both on the input and on the data in the tables/views being operated on, plus the overhead complexity of the query itself.

Take a simple row-by-row query, for example:

    SELECT * FROM table;
    -- This depends entirely on the number of records, say n, hence O(n)

    SELECT MAX(input) FROM table;
    -- Here MAX adds an extra overhead per row,
    -- therefore t * O(n), where t is the cost of evaluating MAX

Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:

  1. Apply a simple lexer to the query (like the ones used for syntax coloring in text editors or here on SO) to split the query into tokens and give each token a class:
    • SQL keywords
    • SQL function names
    • string literals with character escapes
    • string literals without character escapes
    • string literals which are dates or date+time
    • numeric literals
    • comma
    • parenthesis
    • SQL comments (--, /* ... */)
    • quoted user words
    • non quoted user words: everything else
  2. Give a score to each token, using different weights for each class (and different weights for different SQL keywords).
  3. Add the scores of each token.
  4. Done.

This should work quite well: for example, counting subqueries is essentially counting the number of SELECT and FROM keywords.
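A minimal sketch of that algorithm in Python, assuming a regex-based lexer and a made-up weight table (and merging the different string-literal sub-classes into one); the classes and weights are purely illustrative.

    import re

    # Token classes and an illustrative weight table (tune the weights to taste).
    TOKEN_SPEC = [
        ("comment",     r"--[^\n]*|/\*.*?\*/"),
        ("string",      r"'(?:[^']|'')*'"),
        ("number",      r"\b\d+(?:\.\d+)?\b"),
        ("keyword",     r"\b(?:SELECT|FROM|WHERE|JOIN|ON|GROUP|BY|HAVING|ORDER|UNION"
                        r"|CASE|WHEN|THEN|ELSE|END|AND|OR|NOT|IN|EXISTS)\b"),
        ("function",    r"\b(?:SUM|COUNT|AVG|MIN|MAX|COALESCE|CAST|GROUP_CONCAT)\b"),
        ("quoted_word", r'"[^"]*"|\[[^\]]*\]|`[^`]*`'),
        ("paren",       r"[()]"),
        ("comma",       r","),
        ("word",        r"\b\w+\b"),
    ]
    WEIGHTS = {"comment": 0, "string": 1, "number": 1, "keyword": 3, "function": 4,
               "quoted_word": 1, "paren": 1, "comma": 1, "word": 1}

    LEXER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
                       re.IGNORECASE | re.DOTALL)

    def lexical_complexity(sql: str) -> int:
        # Sum of per-token weights; sub-queries naturally score higher because
        # every extra SELECT/FROM/JOIN contributes its keyword weight.
        return sum(WEIGHTS[m.lastgroup] for m in LEXER.finditer(sql))

    print(lexical_complexity("SELECT a FROM t WHERE a IN (SELECT b FROM u)"))
    # -> 25 with these example weights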

By using this algorithm with different weight tables you can even measure the complexity in different dimensions, for example to get a more nuanced comparison between queries, or to score higher the queries which use keywords or functions specific to a particular SQL engine (e.g. GROUP_CONCAT on MySQL).

The algorithm can also be tweaked to take into account the case of SQL keywords: increase the complexity if they are not consistently upper case. Or to account for indentation (line breaks, position of keywords on a line).

Note: I was inspired by @redcalx's answer, which suggested applying a standard formatter and counting lines of code. My solution is simpler, however, as it doesn't need to build a full AST (abstract syntax tree).

Well, if you are using SQL Server I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).

Here is a link that goes over some of the things you should look at in the execution plan.

Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.

SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).

A good question. The problem is that for a SQL query like:

SELECT * FROM foo;

the complexity may depend on what "foo" is and on the database implementation. For a function like:

int f( int n ) {
   if ( n == 42 ) {
      return 0;
   }
   else {
      return n;
   }
}

there is no such dependency.

However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.

It's reasonable enough to consider the complexity to be what it would be if you coded the query yourself. If the table has N rows, then:

  1. A simple SELECT would be O(N)
  2. An ORDER BY is O(N log N)
  3. A JOIN is O(N*M)
  4. A DROP TABLE is O(1)
  5. A SELECT DISTINCT is O(N^2)
  6. A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )
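If you wanted to mechanise that style of estimate, a toy sketch in Python might look like this; the clause-to-cost table simply encodes the list above, and the detection is naive keyword matching on an example query.

    import re

    # Rough cost labels for individual constructs, taken from the list above.
    CLAUSE_COSTS = [
        (r"\bSELECT\s+DISTINCT\b",              "O(N^2)"),
        (r"\bORDER\s+BY\b",                     "O(N log N)"),
        (r"\bJOIN\b",                           "O(N*M)"),
        (r"\bNOT\s+IN\b|\bIN\s*\(\s*SELECT\b",  "O(N_outer * N_inner)"),
        (r"\bDROP\s+TABLE\b",                   "O(1)"),
        (r"\bSELECT\b",                         "O(N)"),
    ]

    def rough_costs(sql: str) -> list:
        # Return the cost label of every construct found in the query.
        s = sql.upper()
        return [cost for pattern, cost in CLAUSE_COSTS if re.search(pattern, s)]

    print(rough_costs("SELECT DISTINCT a FROM t JOIN u ON t.id = u.id ORDER BY a"))
    # ['O(N^2)', 'O(N log N)', 'O(N*M)', 'O(N)']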
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow