Alternatives to concatenating strings or going procedural to prevent SQL query code repetition?

https://dba.stackexchange.com/questions/2710

16-10-2019
|

Pergunta

Disclaimer: Please bear with me as someone who only uses databases a tiny fraction of his work time. (Most of the time I do C++ programming in my job, but every odd month I need to search/fix/add something in an Oracle database.)

I have repeatedly needed to write complex SQL queries, both for ad-hoc queries and for queries built into applications, where large parts of the queries where just repeated "code".

Writing such abominations in a traditional programming language would get you in deep trouble, yet I (I) have yet been unable to find any decent technique to prevent SQL query code repetition.

Edit: 1st, I want to thank the answerers who provided excellent improvements to my original example. However, this question is not about my example. It's about repetitiveness in SQL queries. As such, the answers (JackP, Leigh) so far do a great job of showing that you can reduce repetitiveness by writing better queries. However even then you face some repetitiveness that apparently cannot be removed: This always nagged me with SQL. In "traditional" programming languages I can refactor quite a lot to minimize repetitiveness in the code, but with SQL it seems that there are no(?) tools that allow for this, except for writing a less repetitive statement to begin with.

Note that I have removed the Oracle tag again, as I would be genuinely interested whether there's no database or scripting language that allows for something more.

Here's one such gem that I cobbled together today. It basically reports the difference in a set of columns of a single table. Please skim through the following code, esp. the large query at the end. I'll continue below.

--
-- Create Table to test queries
--
CREATE TABLE TEST_ATTRIBS (
id NUMBER PRIMARY KEY,
name  VARCHAR2(300) UNIQUE,
attr1 VARCHAR2(2000),
attr2 VARCHAR2(2000),
attr3 INTEGER,
attr4 NUMBER,
attr5 VARCHAR2(2000)
);

--
-- insert some test data
--
insert into TEST_ATTRIBS values ( 1, 'Alfred',   'a', 'Foobar', 33, 44, 'e');
insert into TEST_ATTRIBS values ( 2, 'Batman',   'b', 'Foobar', 66, 44, 'e');
insert into TEST_ATTRIBS values ( 3, 'Chris',    'c', 'Foobar', 99, 44, 'e');
insert into TEST_ATTRIBS values ( 4, 'Dorothee', 'd', 'Foobar', 33, 44, 'e');
insert into TEST_ATTRIBS values ( 5, 'Emilia',   'e', 'Barfoo', 66, 44, 'e');
insert into TEST_ATTRIBS values ( 6, 'Francis',  'f', 'Barfoo', 99, 44, 'e');
insert into TEST_ATTRIBS values ( 7, 'Gustav',   'g', 'Foobar', 33, 44, 'e');
insert into TEST_ATTRIBS values ( 8, 'Homer',    'h', 'Foobar', 66, 44, 'e');
insert into TEST_ATTRIBS values ( 9, 'Ingrid',   'i', 'Foobar', 99, 44, 'e');
insert into TEST_ATTRIBS values (10, 'Jason',    'j', 'Bob',    33, 44, 'e');
insert into TEST_ATTRIBS values (12, 'Konrad',   'k', 'Bob',    66, 44, 'e');
insert into TEST_ATTRIBS values (13, 'Lucas',    'l', 'Foobar', 99, 44, 'e');

insert into TEST_ATTRIBS values (14, 'DUP_Alfred',   'a', 'FOOBAR', 33, 44, 'e');
insert into TEST_ATTRIBS values (15, 'DUP_Chris',    'c', 'Foobar', 66, 44, 'e');
insert into TEST_ATTRIBS values (16, 'DUP_Dorothee', 'd', 'Foobar', 99, 44, 'e');
insert into TEST_ATTRIBS values (17, 'DUP_Gustav',   'X', 'Foobar', 33, 44, 'e');
insert into TEST_ATTRIBS values (18, 'DUP_Homer',    'h', 'Foobar', 66, 44, 'e');
insert into TEST_ATTRIBS values (19, 'DUP_Ingrid',   'Y', 'foo',    99, 44, 'e');

insert into TEST_ATTRIBS values (20, 'Martha',   'm', 'Bob',    33, 88, 'f');

-- Create comparison view
CREATE OR REPLACE VIEW TA_SELFCMP as
select 
t1.id as id_1, t2.id as id_2, t1.name as name, t2.name as name_dup,
t1.attr1 as attr1_1, t1.attr2 as attr2_1, t1.attr3 as attr3_1, t1.attr4 as attr4_1, t1.attr5 as attr5_1,
t2.attr1 as attr1_2, t2.attr2 as attr2_2, t2.attr3 as attr3_2, t2.attr4 as attr4_2, t2.attr5 as attr5_2
from TEST_ATTRIBS t1, TEST_ATTRIBS t2
where t1.id <> t2.id
and t1.name <> t2.name
and t1.name = REPLACE(t2.name, 'DUP_', '')
;

-- NOTE THIS PIECE OF HORRIBLE CODE REPETITION --
-- Create comparison report
-- compare 1st attribute
select 'attr1' as Different,
id_1, id_2, name, name_dup,
CAST(attr1_1 AS VARCHAR2(2000)) as Val1, CAST(attr1_2 AS VARCHAR2(2000)) as Val2
from TA_SELFCMP
where attr1_1 <> attr1_2
or (attr1_1 is null and attr1_2 is not null)
or (attr1_1 is not null and attr1_2 is null)
union
-- compare 2nd attribute
select 'attr2' as Different,
id_1, id_2, name, name_dup,
CAST(attr2_1 AS VARCHAR2(2000)) as Val1, CAST(attr2_2 AS VARCHAR2(2000)) as Val2
from TA_SELFCMP
where attr2_1 <> attr2_2
or (attr2_1 is null and attr2_2 is not null)
or (attr2_1 is not null and attr2_2 is null)
union
-- compare 3rd attribute
select 'attr3' as Different,
id_1, id_2, name, name_dup,
CAST(attr3_1 AS VARCHAR2(2000)) as Val1, CAST(attr3_2 AS VARCHAR2(2000)) as Val2
from TA_SELFCMP
where attr3_1 <> attr3_2
or (attr3_1 is null and attr3_2 is not null)
or (attr3_1 is not null and attr3_2 is null)
union
-- compare 4th attribute
select 'attr4' as Different,
id_1, id_2, name, name_dup,
CAST(attr4_1 AS VARCHAR2(2000)) as Val1, CAST(attr4_2 AS VARCHAR2(2000)) as Val2
from TA_SELFCMP
where attr4_1 <> attr4_2
or (attr4_1 is null and attr4_2 is not null)
or (attr4_1 is not null and attr4_2 is null)
union
-- compare 5th attribute
select 'attr5' as Different,
id_1, id_2, name, name_dup,
CAST(attr5_1 AS VARCHAR2(2000)) as Val1, CAST(attr5_2 AS VARCHAR2(2000)) as Val2
from TA_SELFCMP
where attr5_1 <> attr5_2
or (attr5_1 is null and attr5_2 is not null)
or (attr5_1 is not null and attr5_2 is null)
;

As you can see, the query to generate a "difference report" uses the same SQL SELECT block 5 times (could easily be 42 times!). This strikes me as absolutely brain dead (I'm allowed to say this, after all I wrote the code), but I haven't been able to find any good solution to this.

If this would be a query in some actual application code, I could write a function that cobbles together this query as a string and then I'd execute query as a string.
- -> Building strings is horrible and horrible to test and maintain. If the "application code" is written in a language such as PL/SQL it feels so wrong it hurts.
Alternatively, if used from PL/SQL or its like, I'd guess there be some procedural means to make this query more maintainable.
- -> Unrolling something that can be expressed in a single query into procedural steps just to prevent code repetition feels wrong too.
If this query would be needed as a view in the database, then - as far as I understand - there would be no way other than to actually maintain the view definition as I posted above. (!!?)
- -> I actually had to do some maintenance on a 2-page view definition once that wasn't far off above statement. Obviously, changing anything in this view required a regexp text search over the view definition for whether the same sub-statement had been used in another line and whether it needed changing there.

So, as the title goes - what techniques are there to prevent having to write such abominations?

Solução

You are too modest - your SQL is well and concisely written given the task you are undertaking. A few pointers:

t1.name <> t2.name is always true if t1.name = REPLACE(t2.name, 'DUP_', '') - you can drop the former
usually you want union all. union means union all then drop duplicates. It might make no difference in this case but always using union all is a good habit unless you explicitly want to drop any duplicates.

if you are willing for the numeric comparisons to happen after casting to varchar, the following might be worth considering:

create view test_attribs_cast as 
select id, name, attr1, attr2, cast(attr3 as varchar(2000)) as attr3, 
       cast(attr4 as varchar(2000)) as attr4, attr5
from test_attribs;

create view test_attribs_unpivot as 
select id, name, 1 as attr#, attr1 as attr from test_attribs_cast union all
select id, name, 2, attr2 from test_attribs_cast union all
select id, name, 3, attr3 from test_attribs_cast union all
select id, name, 4, attr4 from test_attribs_cast union all
select id, name, 5, attr5 from test_attribs_cast;

select 'attr'||t1.attr# as different, t1.id as id_1, t2.id as id_2, t1.name, 
       t2.name as name_dup, t1.attr as val1, t2.attr as val2
from test_attribs_unpivot t1 join test_attribs_unpivot t2 on(
       t1.id<>t2.id and 
       t1.name = replace(t2.name, 'DUP_', '') and 
       t1.attr#=t2.attr# )
where t1.attr<>t2.attr or (t1.attr is null and t2.attr is not null)
      or (t1.attr is not null and t2.attr is null);

the second view is a kind of unpivot operation - if you are on at least 11g you can do this more concisely with the unpivotclause - see here for an example

I'm say don't go down the procedural route if you can do it in SQL, but...
Dynamic SQL is probably worth considering despite the problems you mention with testing and maintenance

--EDIT--

To answer the more general side of the question, there are techniques to reduce repetition in SQL, including:

Views - you know about that one :)
Common Table Expressions (see here for example)
individual features of the database such as decode (see Leigh's answer for how this can reduce repetition), window functions and hierarchical/recursive queries to name but a few

But you can't bring OO ideas into the SQL world directly - in many cases repetition is fine if the query is readable and well-written, and it would be unwise to resort to dynamic SQL (for example) just to avoid repetition.

The final query including Leigh's suggested change and a CTE instead of a view could look something like this:

with t as ( select id, name, attr#, 
                   decode(attr#,1,attr1,2,attr2,3,attr3,4,attr4,attr5) attr
            from test_attribs
                 cross join (select rownum attr# from dual connect by rownum<=5))
select 'attr'||t1.attr# as different, t1.id as id_1, t2.id as id_2, t1.name, 
       t2.name as name_dup, t1.attr as val1, t2.attr as val2
from t t1 join test_attribs_unpivot t2 
               on( t1.id<>t2.id and 
                   t1.name = replace(t2.name, 'DUP_', '') and 
                   t1.attr#=t2.attr# )
where t1.attr<>t2.attr or (t1.attr is null and t2.attr is not null)
      or (t1.attr is not null and t2.attr is null);

Outras dicas

Here is an alternative to the test_attribs_unpivot view provided by JackPDouglas ⁽⁺¹⁾ that works in versions before 11g and does fewer full table scans:

CREATE OR REPLACE VIEW test_attribs_unpivot AS
   SELECT ID, Name, MyRow Attr#, CAST(
      DECODE(MyRow,1,attr1,2,attr2,3,attr3,4,attr4,attr5) AS VARCHAR2(2000)) attr
   FROM TEST_ATTRIBS 
   CROSS JOIN (SELECT level MyRow FROM dual connect by level<=5);

His final query can be used unchanged with this view.

I often encounter the similar problem to compare two versions of a table for new, deleted or changed rows. Some month ago I published a solution for SQL Server using PowerShell here .

To adapt it to your problem, I first create two views to separate the original from the duplicate rows

CREATE OR REPLACE VIEW V1_TEST_ATTRIBS AS 
select * from TEST_ATTRIBS where SUBSTR(name, 1, 4) <> 'DUP_'; 

CREATE OR REPLACE VIEW V2_TEST_ATTRIBS AS 
select id, REPLACE(name, 'DUP_', '') name, attr1, attr2, attr3, attr4, attr5 from TEST_ATTRIBS where SUBSTR(name, 1, 4) = 'DUP_';

and then I check the changes with

SELECT 1 SRC, NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 FROM V1_TEST_ATTRIBS
MINUS
Select 1 SRC, NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 from V2_TEST_ATTRIBS
UNION
SELECT 2 SRC, NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 FROM V2_TEST_ATTRIBS
MINUS
SELECT 2 SRC ,NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 FROM V1_TEST_ATTRIBS
ORDER BY NAME, SRC;

From here I can find your original ids

Select NVL(v1.id, v2.id) id,  t.name, t.attr1, t.attr2, t.attr3, t.attr4, t.attr5 from
(
SELECT 1 SRC, NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 FROM V1_TEST_ATTRIBS
MINUS
Select 1 SRC, NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 from V2_TEST_ATTRIBS
UNION
SELECT 2 SRC, NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 FROM V2_TEST_ATTRIBS
MINUS
Select 2 SRC ,NAME, ATTR1, ATTR2, ATTR3, ATTR4, ATTR5 from V1_TEST_ATTRIBS
) t
LEFT JOIN V1_TEST_ATTRIBS V1 ON T.NAME = V1.NAME AND T.SRC = 1
LEFT JOIN V2_TEST_ATTRIBS V2 ON T.NAME = V2.NAME AND T.SRC = 2
ORDER by NAME, SRC;

BTW: MINUS and UNION and GROUP BY treat different NULL's as equal. Using these operations makes the queries more elegant.

Hint for SQL Server users: MINUS is named EXCEPT there, but works similar.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a dba.stackexchange