Question

Though my post is similar to this, I still feel this is in a way different from that one.

I have 2 CSV files.

File A                                       File B
+-------------------------------------------------------------------+
| Name         | Country                     Name         | Country |
+-------------------------------------------------------------------+
| Ferrari      | Italy                       Jaguar       | British |
| Mercedes     | Germany                     Chevrolet    | America |
| Jaguar       | British                     Bugatti      | Italy   |
| Nissan       | Japan                       Tata         | India   |
| Chevrolet    | USA                         Nissan       | Japan   |
+-------------------------------------------------------------------+

The above is used for illustration only. In general I have a lot more rows and columns in both of the files but their structure is same.

I was asked to perform a row level comparison of all the columns efficiently. So, instead of doing it programatically, I proposed using HSQLDB to CREATE TEXT TABLE and SET SOURCE with the files respectively and then do a EXCEPT operation between the files. I did code it and it works like a charm. Below is the SQL part that I coded to achieve the same.

CREATE TABLE COMPARE_TABLE AS (SELECT SRC.*, 'SRC-TGT' compare_order FROM TABLEA SRC EXCEPT SELECT TGT.*, 'SRC-TGT' compare_order FROM TABLEB TGT) WITH DATA;
INSERT INTO COMPARE_TABLE SELECT TGT.*, 'TGT-SRC' compare_order FROM TABLEB TGT EXCEPT SELECT SRC.*, 'TGT-SRC' compare_order FROM TABLEA SRC;

This gave me a result table which looks like the following (considering the sample data above):

COMPARE TABLE

+------------+----------+---------------+
|   Name     | Country  | Compare_order |
+------------+----------+---------------+
| Ferrari    | Italy    | SRC-TGT       |
| Mercedes   | Germany  | SRC-TGT       |
| Chevrolet  | USA      | SRC-TGT       |
| Chevrolet  | America  | TGT-SRC       |
| Bugatti    | Italy    | TGT-SRC       |
| Tata       | India    | TGT-SRC       |
+------------+----------+---------------+

From here, I need to identify the reason for every row as to what did not match? Atleast, broadly, I would like to classify it into 3 categories:

  • New at Source
  • New at Target
  • Value Changed (If possible, which column(s)?)

Finally, I want my table to look like the below:

COMPARE_TABLE

+------------+----------+---------------+------------------------+
|   Name     | Country  | Compare_order |     Failure_Reason     |
+------------+----------+---------------+------------------------+
| Ferrari    | Italy    | SRC-TGT       | New at Source          |
| Mercedes   | Germany  | SRC-TGT       | New at Source          |
| Chevrolet  | USA      | SRC-TGT       | Country value mismatch |
| Chevrolet  | America  | TGT-SRC       | Country value mismatch |
| Bugatti    | Italy    | TGT-SRC       | New at Target          |
| Tata       | India    | TGT-SRC       | New at Target          |
+------------+----------+---------------+------------------------+

How do I go about doing this? Can we even do this in SQL?

Any help is greatly appreciated.

Was it helpful?

Solution

You can do the basic comparison as:

select name, country,
       (case when sum(which = 'src') > 0 and sum(which = 'tgt') then 'DROPPED'
             when sum(which = 'src') = 0 and sum(which = 'tgt') then 'NEW'
        end) as OP
from ((select 'src' as which, name, country
       from tableA
      ) union all
      (select 'tgt', name, country
       from tableB
      )
     ) ab
group by name, country;

But that doesn't give you column-by-column comparisons. That is a bit more difficult. Let me assume that Name is unique, so it can be used as a key. The following does the comparison, but produces one row per name:

select name,
       (case when src.country is null then dest.country
             when tgt.country is null then tgt.country
             when src.country = tgt.country then dest.country
             else (src.country, '-->', tgt.country)
        end) as country,
       (case when src.country is null then 'new'
             when tgt.country is null then 'dropped'
             when src.country = tgt.country then 'same'
             else 'changed'
        end) as country,
from (select name from TableA union select name from TableB
     ) names left outer join
     TableA src
     on names.name = src.name left outer join
     TableB tgt
     on names.name = tgt.name;

Getting multiple rows for each name when the value in a column changes seems more difficult, although it too would be possible.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top