Normalizing a table: finding unique columns over series of rows (Oracle 10.x)

https://stackoverflow.com/questions/3819810

26-09-2019
|

Question

I have a table with the following structure:

WorkerPersons
-------------------------------
ID          (PK)
PersonID    (Indicates which version of Person the record describes)
SomeColumn1 (data specific to Worker)
SomeColumn2 (data specific to Person)
....
SomeColumnN
-------------------------------

As you can see, it's a denormalized table, which holds both Worker and Person (and many versions of one Person) data in one table. My wish is to normalize that table, however, as the table holds a lot of data (many many columns), I need to be sure which columns should go to Workers table and which columns to Persons table. The outcome should be like this:

Workers                 Persons
----------------------- ---------------------
ID                      ID
PersonID (now a FK)     PersonColumn1
WorkerColumn1           PersonColumn2
WorkerColumn2           ...
...                     PersonColumnN
WorkerColumnN
----------------------- ---------------------

To do that, I need to analyze which data differs in scope of Person over all unique Persons (wich are separated by PersonID in WorkerPersons). For example:

WorkerPersons
-------------------------------------------------------
ID      PersonID      Column1      Column2      Column3
-------------------------------------------------------
1       PersonA       10.1         John Doe     Single
2       PersonA       10.1         John Doe     Single
3       PersonA       10.1         John Doe     Married
4       PersonB       09.2         Sully        Single
5       PersonB       09.2         Sullivan     Single

In this case, there are 3 versions on PersonA and 2 versions of PersonB. Column1 values are always the same over all versions of Person, and we can move that column to table Worker. But Column 2 and Column3 values change over different versions of Person, so those values should be moved to Person table.

No imagine, I have about 10 tables like this that need to be normalized, with about 40 columns in each. Eeach table holds about 500k to 5m rows.

I need a script that helps me analyse which columns to move where. I need a script that outputs all columns that change in scope of unique Person over the whole table. I've no ideas however how to do that. I experimented with LAG analytical function to compare against the next row but how in the world to output changed columns is beyond me.

Please advise.

Best wishes, Andrew

Solution

Since 10 tables is not a lot, here is (some sort of) pseudo code

for each table_name in tables
  for each column_name in columns
    case (exists (select 1
          from table_name
          group by PersonID
          having min(column_name) = max(column_name))
       when true then 'Worker'
       when false then 'Person'
    end case
  end for
end for

with information schema and dynamic queries you could make the above proper PL/SQL or take the core query and script it in your favourite language.

EDIT: The above assumes no NULLs in column_name.

EDIT2: Other variants of the core query can be

SELECT 1
FROM 
(SELECT COUNT(DISTINCT column_name) AS distinct_values_by_pid
FROM table_name
GROUP BY PersonID) T
HAVING MIN(distinct_values_by_pid) = MAX(distinct_values_by_pid)

Which will return a row if all values per PersonID are the same. (this query also has problems with NULLS, but I consider NULLs a separate issue; you can always cast a NULL to some out-of-domain value for purposes of the above query)

The above query can be also written as

SELECT MIN(c1)=MAX(c1), MIN(c2)=MAX(c2), ...
FROM 
(SELECT COUNT(DISTINCT column_name_1) AS c1, COUNT(DISTINCT column_name_2) AS c2, ...
FROM table_name
GROUP BY PersonID) T

Which will test multiple columns at the same time returning true for columns that belong to 'Workers' and false for columns that should go into 'Persons'.

OTHER TIPS

Thanks, but I solved it by letting Excel create series of selects over table schema information. The final query that it generated was a long list of selects but it works (although it run over a hour). The "core query" (actually a formula in Excel to create to core query):

=IF(AND(C17<>"CLOB";C17<>"NCLOB");"SELECT '"&A17&".'||initcap('"&B17&"') description,
decode(count(*),0,'SAME OVE VERSIONS','DIFFERENT OVER VERSIONS') values FROM (SELECT 
objektid, count(DISTINCT nvl("&B17&","&IF(C17="DATE";"'01.02.0004'";IF(C17="VARCHAR2"
;"'!#¤¤%¤(%#¤%AS'";"-1234561"))&")) OVER (PARTITION BY objectid) arv FROM "&A17&") 
WHERE number > 1 union all";"SELECT '"&A17&".'||initcap('"&B17&"') description, 'CLOB
field' values from dual union all")

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow