How should a table with two sets of almost duplicate column names be designed?

https://stackoverflow.com/questions/4400197

25-09-2019

Question

I have a table with around 40 columns. The only difference in the column names is that the last 20 all start with "B" before the column name. The table is used for comparison: the data in the first 20 columns is compared to the data in the last 20 columns.

I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?

EDIT: if it helps, we also use this data to find a matched cohort

Also note that performance is the main concern here. With the duplicated columns, retrieving the data is extremely fast.

Thanks!

Solution

Two possible architectures and a query tip.

1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.

2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
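As a rough illustration of option 1, here is a minimal sketch using Python's built-in sqlite3 module. The table name (measurement), the flag column (row_type) and the data columns (age, weight) are all made up for the example and stand in for the 20 real columns; your database engine and types will differ. The self-join pairs each "A" row with its "B" row so the two can be compared column by column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: one table; row_type flags which of the two sets a row belongs to.
    -- All names here are hypothetical placeholders.
    CREATE TABLE measurement (
        id       INTEGER NOT NULL,
        row_type TEXT    NOT NULL CHECK (row_type IN ('A', 'B')),
        age      INTEGER,
        weight   REAL,
        PRIMARY KEY (id, row_type)
    );
    INSERT INTO measurement VALUES
        (1, 'A', 34, 70.5), (1, 'B', 34, 70.5),
        (2, 'A', 51, 82.0), (2, 'B', 52, 82.0);
""")

# Pair each "A" row with the "B" row that shares its id, then keep the ids
# where at least one column differs. IS NOT is SQLite's NULL-safe inequality.
mismatched_ids = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM measurement AS a
    JOIN measurement AS b
      ON b.id = a.id AND b.row_type = 'B'
    WHERE a.row_type = 'A'
      AND (a.age IS NOT b.age OR a.weight IS NOT b.weight)
    ORDER BY a.id
""")]
print(mismatched_ids)
```

The composite primary key (id, row_type) keeps the lookup of either row for a given id on an index, which matters given the performance concern in the question. Whether it matches the speed of the 40-column layout is something you would have to measure on your own data.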

Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are ideal for comparing all columns and rows between two datasets for differences (EXCEPT) and matches (INTERSECT). You can use them fairly easily with either of the two structures, and they would work with your existing table as well (though the query might be fussier to write).
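To make the tip concrete, here is a small sketch against the two-table layout from option 2, again run through Python's sqlite3 module. The table names (primary_data, alternate_data) and columns (age, weight) are invented for the example. SQLite happens to support both operators; check your own engine's documentation for support and exact syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 2: two identical tables sharing a primary key (hypothetical names).
    CREATE TABLE primary_data   (id INTEGER PRIMARY KEY, age INTEGER, weight REAL);
    CREATE TABLE alternate_data (id INTEGER PRIMARY KEY, age INTEGER, weight REAL);
    INSERT INTO primary_data   VALUES (1, 34, 70.5), (2, 51, 82.0), (3, 40, 65.0);
    INSERT INTO alternate_data VALUES (1, 34, 70.5), (2, 52, 82.0), (3, 40, 65.0);
""")

# EXCEPT: rows in primary_data that have no identical row in alternate_data.
differences = conn.execute("""
    SELECT id, age, weight FROM primary_data
    EXCEPT
    SELECT id, age, weight FROM alternate_data
    ORDER BY id
""").fetchall()

# INTERSECT: rows that appear, column for column, in both tables.
matches = conn.execute("""
    SELECT id, age, weight FROM primary_data
    INTERSECT
    SELECT id, age, weight FROM alternate_data
    ORDER BY id
""").fetchall()

print(differences)
print(matches)
```

Both operators compare whole rows, so there is no need to write out a condition per column. With the single-table layout from option 1, the same queries apply if each side of the operator filters on the type flag and leaves the flag out of the select list.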

Licensed under: CC-BY-SA with attribution