Question

I'm designing an application which will in an early stage collect data A, B, and C from clients, but later on will instead collect data A, B, and D.

A, B, C, and D are very related and right now exist as columns of a single database PostgreSQL table T.

Once C is no longer needed, I want to remove its references from my application (I use the Django ORM), but I want to keep the data that was already entered. What is the best way to do so?

I've thought of creating a new table for ABD, but that means that might cause issues with any rows referencing table T.

I could just leave column C along, and remove references to it in the code, allowing the existing data to survive.

Is there a better option I'm not seeing?

Some extra details:

The number of rows will not be big, most likely 1-2 per user. This is a mass market application, but by the time I switch from C to D, the userbase will not be very large yet. C and D will likely not be collected at the same time, although that is a possibility. C and D likely represent multiple columns each, not just one each.

Was it helpful?

Solution

If you want to keep the data, then it's not obsolete. Just leave it where it is. It's fine if some class mapped to a table doesn't map every column.

OTHER TIPS

OK so your situation is that you want the old rows to have property C but the new ones not.

This is equivalent to having a class inheritance relationship

class All
{
    string A;
    string B;
}

class Old : All
{
    string C;
}

class New : All
{
    string D;
}

which you would represent on the database with three tables with 1 to 1 relations

table All
    id varchar
    A varchar
    B varchar

table Old
    id varchar
    C  varchar

table New
    id varchar
    D  varchar

So you could create a migration script to create the new Old table, copy the id and C data to it and remove the C column from the All table.

Updating your code as required with the new sql;

Alternatively, If you just require to be able to query the old C data you could make a new Archive table with A,B,C copy all the data and remove the C column, add the D col to your 'Live' table

If data storage could be a concern, then split the tables: key/A/B key/C key/D

You can perform access either via a view (definition of the data location in the db) or via changing the ORM definition.

This is not the most performant (a join is involved), but it can present any combination of A/B/C/D over time without changing the underlying storage & depending on your real access patterns it may be sufficient.

You may not be fortunate with the ability to take downtime, restructure tables etc. in a production system.

Performing the access via the view allows you to switch from A/B/C to A/B/C/D to A/B/D in the underlying table with minimal change and no data movement. A view will be transparent to the read logic and if your dbms supports either functions or updateable views then transparent to the write logic as well.

Really I think your decision will reflect lots of the real-world concerns: 1) what are datatypes C & D 2) the relative data volumes collected for C/D 3) Relative overlap of C/D data compared to purely C or D entries 4) Downtime/maintenance window availability and duration 5) DBMS Support for updateable views 6) Desirability of keeping db physical structure details in the ORM vs making it transparent by presenting via views/functions in the db (where it is the same for all accessing applications, not just the current one)

My answer preferred for large/complex datatypes for (1), little overlap for (3) and minimal downtime for (4), ideally with good dbms support in (5) and multiple applications accessing the data in (6)

But there is no right/wrong for lots of alternativeS: - start with A/B/C, later add D, adjusting ORM, still later drop column C - start with A/B/C/D & ignore nulls etc. I think, consider your solution & what you know of its intended purpose/lifecycle, do some size/volume modelling & expect to change things later as not everything will turn our as expected.

Removing references & orphaning the data is a low risk option.

There are always possible unknown 'backdoor' uses of the data which may or may not be important to expose by removing the column.

Depending on the content of column C there could be a minor performance issue when the DB internally does full table scans or attempts to pull the whole table into memory during joins if the optimizer sees this as being more efficient than using indexes.

Applications may be reading the whole table a times rather than selected columns - but if you're using an ORM exclusively then this is unlikely.

Many things to consider here but you might want to consider adding a view to overlay the table rather than making changes to the table directly. That way, it is only the view that needs to change.

I don't know Django ORM, but it could be a possibility.

  • You have a Table A with columns a, b, c.
  • Create a new Table B with columns a, b, d.
  • Migrate your data to Table B.
  • Move your foreign keys to table A to table B.

You can now use Table B and you still have your old data for reference.

Licensed under: CC-BY-SA with attribution
scroll top