Data storage format: byte array alternatives?

https://stackoverflow.com/questions/17172588

01-06-2022

Question

I have a desktop app that has the concept of an entity called Field.

-----------------------
|   Id    | FieldName |
-----------------------
|    1    | "Field 1" |
-----------------------
|    2    | "Field 2" |
-----------------------

Fields are defined by the user, so there can be as many of them as the user wants. They are associated with another entity called Employee.

Fields have a value (a 16-bit integer calculated and stored by the app) for each day of the year.

Field values are stored in a table where each record holds the values for one full year of one Employee of one Field.

Said table, therefore, looks a bit like this:

---------------------------------------------
| FieldId | EmployeeId | FieldValues | Year |
---------------------------------------------
|    1    |      4     |    byte[]   | 2012 |
---------------------------------------------
|    2    |      4     |    byte[]   | 2012 |
---------------------------------------------
|    1    |      5     |    byte[]   | 2013 |
---------------------------------------------
|   ...   |     ...    |     ...     |  ... |
---------------------------------------------

FieldValues holds the values as a byte array in a BLOB field, which is then converted back to an array of 16-bit integers before being shown to the user on a grid.
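For illustration, the conversion between the BLOB and the integer array might look like the sketch below. The question does not state the byte order or signedness, so little-endian signed 16-bit values are an assumption here, and the function names are hypothetical.

```python
import struct
from typing import List


def decode_field_values(blob: bytes) -> List[int]:
    """Decode a BLOB into 16-bit integers.

    Assumption: values are signed and little-endian ("<h"); the real
    legacy format may differ.
    """
    if len(blob) % 2 != 0:
        raise ValueError("BLOB length must be a multiple of 2 bytes")
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}h", blob))


def encode_field_values(values: List[int]) -> bytes:
    """Pack 16-bit integers back into the same (assumed) byte layout."""
    return struct.pack(f"<{len(values)}h", *values)


# One non-leap year of values occupies 365 * 2 = 730 bytes.
year_blob = encode_field_values([0] * 365)
```

Under these assumptions a full year fits in 730 bytes (732 in a leap year), which is what makes the one-record-per-year layout so compact.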

Now that we have a bit of context, the real question.

This is a legacy app, and I am not the original designer. It's easy to guess, though, that the goal of storing this data in a binary format was to limit the number of records that would otherwise be necessary to store 365 (or 366) values per year per Employee per Field.

What I'm building now is a "sync" app that pulls this data from a local Access db (don't ask) and pushes it via a REST API to a web app on a remote server. The web app needs its own copy of this data, so I'll have to store it in its database.

Storing data in a binary format has the clear advantage of really limiting the number of records we need to store, but the disadvantage of being human-unreadable.

On the other hand, the web app is multi-tenant, so storing this data in any other way would mean storing a great number of records: just a couple thousand Employees and an average of 20 Fields would mean storing upwards of 14 million records each year (and Fields are not the only entity that could generate millions of records). Plus, a large number of records per year wouldn't be a problem per se if somewhere down the road, say every two or three years, we could throw them away; that, however, is not the case.

The real question, then, is how to store said data. Should I stick to the old format?

Can anyone think of a whole different way of going about it?

For the sake of completeness, even though I don't think it matters much, the destination db is Postgres.

Solution

You should, if at all possible, properly normalize this data.

Here are some reasons.

> Storing data in a binary format has the clear advantage of really limiting the number of records we need to store, but the disadvantage of being human-unreadable.

There are other disadvantages that you're missing, including increased concurrency contention, since changing a single value means writing the whole array back. None of the queries against this data are going to be SARGable (no index can be used to search on individual values), and you can't constrain this data at the db level. Basically, you get all the problems that come with violating 1NF.
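As a rough sketch of what the normalized layout buys you, the example below stores one row per Field, Employee, and day, then updates a single day, runs an indexable range query, and shows a constraint rejecting a bad value. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; the table name, column names, and sample data are hypothetical, and on Postgres the columns would more naturally be `date` and `smallint`.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE field_value (
        field_id    INTEGER NOT NULL,
        employee_id INTEGER NOT NULL,
        day         TEXT    NOT NULL,  -- ISO date; a real DATE column on Postgres
        value       INTEGER NOT NULL
                    CHECK (value BETWEEN -32768 AND 32767),  -- assumed signed 16-bit
        PRIMARY KEY (field_id, employee_id, day)
    )
""")


def rows_for_year(field_id, employee_id, year, values):
    """Turn one year's array of values into one row per day."""
    start = date(year, 1, 1)
    for offset, value in enumerate(values):
        day = (start + timedelta(days=offset)).isoformat()
        yield (field_id, employee_id, day, value)


values_2012 = [d % 10 for d in range(366)]  # made-up data for a leap year
with conn:
    conn.executemany(
        "INSERT INTO field_value VALUES (?, ?, ?, ?)",
        rows_for_year(1, 4, 2012, values_2012),
    )

# Changing one day touches one row, not the whole year's array.
with conn:
    conn.execute(
        "UPDATE field_value SET value = ? "
        "WHERE field_id = ? AND employee_id = ? AND day = ?",
        (42, 1, 4, "2012-03-15"),
    )

# A range predicate on the key columns can use the primary-key index.
march_high = conn.execute(
    "SELECT day, value FROM field_value "
    "WHERE field_id = 1 AND employee_id = 4 "
    "AND day BETWEEN '2012-03-01' AND '2012-03-31' AND value > 8 "
    "ORDER BY day"
).fetchall()

# The database itself rejects values that do not fit in 16 bits.
try:
    conn.execute("INSERT INTO field_value VALUES (1, 4, '2013-01-01', 70000)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

None of these three operations can be expressed against the BLOB column without reading and decoding the whole array in application code.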

> Plus, a large number of records per-year wouldn't be a problem per se if somewhere down the road, say every two or three years, we could throw them away; that, however, is not the case.

I can't think of a valid reason why you can't have a data retention policy. It's very dangerous not to have one.

> On the other hand, the web app is multi-tenant, so storing this data in any other way would mean storing a great number of records: just a couple thousand Employees and an average of 20 Fields would mean storing upwards of 14 million records each year

That's not a lot of records. Also, it's typically the amount of data you're storing that becomes an issue first, and most of that is occupied by the data in FieldValues, not by the internal bookkeeping that the database has to do.
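To put numbers on the figures quoted from the question, here is a back-of-the-envelope calculation. It takes "a couple thousand" to mean 2,000 Employees (an assumption) and counts only the 2-byte values themselves; keys, per-row headers, and indexes are deliberately not modeled, so treat the byte figure as a lower bound on the value data rather than an on-disk size.

```python
EMPLOYEES = 2_000          # assumed reading of "a couple thousand"
FIELDS_PER_EMPLOYEE = 20   # average quoted in the question
DAYS_PER_YEAR = 365        # non-leap year
BYTES_PER_VALUE = 2        # one 16-bit integer

# Normalized layout: one row per Field, Employee, and day.
rows_per_year = EMPLOYEES * FIELDS_PER_EMPLOYEE * DAYS_PER_YEAR

# BLOB layout: one row per Field and Employee per year.
blob_rows_per_year = EMPLOYEES * FIELDS_PER_EMPLOYEE

# Raw value data is the same in both layouts.
payload_bytes = rows_per_year * BYTES_PER_VALUE

print(f"normalized rows per year: {rows_per_year:,}")
print(f"BLOB rows per year:       {blob_rows_per_year:,}")
print(f"raw value data per year:  {payload_bytes / 1_000_000:.1f} MB")
```

That is 14.6 million rows against 40,000, with the same 29.2 MB of raw values either way.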

Licensed under: CC-BY-SA with attribution
