Question

The problem

We need to store data in a table-like way, but we have very strict space constraints (~1 MB per table of 10k+ rows). We store data like this:

ID | reviews | factor | score | interval | etc.
---+---------+--------+-------+----------+-----
 1 |     244 |    2.4 |    10 |     4268 | ...

in a simple binary format: a one-dimensional array of bytes, where the byte offset of each row can be calculated simply from the row length, which is fixed.

There is only one function that ever reads this data (gets a row by its index), and only one function that appends a new row (to the end). Removing items from the table will never be required (the table is append-only). Both functions are covered with a decent amount of unit tests.
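
For illustration, here is a minimal Python sketch of such a fixed-row binary table; the field layout and the struct format are assumptions for the example, not our actual schema:

    import struct

    # Assumed layout: ID (uint32), reviews (uint32), factor (float32),
    # score (uint16), interval (uint32)
    ROW_FORMAT = "<IIfHI"
    ROW_SIZE = struct.calcsize(ROW_FORMAT)

    class Table:
        def __init__(self):
            self.data = bytearray()      # the one-dimensional array of bytes

        def row_count(self):
            return len(self.data) // ROW_SIZE

        def get_row(self, index):
            offset = index * ROW_SIZE    # fixed row length -> trivial offset calculation
            return struct.unpack_from(ROW_FORMAT, self.data, offset)

        def append_row(self, row_id, reviews, factor, score, interval):
            self.data += struct.pack(ROW_FORMAT, row_id, reviews, factor, score, interval)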

The problem is the following: we need to be able to quickly go through the rows ordered by different columns. In other words, we need ordered access to the data by at least two different columns.

A simple solution

To solve this problem, we would implement indexes that would, again, be chunks of binary data. My intuitive approach would be to create ordered structures that store only the index of each row in the original table:

factor_index        score_index
------------        -----------
          6                  2
          2                  1
          3                  6
          1                  4
          .                  .

The function that appends a new row to the table would have to be updated so that it updates the indexes as well.

EXAMPLE: To get the first item ordered by score, we just look up the first value in the index table for score (2) and get the corresponding row from the original table (the third row if we agree that the table is zero-indexed).
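
A rough sketch of this simple approach, building on the Table sketch above (the index is kept as a plain Python list here for brevity; in the real system it would be another chunk of binary data):

    import bisect

    class ColumnIndex:
        """Row positions of the main table, ordered by one column's value."""
        def __init__(self, table, column):
            self.table = table
            self.column = column    # position of the column inside a row tuple
            self.order = []         # row positions, sorted by that column's value

        def on_append(self, row_position):
            # Called by the append function right after a new row has been written.
            value = self.table.get_row(row_position)[self.column]
            keys = [self.table.get_row(i)[self.column] for i in self.order]
            self.order.insert(bisect.bisect_right(keys, value), row_position)

        def nth(self, n):
            # n-th row of the table when ordered by this column.
            return self.table.get_row(self.order[n])

Getting the first item ordered by score is then just score_index.nth(0).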

However, it was suggested that I take a different approach.

A more complex but presumably safer version

Instead of storing only the row indexes, we duplicate the ID field in each index table:

factor_index | ID        score_index | ID
-------------+---        ------------+---
          6  | 46                  2 |  8
          2  |  8                  1 | 14
          3  | 91                  6 | 46
          1  | 14                  4 | 60
          .  |  .                  . |  .

Then keep the original table sorted by ID, and use the indexes only as a starting position for a binary search in the original table.

The function that adds a new record will now have to do a binary search by ID to find where to insert the new row, and then update the indexes as well.
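
A rough sketch of that insert, again building on the Table sketch above (insert_row_sorted is a hypothetical helper; the ID is assumed to be the first field of a row):

    import struct

    def insert_row_sorted(table, row_tuple):
        # Binary search by ID (field 0) for the insert position.
        lo, hi = 0, table.row_count()
        while lo < hi:
            mid = (lo + hi) // 2
            if table.get_row(mid)[0] < row_tuple[0]:
                lo = mid + 1
            else:
                hi = mid
        offset = lo * ROW_SIZE
        # Splice the packed row into the byte array; everything after it shifts.
        table.data[offset:offset] = struct.pack(ROW_FORMAT, *row_tuple)
        # ...and every index still has to be updated as well.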

EXAMPLE: To get the first item ordered by score, we look up the first row in the index table for score (2, 8), and use the index (2) as a starting position for a binary search in the table. If the data is valid, we don't even need to do a binary search, because at position 2 we will find the row with ID 8. If, however, we find that the record at position 2 has a different ID, we continue with a binary search to find the right one, and log the error.
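
A sketch of that lookup, with the verification and fallback described above (score_index is assumed here to be a list of (position, ID) pairs, and Table is the earlier sketch):

    import logging

    def nth_by_score(table, score_index, n):
        position, expected_id = score_index[n]   # e.g. (2, 8) for the first entry
        row = table.get_row(position)
        if row[0] == expected_id:                # fast path: the index entry was valid
            return row
        # The index pointed at the wrong row: log it and fall back to a binary
        # search by ID over the main table, which is kept sorted by ID.
        logging.error("index out of sync: expected ID %d at position %d",
                      expected_id, position)
        lo, hi = 0, table.row_count()
        while lo < hi:
            mid = (lo + hi) // 2
            if table.get_row(mid)[0] < expected_id:
                lo = mid + 1
            else:
                hi = mid
        if lo < table.row_count() and table.get_row(lo)[0] == expected_id:
            return table.get_row(lo)
        raise KeyError(expected_id)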

The argument for this approach is that it will work even if the index points to the wrong row in the table.

I find it hard to believe, though, that this approach is really better, for the following reasons:

  • It requires a binary search, which can be a new source of bugs.
  • It requires the table to be kept in order, which implies a more complex insert (as opposed to a simple append).
  • It does not guard against the main table being out of order: if that happens, the binary search itself could fail to find the record at all.
  • It requires writing (and testing) code that is never even meant to be run.
  • It uses more storage than necessary.

The question

It is a very high priority for our application that the above piece of data is always valid. But does that justify writing more complex data structures and lookup mechanisms to guard against edge cases that may or may not happen? Shouldn't the time and effort instead be put into writing more robust test cases for a simpler version?
