Question

Simplified example: two tables, people and times. The goal is to keep track of every time a person walks through a doorway.

A person could have between 0 and 50 entries in the times table daily.

What is the proper and most efficient way to keep track of these records? Is it

times table
-----------
person_id
timestamp

I'm worried that this table can get well over a million records rather quickly. Insertion and retrieval times are of utmost importance.
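For concreteness, the first option as DDL might look like the sketch below (PostgreSQL-style syntax is assumed; the timestamp column is renamed event_time here so it doesn't clash with the SQL type keyword):

-- one row per doorway event
CREATE TABLE times (
    person_id  INT       NOT NULL,  -- could also be a foreign key to people (id)
    event_time TIMESTAMP NOT NULL
);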

ALSO: It's obviously not normalized, but would it be a better idea to do

times table
-----------
person_id
serialized_timestamps_for_the_day
date

We need to access each individual timestamp for a person, but we ONLY query records by date or by the person's id.


Solution

The second solution has some problems:

  • Since you need to access individual timestamps [1], serialized_timestamps_for_the_day cannot be considered atomic; it would violate 1NF and cause a bunch of problems.
  • On top of that, you are introducing redundancy: the date can be inferred from the contents of serialized_timestamps_for_the_day, and your application code would need to make sure the two never become "desynchronized", which is vulnerable to bugs. [2]

Therefore, go with the first solution. If properly indexed, a modern database on modern hardware can handle far more than "well over a million records". In this specific case:

  • A composite index on {person_id, timestamp} will allow you to query by person, or by a combination of person and date, with a simple index range scan, which can be very efficient (see the sketch after this list).
  • If you also need a "by date only" query, you'll need an index on {timestamp}. You can easily find all timestamps within a specific date by searching the range from midnight of that day up to, but not including, midnight of the next day.
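A minimal sketch of those two indexes and the corresponding queries, assuming PostgreSQL-style syntax and the times (person_id, event_time) table from the question (person 42 and the dates are just example values):

-- composite index: serves "by person" and "by person + date" lookups
CREATE INDEX idx_times_person_time ON times (person_id, event_time);

-- single-column index: serves "by date only" lookups
CREATE INDEX idx_times_time ON times (event_time);

-- all events for one person on one day (half-open range: midnight up to, but not including, the next midnight)
SELECT event_time
FROM   times
WHERE  person_id = 42
  AND  event_time >= DATE '2024-06-01'
  AND  event_time <  DATE '2024-06-02';

-- all events on one day, regardless of person
SELECT person_id, event_time
FROM   times
WHERE  event_time >= DATE '2024-06-01'
  AND  event_time <  DATE '2024-06-02';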

[1] Even if you don't query for individual timestamps, you still need to write them to the database one by one. With a serialized field, you first have to read the whole field just to append a single value, and then write the whole result back to the database, which can become a performance problem rather quickly. And there are other problems, as mentioned above.

[2] As a general rule, what can be inferred should not be stored unless there is a good performance reason to do so, and I don't see one here.

OTHER TIPS

Consider what we are talking about here. Accounting for just the raw data (event_time, user_id), this would be (4 + 4) * 1M ~ 8 MB per 1M rows. Let's try a rough estimate of what this takes in a DB.

One integer is 4 bytes and a timestamp is 4 bytes; add a row header of, say, 18 bytes -- this brings the first estimate of the row size to 4 + 4 + 18 = 26 bytes. With a page fill factor of about 0.7, that gives 26 / 0.7 ~ 37 bytes per row.

So, for 1M rows that would be about 37 MB. You will need an index on (user_id, event_time), so let's simply double that: 37 * 2 = 74 MB.

This brings the very rough, inaccurate estimate to 74 MB per 1M rows.

So, to keep all of this in memory at all times, you would need about 0.074 GB for each 1M rows of this table.

To get a better estimate, simply create the table, add the index, and fill it with a few million rows.

Given the expected data volume, this can all easily be tested with 10M rows even on a laptop -- testing always beats speculating.
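As a sketch of such a test, assuming PostgreSQL (generate_series, random(), pg_size_pretty, and pg_total_relation_size are PostgreSQL-specific), something along these lines loads 10M synthetic rows and then reports the real on-disk size, indexes included:

-- 10M synthetic events: ~100,000 hypothetical people, random times over 30 days
INSERT INTO times (person_id, event_time)
SELECT (random() * 100000)::int,
       TIMESTAMP '2024-06-01' + random() * INTERVAL '30 days'
FROM   generate_series(1, 10000000);

-- actual size of the table plus its indexes
SELECT pg_size_pretty(pg_total_relation_size('times'));

-- verify that a person + date query really uses the composite index
EXPLAIN ANALYZE
SELECT event_time
FROM   times
WHERE  person_id = 42
  AND  event_time >= DATE '2024-06-10'
  AND  event_time <  DATE '2024-06-11';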

P.S. Your option 2 does not look like an "obviously better idea" to me at all.

I think the first option would be the better one.

Even if you go for the second option, the size of the index might not shrink; in fact, there will be an additional column.

And since the data for different users is not related, you can shard the database based on person_id. That is, suppose your data cannot fit on a single database server node and requires two nodes; then the data for half the users will be stored on one node and the rest on the other.

This can be done with an RDBMS like MySQL, as well as with document-oriented databases like MongoDB and OrientDB (a minimal sketch follows).
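True cross-node sharding is usually handled in the application or a proxy layer (for example, routing each write by person_id modulo the number of nodes), but MySQL's built-in hash partitioning gives a single-server flavour of the same idea. A minimal sketch, reusing the table and column names assumed earlier (MySQL syntax):

-- MySQL: spread rows over 2 buckets by hash of person_id
CREATE TABLE times (
    person_id  INT      NOT NULL,
    event_time DATETIME NOT NULL,
    KEY idx_person_time (person_id, event_time)
)
PARTITION BY HASH (person_id)
PARTITIONS 2;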

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow