Should I use UUIDs or integer primary keys to optimize for massive writes of relational data?

dba.stackexchange https://dba.stackexchange.com/questions/161545

  •  05-10-2020

Question

I am working on a computer vision data pipeline and am unsure of how to structure my database to optimize for writes.

I have a massive amount of image data that is collected on an ongoing basis. Image frames are used to build 1-3 second video clips, which are then labelled by a remote workforce. Workers label each clip (using a web application I built) with various properties (for example, does the clip contain object x?).

My current pipeline generates the video clips and sends them to S3. An Amazon Aurora (with MySQL compatibility) database is used to track each image frame, clip, and associated tags.

The 'frames' table contains an entry for every single image frame, with associated metadata.

The 'clips' table contains an entry for every clip. It has a field 'start_frame_id', a foreign key that identifies the first frame of the clip in the 'frames' table. The remote workforce accesses the clips from S3, using the sha256 hash of each clip as its key.

The 'labels' table contains an entry for each label created by a worker, and is related to the 'clips' table.

Both the 'clips' and 'frames' tables contain a sha256 hash of the original file.

This database needs to be heavily optimized for writes, as the number of frames and clips will be massive (approximately 500K frames will be added per day; clips run at 20 fps). All uploads to S3 and writes to the database are done from local machines.

The prototype that I have built uses auto-incrementing integers for primary keys. However, this requires the client to execute database writes in small chunks. Since each clip needs to have a reference to its start frame, it is necessary that I commit all the frames for a given clip in order to obtain the primary key of the first frame, before I can commit the clip. This solution also makes it tricky/impossible to later add insert-only write replicas. For this reason, I am debating using UUIDs instead of integers, but I know this can cause performance problems with joins.

Should I use UUIDs or integers?

Solution

UUIDs are useful when you have clients independently generating unique identifiers.
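To illustrate what "independently generating" buys you, here is a minimal sketch in Python, assuming the client builds its rows as plain dictionaries before touching the database. The function name build_clip_rows and the dictionary layout are hypothetical; only start_frame_id and the sha256 column come from the question.

```python
import uuid

def build_clip_rows(frame_hashes, clip_hash):
    """Assign IDs on the client, so a clip can reference its first frame
    before anything has been written to the database."""
    # One random (version 4) UUID per frame, generated locally.
    frames = [{"id": str(uuid.uuid4()), "sha256": h} for h in frame_hashes]
    clip = {
        "id": str(uuid.uuid4()),
        "start_frame_id": frames[0]["id"],  # known without a database round trip
        "sha256": clip_hash,
    }
    return frames, clip

# Placeholder hash strings, for illustration only.
frames, clip = build_clip_rows(["aa", "bb", "cc"], "dd")
```

Because every ID exists before the first INSERT, frames and clips could be written in any batch size, and by any number of writers, without coordinating on a counter.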

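id INT UNSIGNED AUTO_INCREMENT is smaller (4 bytes, versus 16 bytes for a UUID stored as binary), faster, and 'ordered' (new rows are appended in key order rather than scattered across the index).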
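The size difference is easy to check. The sketch below only measures key widths in Python; the column types named in the comments (BINARY(16), CHAR(36)) are common ways to store a UUID in MySQL, not something the question specifies.

```python
import struct
import uuid

u = uuid.uuid4()

int_key_bytes = struct.calcsize("<I")  # 4 bytes, the same width as INT UNSIGNED
uuid_binary_bytes = len(u.bytes)       # 16 bytes, e.g. stored as BINARY(16)
uuid_text_chars = len(str(u))          # 36 characters, e.g. stored as CHAR(36)

print(int_key_bytes, uuid_binary_bytes, uuid_text_chars)  # 4 16 36
```

In InnoDB every secondary index entry also carries the primary key, so a wider key costs space in each index on the table, not just in the primary one.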

Use UUIDs only if you don't have a viable alternative. More discussion: http://mysql.rjweb.org/doc.php/uuid
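One possible alternative for the situation in the question: the generated key of the first frame can be read back inside the same transaction, so frames and clip can be committed together. The sketch below uses Python's built-in sqlite3 module as a stand-in for Aurora MySQL so that it runs anywhere; the schema is a guess at the question's tables, and insert_clip is a hypothetical helper. In MySQL the equivalent read-back is LAST_INSERT_ID().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE frames (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        sha256 TEXT NOT NULL
    );
    CREATE TABLE clips (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        start_frame_id INTEGER NOT NULL REFERENCES frames(id),
        sha256 TEXT NOT NULL
    );
""")

def insert_clip(conn, frame_hashes, clip_hash):
    """Insert all frames of a clip plus the clip row in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.cursor()
        # Insert the first frame on its own so its generated id can be read.
        cur.execute("INSERT INTO frames (sha256) VALUES (?)", (frame_hashes[0],))
        start_frame_id = cur.lastrowid
        # The remaining frames can go in as one batch.
        cur.executemany(
            "INSERT INTO frames (sha256) VALUES (?)",
            [(h,) for h in frame_hashes[1:]],
        )
        cur.execute(
            "INSERT INTO clips (start_frame_id, sha256) VALUES (?, ?)",
            (start_frame_id, clip_hash),
        )
    return start_frame_id

# Placeholder hash strings, for illustration only.
start_id = insert_clip(conn, ["f0", "f1", "f2"], "c0")
```

This keeps integer keys and still needs only one commit per clip. It does not help with multiple independent writers, which is the case where UUIDs remain the simpler choice.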

In my opinion, sha256 is overkill for a 'digest'.

500K rows INSERTed per day? That averages about 6 per second. Not a problem. When you get to 100/sec, we should talk further.
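The arithmetic behind that estimate, as a short sketch. The clip counts assume every frame belongs to exactly one clip of 1-3 seconds at 20 fps, which the question implies but does not state outright.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
frames_per_day = 500_000

# Average insert rate if the writes were spread evenly over the day.
frames_per_second = frames_per_day / SECONDS_PER_DAY
print(round(frames_per_second, 2))  # 5.79

# Assumption: each frame is part of exactly one 1-3 second clip at 20 fps.
fps = 20
min_clips_per_day = frames_per_day // (3 * fps)  # all clips 3 seconds long
max_clips_per_day = frames_per_day // (1 * fps)  # all clips 1 second long
print(min_clips_per_day, max_clips_per_day)  # 8333 25000
```

These are daily averages; if the local machines upload in bursts, the peak rate will be higher than the average.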

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange