Question

Introduction:

I am trying to figure out what kind of database I should use for my project to get the best speed possible.

Background:

I am not sure if MySQL is the best solution, but I will use MySQL as an example to explain the data that I currently have. The data is (or will be) arranged in 2 tables.

Table 1 has 3 columns and Table 2 has 1 column. Both tables will have around 1 billion rows each.

The project will get the first value from Table 2 and check if it exists in the first column of Table 1.

This project is currently running on a computer which has a mechanical drive.

Rough estimates suggest that the data will use around 500GB of drive space.

Goals:

  1. To be able to insert the data into the tables as fast as possible,
  2. To be able to compare each value in Table 2 to see if the same value exists in column 1 of Table 1 as fast as possible.

Progress:

At the moment, I am inserting the data into a MySQL database and batching the inserts into groups of 1000, which averages about 200 inserts per second. At this rate, it will take about 2 months to generate and insert the 1 billion records.

Benchmark:

The process of generating and inserting the data is using around 5% CPU and around 1GB of the 16GB RAM available, and when checking the disk IO, it is averaging about 2MB/s. So my guess is that the mechanical drive is the bottleneck. I can constantly hear it seeking away.

Question:

With the possible mechanical drive as the bottleneck in mind (correct me if I am wrong about this being the bottleneck), is MySQL the best option to achieve the 2 goals above as fast as possible, or is there a different type of database I should be considering and using instead?


Solution

XY Problem

I see too many Red Flags in your post to fit into a comment. In general, you seem to have an XY Problem.

Problem 1

I can constantly hear it seeking away.

STOP! RUN! DO NOT WALK! GET THIS HD REPLACED ASAP!!

If what you hear is really a grinding sound (as opposed to normal seek noise), then your HD is already toast.

Problem 2

it is averaging about 2MB/s

Modern hard drives can sustain sequential transfer speeds in the 100-200 MB/s range (the SATA III interface itself tops out around 600 MB/s). Your observed 2 MB/s means you have a non-DB-related issue: the workload is dominated by random seeks rather than sequential throughput.

Make sure you run your IO benchmark outside of the DB (with a tool such as fio or dd). If your data process is not coming close to that measured transfer speed, then the bottleneck is not the disk IO.

Usually, it is with the way you are processing your data.

Problem 3

which averages about 200 inserts per second.

Ahh!! Now we are getting to the heart of the problem.

This is usually an indication that you are loading the data row by row ("slow-by-slow"), the slowest possible method.

You need to load the data in Bulk. When you do, you should see INSERT rates closer to 1 million rows in mere seconds, and your IO transfer rates should then be closer to what you measured outside of the DB.
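As a sketch of what bulk loading looks like in MySQL (the file path, table, and column names here are placeholders, and `LOAD DATA INFILE` may require the file to live under the server's `secure_file_priv` directory):

```sql
-- Generate the data to a flat file first, then load it in one pass
-- instead of issuing one INSERT per row:
LOAD DATA INFILE '/tmp/table_1.tsv'
INTO TABLE table_1
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(col_1, col_2, col_3);

-- If you must use INSERT, put many rows in one statement
-- (one round trip and one commit per batch, not per row):
INSERT INTO table_1 (col_1, col_2, col_3)
VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
```

Loading into a table with no secondary indexes and adding the indexes afterwards, or at least wrapping batches in a single transaction, typically speeds this up further.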

Problem 4

With the possible mechanical drive as the bottleneck

Use a RAID setup (RAID 0 or RAID 10, for example) to improve IO throughput and IOPS. This applies to both HDDs and SSDs.

Problem 5

  1. To be able to compare each value in Table 2 to see if the same value exists in column 1 of Table 1 as fast as possible.

Relational Databases are built on Relational Algebra. If you can restate your Business Requirement as a Relational Algebra question, then the solution becomes a no-brainer.

I suggest you Change your Requirements so that

  1. The values in Table_1.Column_1 need to be UNIQUE and NOT NULL
  2. The database is not allowed to accept a value into Table_2 that does not already exist in Table_1.Column_1.

From there, the answer becomes "Use a FOREIGN KEY on Table_2's column that points to the PRIMARY KEY (Table_1.Column_1)".
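A minimal schema sketch of that design, assuming MySQL with InnoDB (table and column names are placeholders matching the Table_1/Table_2 description above, and the column types are guesses since the post does not give them):

```sql
-- Parent: the values must be UNIQUE and NOT NULL,
-- so make the column the PRIMARY KEY.
CREATE TABLE table_1 (
    col_1 BIGINT NOT NULL PRIMARY KEY,
    col_2 VARCHAR(100),
    col_3 VARCHAR(100)
) ENGINE=InnoDB;

-- Child: the FOREIGN KEY makes the database reject any value
-- that does not already exist in table_1.col_1.
CREATE TABLE table_2 (
    col_1 BIGINT NOT NULL,
    CONSTRAINT fk_table_2_table_1
        FOREIGN KEY (col_1) REFERENCES table_1 (col_1)
) ENGINE=InnoDB;
```

With this in place, goal 2 is answered at INSERT time: any row that reaches Table_2 is guaranteed to have a match in Table_1.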

The overhead for the FK check during INSERT should be minimal.
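If you instead load Table_2 first (with the FK check deferred or disabled) and want to find the non-matching values afterwards, the set-based version of the comparison is a single anti-join rather than a billion point lookups. A sketch, using the same placeholder names as above:

```sql
-- Rows in table_2 whose value has no match in table_1.col_1:
SELECT t2.col_1
FROM table_2 AS t2
LEFT JOIN table_1 AS t1 ON t1.col_1 = t2.col_1
WHERE t1.col_1 IS NULL;
```

Because table_1.col_1 is the primary key, this runs as one indexed pass instead of a per-row round trip from the application.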

I am unaware of any major relational database that cannot handle a 1 billion row child table and/or a 1 billion row parent table. So, MySQL should be just fine. (BTW - "List of Software" questions are off topic for this forum.)

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange