Can MySQL handle 100 million+ rows? [duplicate]

https://stackoverflow.com/questions/23302238

09-07-2023

Question

I run a small to medium car website, and we are trying to log how many times visitors view a vehicle's detail page. We do this by taking the md5 hash of the make, model, and zip code of the current vehicle. We then keep a vehicle_count total and increment it whenever the hash matches.

After running the numbers, there appear to be about 50 makes, each make has about 50 models, and our locations db has about 44,000 unique zip codes. That is roughly 50 × 50 × 44,000 = 110 million potential unique hashes.

This is the create table:

CREATE TABLE `vehicle_detail_page` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `vehicle_hash` char(32) NOT NULL,
  `make` varchar(100) NOT NULL,
  `model` varchar(100) NOT NULL,
  `zip_code` char(7) DEFAULT NULL,
  `vehicle_count` int(6) unsigned DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `vehicle_hash` (`vehicle_hash`),
  KEY `make` (`make`),
  KEY `model` (`model`),
  KEY `zip_code` (`zip_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This is the PHP code to insert/update the table:

public function insertUpdate($make, $model, $zip)
{
    // set table
    $table = self::TABLE;        
    // create hash
    $hash = md5($make.$model.$zip);

    // insert or update count
    try
    {
        $stmt = $this->db->conn->prepare("INSERT INTO $table
                                                (vehicle_hash, 
                                                    make, 
                                                    model, 
                                                    zip_code)
                                          VALUES
                                                (:vehicle_hash, 
                                                    :make, 
                                                    :model, 
                                                    :zip_code)
                                          ON DUPLICATE KEY UPDATE
                                                    vehicle_count = vehicle_count + 1;");
        $stmt->bindParam(':vehicle_hash', $hash, PDO::PARAM_STR);
        $stmt->bindParam(':make', $make, PDO::PARAM_STR);
        $stmt->bindParam(':model', $model, PDO::PARAM_STR);
        $stmt->bindParam(':zip_code', $zip, PDO::PARAM_STR);
        $stmt->execute();
    } catch (Exception $e)
    {
        return FALSE;
    }

    return TRUE;
}

Questions:

Can MySQL handle this many rows?
Does anyone see anything wrong with this code, and is there a better way to do this?
What will querying this data be like?

The big question is: once this table grows, how will the PHP function above perform? If and when the table has a few million or more rows, how will the table itself perform? Can anyone give some insight?

Solution

You could also avoid the hash altogether.

CREATE TABLE `vehicle_visits` (
  -- NOT NULL matters here: a UNIQUE index allows repeated NULLs, which would permit duplicate rows
  `make` varchar(100) NOT NULL,
  `model` varchar(100) NOT NULL,
  `zip_code` char(7) NOT NULL,
  `vehicle_count` int(11) NOT NULL DEFAULT '1',
  UNIQUE KEY `make_model_zip` (`make`,`model`,`zip_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This avoids maintaining two unique keys. Instead of "ID" and "Hash", you can use the real-world values themselves as the unique identifier. Notice how MySQL can combine 3 columns into a single unique index.
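With this table, the insert-or-increment from the question can target the composite key directly. The following is a minimal sketch, assuming none of the three key columns is ever NULL (MySQL allows repeated NULLs in a UNIQUE index, so a NULL zip_code would never trigger the duplicate-key path). The literal values are placeholders.

```sql
-- The first visit inserts the row with a count of 1; later visits collide
-- with the make_model_zip unique key and increment the existing count.
INSERT INTO vehicle_visits (make, model, zip_code, vehicle_count)
VALUES ('Honda', 'Civic', '90210', 1)
ON DUPLICATE KEY UPDATE vehicle_count = vehicle_count + 1;
```

The count is passed explicitly so the statement does not depend on the column's default value.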

Note: to decrease the size of your index, you can shorten the make and model columns, unless you actually expect 100-character make and model names. If you are worried about size, you can also build the index on a prefix of each column.
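As a rough sketch of the prefix idea, the index below covers only the first 10 characters of make and model. The index name and the prefix lengths are illustrative assumptions, not tuned values. It is declared non-unique on purpose: a UNIQUE prefix index enforces uniqueness on the prefixes only, so two different models sharing their first 10 characters would be rejected as duplicates.

```sql
-- Secondary index on column prefixes; zip_code is short enough to index whole.
CREATE INDEX make_model_zip_prefix
    ON vehicle_visits (make(10), model(10), zip_code);
```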

Edit: adding the hash column as an index method

As an alternative to a composite index, you can introduce a column that is “hashed” based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns. http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html
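A lookup against such a hash column might look like the sketch below, written against the vehicle_detail_page table from the question. The hash is built with plain concatenation so that it matches what md5($make.$model.$zip) stores. The extra comparisons on the real columns guard against two different combinations producing the same hash. The literal values are placeholders.

```sql
-- The unique index on vehicle_hash narrows the search to at most one row;
-- the remaining conditions confirm that row really is the requested vehicle.
SELECT vehicle_count
FROM vehicle_detail_page
WHERE vehicle_hash = MD5(CONCAT('Honda', 'Civic', '90210'))
  AND make = 'Honda'
  AND model = 'Civic'
  AND zip_code = '90210';
```

Because the values are concatenated without a separator, different inputs can in principle yield the same string before hashing, which is one more reason to keep the column checks.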

You will need to run some real-world tests to see which method is quicker. Since the data has only about 50 makes and about 50 models per make, most of the selectivity in a lookup comes from the zip_code column. The order of the columns in the index also makes a difference. Also, an index built on prefixes such as make(10), model(10), zip_code(7) has entries of at most 27 characters. An md5 column, on the other hand, takes 32.

The hash method may help with lookups, but will it really help in real-world applications? This table seems to track visitors, and will most likely have analytics performed on it. The multiple-column index will help with SUM() operations (depending on the order of its columns). For example, if I want to find the total number of visitors to the "Honda" or "Honda Civic" pages, it is easily done with the multiple-column index.
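For instance, with the make_model_zip index on the vehicle_visits table above, both totals can be computed by filtering on the leading columns of the index. This is a sketch with placeholder values:

```sql
-- Total visits for one make (filters on the leading column of the index).
SELECT SUM(vehicle_count) AS total_visits
FROM vehicle_visits
WHERE make = 'Honda';

-- Total visits for one make and model (filters on the first two columns).
SELECT SUM(vehicle_count) AS total_visits
FROM vehicle_visits
WHERE make = 'Honda' AND model = 'Civic';
```

A filter on zip_code alone would not benefit in the same way, because zip_code is the last column of the index. This is where the index order matters.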

Licensed under: CC-BY-SA with attribution
