Question

We are currently researching how best to store road distances between cities. Right now we have 6 billion of those distances.

Our current structure in SQL Server uses a float to represent the relationship between two cities. For example, if there is a city with Id 1 in the Locations table and a city with Id 2 in the same table, the row holding the distance from 1 to 2 looks like 1.2, '1000 miles'. That float column is indexed.

So to get the distance from city 1000 to city 2535, we look up 1000.2535 in the Distances table.
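A single lookup against that design currently looks roughly like this:

SELECT
  distance
FROM
  Distances
WHERE
  id = 1000.2535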

Besides getting single distances, we need to select groups of 1000 distances from those 6 billion rows:

SELECT
  id
 ,distance
FROM
  Distances 
WHERE 
  id IN (1000.2535, 1.2, ...)

So far we've only tested SQL Server on a local machine, and it gives us around 300 ms for such a query of 1,000 rows, but only when we set a 50 ms timeout (which we need because of the many parallel requests from multiple users). Without the 50 ms timeout the time just keeps growing: 300 ms for the first query, 500 ms for the second, 800 ms for the third, and so on.

Right now we are also taking a look at Elasticsearch, specifically its mget (multi-get) API.

So my questions are:

  1. Which database would you recommend for such a use case?
  2. What would you recommend besides what we've already thought of, e.g. other ideas such as splitting the city IDs into two separate columns?
  3. What would be the best ways to optimize such a database?

Solution

My assumption here is that using a FLOAT in place of an actual composite key is hurting the query engine's ability to create proper estimates.

So instead of using the "clever" column, just use a composite primary key:

CREATE TABLE Country
(
  CountryCd  CHAR(3)      NOT NULL  --ISO Country Code
 ,Name       VARCHAR(50)  NOT NULL
 ,CONSTRAINT PK_Country PRIMARY KEY (CountryCd)
 ,CONSTRAINT AK_Country UNIQUE (Name)
)
;
CREATE TABLE Subdivision  --State/province/region/county etc.
(
  CountryCd      CHAR(3)      NOT NULL
 ,SubdivisionCd  CHAR(3)      NOT NULL  --ISO subdivision code
 ,Name           VARCHAR(50)  NOT NULL
 ,CONSTRAINT FK_Subdivision_Division_Of_Country FOREIGN KEY (CountryCd) REFERENCES Country (CountryCd)
 ,CONSTRAINT PK_Subdivision PRIMARY KEY (CountryCd, SubdivisionCd)
 ,CONSTRAINT AK_Subdivision UNIQUE (CountryCd, Name)
)
;
CREATE TABLE City
(
  CityId         INT          NOT NULL
 ,CountryCd      CHAR(3)      NOT NULL
 ,SubdivisionCd  CHAR(3)      NOT NULL
 ,Name           VARCHAR(50)  NOT NULL
 ,CONSTRAINT FK_City_Located_In_Subdivision FOREIGN KEY (CountryCd, SubdivisionCd) REFERENCES Subdivision (CountryCd, SubdivisionCd)
 ,CONSTRAINT PK_City PRIMARY KEY (CityId)
 ,CONSTRAINT AK_City UNIQUE (CountryCd, SubdivisionCd, Name)
)
;
CREATE TABLE CityDistance
(
  SourceCityId       INT  NOT NULL
 ,DestinationCityId  INT  NOT NULL
 ,RoadDistance       INT  NOT NULL
 ,CONSTRAINT FK_CityDistance_From_City FOREIGN KEY (SourceCityId) REFERENCES City (CityId)
 ,CONSTRAINT FK_CityDistance_To_City FOREIGN KEY (DestinationCityId) REFERENCES City (CityId)
 ,CONSTRAINT PK_CityDistance PRIMARY KEY (SourceCityId, DestinationCityId)
 ,CONSTRAINT CK_CityDistance_SourceCityId_LT_DestinationCityId CHECK (SourceCityId < DestinationCityId) --Enforces only one record per city pair
)
;

Easy enough to find the single distance:

SELECT
  RoadDistance
FROM
  CityDistance
WHERE
  SourceCityId = 7 
    AND DestinationCityId = 235

But as you said, the secondary requirement is to return 1,000 distances in less than 50 ms. You won't be able to use a single-column IN list anymore, so you will need to build a list of pairs for the WHERE clause (e.g. (SourceCityId = 7 AND DestinationCityId = 235) OR (SourceCityId = 1345 AND DestinationCityId = 2934)...).
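Spelled out a bit more, a batched lookup along those lines would look something like the sketch below (the pairs are placeholders, and per the CHECK constraint the smaller id always goes in SourceCityId):

SELECT
  SourceCityId
 ,DestinationCityId
 ,RoadDistance
FROM
  CityDistance
WHERE
  (SourceCityId = 7 AND DestinationCityId = 235)
  OR (SourceCityId = 1345 AND DestinationCityId = 2934)
  OR (SourceCityId = 12 AND DestinationCityId = 5678)
  --...and so on for the rest of the ~1,000 pairs; alternatively,
  --join against a table-valued parameter holding the pairs instead
  --of generating a long OR list as dynamic SQL.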

This should allow the query engine to look at things and say: "Oh, I only need to perform 1,000 seeks" and return your data relatively quickly. However, for 50 ms or less you're going to need very fast SSDs or the relevant pages already cached in memory.

Alternatives would be something like an in-memory key-value store. This would cut down on compilation and seek time somewhat, but I'd try using an actual composite key first and see what happens.
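If you would rather stay inside SQL Server, its In-Memory OLTP feature (memory-optimized tables) is the closest relational analogue to that idea. A minimal sketch, assuming the database already has a MEMORY_OPTIMIZED_DATA filegroup and, more importantly, enough RAM to hold the table (a serious assumption at 6 billion rows; you may only want a hot subset in there):

CREATE TABLE CityDistanceHot
(
  SourceCityId       INT  NOT NULL
 ,DestinationCityId  INT  NOT NULL
 ,RoadDistance       INT  NOT NULL
 ,CONSTRAINT PK_CityDistanceHot PRIMARY KEY NONCLUSTERED
    HASH (SourceCityId, DestinationCityId) WITH (BUCKET_COUNT = 1073741824)  --sized for many millions of pairs
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
;

A hash index like this only supports equality lookups, which is exactly the access pattern of the pair query above.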

OTHER TIPS

SQL Server is a relational database. If you want this to perform well, you will have to review the design so that it is normalized. Otherwise SQL Server has to read every row in order to find the "id"s, which will not perform well.

If you had a table with "From", "To" and "Distance" columns holding the ID of the first city, the ID of the second city and the distance between them, then indexes would be usable to seek to the right records and you would get great performance from SQL Server.
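In table form that suggestion is essentially the following sketch (names are illustrative):

CREATE TABLE Distance
(
  FromCityId  INT  NOT NULL
 ,ToCityId    INT  NOT NULL
 ,Distance    INT  NOT NULL  --e.g. in miles
 ,CONSTRAINT PK_Distance PRIMARY KEY CLUSTERED (FromCityId, ToCityId)  --supports seeks on the (from, to) pair
)
;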

P.S. Using a float to encode the "relation" between two cities is not a great way to do things in SQL.
