Which database to use when you have a 6 billion of rows inside and need to query rows with from list of IDs? [closed]
-
08-03-2021 - |
質問
We are currently researching our case of storing the road distances between cities. Right now we have 6 billion of those distances.
Our structure right now in SQL Server is that we have a float
which represents the relationship between cities. For example if we have a city with Id
1 inside the Locations
table and a city with Id
2 inside the same table, a row with distance from 1 to 2 will look like 1.2,'1000 miles'
. That column is indexed.
So to get the distance from the city 1000
to the city 2535
, we would find 1000.2535
inside the Distances
.
Besides getting single distances, we need to select groups of 1000 distances from those 6 billion rows:
SELECT
id
,distance
FROM
Distances
WHERE
id IN (1000.2535, 1.2, ...)
Right now we've only tested SQL Server on a local machine and it gives us around 300 ms for such query of 1000 rows, but only when we set a 50 ms timeout (this is needed for a lot of parallel requests from multiple users), if 50 ms timeout is not used it just grows exponentially like 300 ms for the first, 500 ms for the second, 800 ms for the third, etc.
And right now we taking a look at ElasticSearch specifically for mget
.
So my questions are:
- Which database would you recommend for such a use case?
- What would you recommend besides what we've thought of maybe some other ideas like splitting into two different columns cities IDs, etc?
- What would be the best ways to optimize such a database?
解決
My assumption here is using FLOAT
in place of the actual composite key is slowing the query engine's ability to create the proper estimates.
So instead of using the "clever" column, just use a composite primary key:
CREATE TABLE Country
(
CountryCd CHAR(3) NOT NULL --ISO Country Code
,Name VARCHAR(50) NOT NULL
,CONSTRAINT PK_Country PRIMARY KEY (CountryCd)
,CONSTRAINT AK_Country UNIQUE (Name)
)
;
CREATE TABLE Subdivision --State/province/region/county etc.
(
CountryCd CHAR(3) NOT NULL
,SubdivisionCd CHAR(3) NOT NULL --ISO subdivision code
,Name VARCHAR(50) NOT NULL
,CONSTRAINT FK_Subdivision_Division_Of_Country FOREIGN KEY (CountryCd) REFERENCES Country (CountryCd)
,CONSTRAINT PK_Subdivision PRIMARY KEY (CountryCd, SubdivisionCd)
,CONSTRAINT AK_Subdivision UNIQUE (CountryCd, Name)
)
;
CREATE TABLE City
(
CityId INT NOT NULL
,CountryCd CHAR(3) NOT NULL
,SubdivisionCd CHAR(3) NOT NULL
,Name VARCHAR(50) NOT NULL
,CONSTRAINT FK_City_Located_In_Subdivision FOREIGN KEY (CountryCd, SubdivisionCd) REFERENCES Subdivision (CountryCd, SubdivisionCd)
,CONSTRAINT PK_City PRIMARY KEY (CityId)
,CONSTRAINT AK_City UNIQUE (CountryCd, SubdivisionCd, Name)
)
;
CREATE TABLE CityRoadDistance
(
SourceCityId INT NOT NULL
,DestinationCityId INT NOT NULL
,RoadDistance INT NOT NULL
,CONSTRAINT FK_CityDistance_From_City FOREIGN KEY (SourceCityId) REFERENCES City (CityId)
,CONSTRAINT FK_CityDistance_To_City FOREIGN KEY (DestinationCityId) REFERENCES City (CityId)
,CONSTRAINT PK_CityDistance PRIMARY KEY (SourceCityId, DestinationCityId)
,CONSTRAINT CK_CityDistance_SourceCityId_LT_DestinationCityId CHECK (SourceCityId < DestinationCityId) --Enforces only one record per city pair
)
;
Easy enough to find the single distance:
SELECT
RoadDistance
FROM
CityDistance
WHERE
SourceCityId = 7
AND DestinationCityId = 235
But as you said the secondary requirement is to return 1,000 in less than 50 ms.
You won't be able to use IN
, so you will need to build a list for the WHERE
clause (e.g. (SourceCityId = 7 AND DestinationCityId = 235) OR (SourceCityId = 1345 AND DestinationCityId = 2934)
...).
This should allow the query engine to look at things and say: "Oh, I only need to perform 1,000 seeks" and return your data relatively quickly. However for 50 ms or less you're going to need some very fast SSDs or the pages already cached in memory.
Alternatives would be something like an in-memory key-value store. This would cut down on compilation and seek time somewhat, but I'd try using an actual composite key and seeing what happens.
他のヒント
SQL is a relationnal database. If you want this to perform well in SQL, you will have to review the design so that it is normalized. Otherwise, SQL will have to read all row in order to find the "id"s which will not perform well.
If you had a table with "From", "to" and "distance" that contains the ID of the first city, id of the second city and the distance, then indexes would be usable to seek into the right records and you would get great performance from SQL.
P.s. Using a float type to store "relation" is not a great way to do stuff in SQL.