Question

I have a database that links IP ranges to a location ID in another table. It's so large that we are trying to build a new solution using Aerospike.

The solution we came up with is to use intervals, something like this:

Key: 1000000 (int64 of the IP address)

Bin1: default:1 (location ID for the start of the given block)

Bin2: 1234567:2 (first IP in the block where the location ID changes):(locationID)

Bin3: 1345678:3 (second IP in the block where the location ID changes):(locationID)

etc.

Using this method we could derive the location ID from the IP mathematically while still cutting down on the number of rows and the amount of server processing time.
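To make the lookup concrete, here is a minimal C# sketch of the read side. The block size is an assumption (the question doesn't state one), and a plain string array stands in for the record's bin values rather than an actual Aerospike client call:

using System;

class IntervalLookupSketch
{
    const long BlockSize = 1000000;   // assumed block size; record key = first IP of the block

    // Resolve an IP to a location ID from one block record's bin values.
    // "default:<id>" holds the location at the start of the block; every other
    // entry is "<first IP where the location changes>:<new location ID>".
    static int LookupLocation(long ip, string[] bins)
    {
        int locationId = -1;
        long bestStart = -1;                       // IPs are non-negative, so -1 sorts before all of them

        foreach (var bin in bins)
        {
            var parts = bin.Split(':');
            long start = parts[0] == "default" ? 0 : long.Parse(parts[0]);
            if (start <= ip && start > bestStart)  // keep the latest change point that is <= ip
            {
                bestStart = start;
                locationId = int.Parse(parts[1]);
            }
        }
        return locationId;
    }

    static void Main()
    {
        // The record for block key 1000000 from the example above, expressed as its bin values.
        var bins = new[] { "default:1", "1234567:2", "1345678:3" };

        long ip = 1300000;                        // example IP inside the block
        long blockKey = ip - ip % BlockSize;      // -> 1000000, the record key to read
        Console.WriteLine($"block {blockKey} -> location {LookupLocation(ip, bins)}");   // location 2
    }
}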

I want to run some tests on my idea, but I'm having a problem converting our current system over.

We have a database that has ranges (e.g. 0-160000) with an associated location ID.

The range table has 9,297,631 rows.

A C# script I made executes this SQL:

SELECT * FROM dbo.GeoIPRanges 
where (startIpNum BETWEEN 300000000 AND 300000100)
OR (endIpNum BETWEEN 300000000 AND 300000100)
OR (startIpNum <= 300000000 AND endIpNum >= 300000100)

That takes about 4 seconds per call. The numbers above are example numbers; you can see they are in blocks of 100. The maximum IP value is 4,294,967,295, so doing this in blocks of 100 leaves me with 42,949,672 calls of about 4 seconds each, which takes a very long time (roughly 172 million seconds, or over five years). The processing time it takes to format the information into the fashion I want for Aerospike is negligible.
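For reference, the extraction loop being described is essentially the following sketch (the connection string is a placeholder; the table and column names come from the query above):

using System;
using System.Data.SqlClient;

class RangeExtractorSketch
{
    const long MaxIp = 4294967295;   // maximum IPv4 value, as in the question
    const long BlockSize = 100;

    static void Main()
    {
        // Placeholder connection string.
        using (var conn = new SqlConnection("Server=.;Database=GeoIP;Trusted_Connection=True;"))
        {
            conn.Open();
            for (long start = 0; start <= MaxIp; start += BlockSize)   // ~43 million iterations
            {
                long end = start + BlockSize;
                var cmd = new SqlCommand(
                    @"SELECT * FROM dbo.GeoIPRanges
                      WHERE (startIpNum BETWEEN @s AND @e)
                         OR (endIpNum BETWEEN @s AND @e)
                         OR (startIpNum <= @s AND endIpNum >= @e)", conn);
                cmd.Parameters.AddWithValue("@s", start);
                cmd.Parameters.AddWithValue("@e", end);

                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // format the row into the Aerospike layout described above
                    }
                }
            }
        }
    }
}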

Knowing all this information, are there any ideas on how to speed this up?


Solution

There is an Aerospike-loader tool. If you can dump your data in CSV format, the loader can load it into Aerospike. It can read multiple CSV files in parallel and load the data into Aerospike in parallel. In internal benchmarks, on decent hardware, we could load up to 200,000 records per second. Read the docs & examples for details.
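For example, the per-block records from the scheme in the question could be flattened into CSV rows along these lines (the column names and layout are purely illustrative, and the second row is made-up data; the actual mapping of columns to the record key and bins is something you define in the loader's JSON config file):

key,default,change1,change2
1000000,1,1234567:2,1345678:3
2000000,3,2001000:4,2090000:5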

OTHER TIPS

This may not be what you are thinking, but just yesterday I used R to extract some data sets from SQL Server, and it turned out to be orders of magnitude faster than running the extract in SQL Server itself. Do a little research on this methodology, and then try something like this...

# Connect to SQL Server over ODBC, run the queries from R, and write the result to CSV
library(XLConnect)   # read Excel workbooks
library(dplyr)
library(RODBC)       # ODBC connection to SQL Server

dbhandle <- odbcDriverConnect('driver={SQL Server};server=Server_Name;database=DB_Name;trusted_connection=true')

# Date used to name the output file
NEEDDATE <- as.Date('8/20/2017', format = '%m/%d/%Y')

# DataSet1: the query text is sent to SQL Server as-is
DataSet1 <- paste("DECLARE @NEEDDATE nvarchar(25)
SET @NEEDDATE = '2017-07-20'

SELECT      . . .

        )")

# DataSet2 would be built the same way with a second query string
DataSet1 <- sqlQuery(dbhandle, DataSet1)
DataSet2 <- sqlQuery(dbhandle, DataSet2)

Combined <- rbind(DataSet1, DataSet2)

# Optionally pull reference data from an existing workbook
ExcelFile <- loadWorkbook("C:/Test/Excel_File.xlsx")

Sheet1 <- tbl_df(readWorksheet(ExcelFile, sheet = "Sheet1"))
Sheet2 <- tbl_df(readWorksheet(ExcelFile, sheet = "Sheet2"))

# Write the extracted data out to CSV
write.csv(DataSet1, paste0("C:/Test/Output_", NEEDDATE, ".csv"), row.names = FALSE)