Question

I have a data file with 800 million rows and 3 columns. The csv file size is 30 GB.

I need to run some analysis on the data. Loading it into SQL Server took a very long time, and a simple query like the following takes about 10 minutes:

 SELECT MAX(VALUE) AS max_s
 FROM [myDB].[dbo].[myTable]

I also need to compute other statistics for each column, for example:

 SELECT COUNT(*) as num_rows, COUNT(DISTINCT VARIABLE1) as num_var1 FROM [myDB].[dbo].[myTable]

Can SQL Server, or some other tool, help me make this kind of analysis more efficient?

How about R? My laptop has only 8 GB of memory, so it is impossible to load the whole data set into a data frame.

More information about the data is in this related question: "get statistics information by SQL query efficiently for table with 3 columns and 800 million rows".

Some solutions have been given there, which I really appreciate, but I would like to find out whether there are more efficient ones.

Solution

You can greatly speed up your SQL queries by indexing your data, especially with large tables.

CREATE CLUSTERED INDEX index_name
ON [myDB].[dbo].[myTable] (value, cardID, locationID)

The command above creates a clustered index on your table. Replace the column names in the parentheses with your actual column names. A clustered index stores the rows sorted by the listed columns, which is why a table can have only one clustered index. You can create additional non-clustered indexes on top of it, but it is generally advisable for a table to have a clustered index.
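As a rough sketch of an additional non-clustered index: the statement below indexes cardID so that a query such as COUNT(DISTINCT cardID) can read a narrow, already sorted index instead of the whole table. The column name cardID comes from the example above, and the index name IX_myTable_cardID is made up for illustration; substitute your own.

```sql
-- Hypothetical index name; cardID is the example column used above.
-- A non-clustered index is a separate sorted copy of the listed column(s),
-- so it takes extra disk space and some time to build on 800 million rows.
CREATE NONCLUSTERED INDEX IX_myTable_cardID
ON [myDB].[dbo].[myTable] (cardID);

-- A query that can be answered from the narrow index alone:
SELECT COUNT(DISTINCT cardID) AS num_cards
FROM [myDB].[dbo].[myTable];
```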

If your data has a unique identifier (e.g. an id that is truly distinct for each observation), you can create a unique index on it with the CREATE UNIQUE INDEX statement. The uniqueness guarantee gives the query optimizer more to work with, so this is often one of the most effective ways to speed up your queries.
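For illustration only, assuming the table had a column named id holding one distinct value per row (the question does not mention such a column, so both id and the index name UX_myTable_id are hypothetical), the statement could look like this:

```sql
-- Assumption: 'id' is a hypothetical column with no duplicate values.
-- The statement fails if duplicates exist, which also makes it a
-- quick check that the column really is unique.
CREATE UNIQUE NONCLUSTERED INDEX UX_myTable_id
ON [myDB].[dbo].[myTable] (id);
```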

As a general rule, list the index columns in descending order of cardinality: the column with the most distinct values goes first in your "ON table (...)" clause, followed by columns with progressively fewer distinct values.
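If you do not already know the cardinality of each column, one way to find out is to count the distinct values once and order the index columns accordingly. This is a sketch that reuses the example column names from above; note that it scans the entire table, so on 800 million rows it will be slow and is worth running only once.

```sql
-- Count distinct values per column to decide the column order of the index.
-- Column names are the example names used above; replace with your own.
SELECT COUNT(DISTINCT value)      AS distinct_value,
       COUNT(DISTINCT cardID)     AS distinct_cardID,
       COUNT(DISTINCT locationID) AS distinct_locationID
FROM [myDB].[dbo].[myTable];
```

On SQL Server 2019 or later, APPROX_COUNT_DISTINCT can be swapped in for COUNT(DISTINCT ...) if an approximate answer is enough, since only the relative ordering of the columns matters here.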

Index syntax

Some more information on indexes

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow