get statistics information by SQL query efficiently for table with 3 columns and 800 million rows

StackOverflow https://stackoverflow.com/questions/23390699

سؤال

I have a data table with 3 columns and 800 million rows in SQL server database.

  locationID     cardID    value
  NY_xxxxx     xxxxxx    xxxx.xxxx   // x : integer digits from 0 to 9
  NY_xxxxx     xxxxxx    xxxx.xxxx  
  NY_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  IA_xxxxx     xxxxxx    xxxx.xxxx  
  ...

I need to count the distinct numbers of cardID for the same location.

Also, I need to count location numbers in the same state. For example, for NY in the above table, we have 3 locations.

I also need to know how many locations in each state, and how many cardID in each location and how many cardID in each state.

How to get these statistics information by SQL query efficiently? The data table size is large.

هل كانت مفيدة؟

المحلول

The OP probably already knows this, but here is how to get the answer, disregarding efficiency. First, cards per location, as described in the comments:

SELECT locationid, COUNT(DISTINCT cardID)
FROM table 
GROUP BY locationid 

Next the same thing per state.

SELECT substring(locationid, 1, 2) state, COUNT(DISTINCT cardID)
FROM table 
GROUP BY substring(locationid, 1, 2)

For a single state,

select COUNT(DISTINCT cardID)
from table 
where substring(locationid, 1, 2) = 'NY'

The problem with the first query is that it will return locationID's like NY_1234. If you have not memorized what NY_1234 is in layman's terms, Rochester for example, you have to go outside your database to see what it is.

The second query will be inefficient because it has to apply the substring function on a large number of records. The third will be inefficient because it will have to scan the entire table.

If you had a relational model you could do stuff like this.

select municipality, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by municipality

select stateCode, count(distinct cardID)
from table join location on table.locationid = location.locationID
group by stateCode

State codes would be NY, NJ, etc If you wanted New York, New Jersey, etc, that would simply be another field.

Finally,

select count(distinct cardID)
from table join location on table.locationid = location.locationID
where stateCode = 'NY'

The second two relational queries would be faster than the single table queries because they wouldn't have to use functions. You could speed it up even more with an index on stateCode, but even if you didn't, a much smaller table would be scanned.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top