Inner join and Split on large volume of data

https://stackoverflow.com/questions/13092561

14-07-2021

Question

We are working with a large volume of data (row counts given below):

Table 1 : 708408568 rows  -- 708 million
Table 2 : 1416817136 rows -- 1.4 billion

Table 1 Schema:
----------------
ID -      Int PK
column2 - Int

Table 2 Schema
----------------
Table1ID - Int FK
SomeColumn - Int
SomeColumn - Int

Table 1's primary key (ID) serves as the foreign key for Table 2.

Index details :

Table1 : 
PK Clustered Index on Id
Non Clustered (Non Unique) on column2

Table 2 :
Table1ID (FK) Clustered Index
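
For reference, the layout described above corresponds roughly to the DDL below. This is only a sketch: the constraint and index names are hypothetical, nullability is assumed, the two SomeColumn entries are given distinct placeholder names because column names must be unique, and the clustered index on Table 2 is assumed to be non-unique since Table 2 holds twice as many rows as Table 1.

```sql
-- Sketch only: all constraint/index names below are hypothetical.
CREATE TABLE Table1
(
    ID      int NOT NULL,
    column2 int NULL,                       -- nullability is an assumption
    CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (ID)
);

-- Non-unique nonclustered index used by the filter on column2.
CREATE NONCLUSTERED INDEX IX_Table1_column2 ON Table1 (column2);

CREATE TABLE Table2
(
    Table1ID    int NOT NULL,
    SomeColumn1 int NULL,                   -- placeholder name
    SomeColumn2 int NULL,                   -- placeholder name
    CONSTRAINT FK_Table2_Table1 FOREIGN KEY (Table1ID) REFERENCES Table1 (ID)
);

-- Clustered index on the foreign key column (assumed non-unique).
CREATE CLUSTERED INDEX CIX_Table2_Table1ID ON Table2 (Table1ID);
```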

Below is the query which needs to be executed :

SELECT t1.[id]
      ,t1.[column2]
FROM  Table1 t1
inner join Table2 t2
    on t1.[id] = t2.[Table1ID]
WHERE t1.[column2] in (select [id] from ConvertCsvToTable('1,2,3,4,5.......10000')) -- 10,000 comma-separated Ids

To summarize: the inner join on ID should be handled by the clustered indexes on the PK and FK columns, and the "huge" WHERE condition on column2 is backed by the nonclustered index.

However, the query takes 4 minutes for a small subset of 100 Ids, and we need to pass 10,000 Ids.

Is there a better way, design-wise, to do this? Would table partitioning help?

I am looking for ways to handle a high-volume SELECT with an INNER JOIN and a WHERE ... IN.

Note: ConvertCsvToTable is a split function that has already been determined to perform optimally.

Thanks!

Solution

This is what I would try: create a temp table with the same structure as the function's result set. Make sure to declare the ID column as the primary key so that the optimizer takes it into consideration...

CREATE TABLE #temp
(id    int          not null
    ...
,PRIMARY KEY (id) )

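Then call the function to populate it (ConvertCsvToTable is used as a table-valued function in the question, so select from it rather than EXEC it):

insert into #temp (id) select [id] from ConvertCsvToTable('1,2,3,4,5.......10000')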

Then join the temp table directly in the query:

SELECT t1.[id], t1.[column2]
FROM  Table1 t1, Table2 t2, #temp
where t1.id = t2.Table1ID
  and t1.[column2] = #temp.id
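
The three steps above can be put together as one script, sketched below with explicit JOIN syntax and with I/O and timing statistics switched on so the result can be compared against the original query. It assumes that ConvertCsvToTable is a table-valued function returning an int column named id, that the temp table needs only that one column, and that a shortened sample list stands in for the real 10,000 values.

```sql
-- Sketch under the assumptions stated above; not a tuned implementation.
SET STATISTICS IO ON;    -- report logical/physical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time per statement

DECLARE @csv varchar(max) = '1,2,3,4,5';  -- shortened sample list

CREATE TABLE #temp (id int NOT NULL PRIMARY KEY);

-- DISTINCT guards against duplicate Ids violating the primary key.
INSERT INTO #temp (id)
SELECT DISTINCT [id]
FROM ConvertCsvToTable(@csv);

SELECT t1.[id], t1.[column2]
FROM #temp
INNER JOIN Table1 t1 ON t1.[column2] = #temp.id
INNER JOIN Table2 t2 ON t2.Table1ID = t1.[id];

DROP TABLE #temp;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Materializing the Ids first gives the optimizer a real table with a unique index and row-count statistics, which it typically does not have for the output of a multi-statement split function.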

Other tips

Bring the condition into the join.
It gives the optimizer a chance to filter by t1.[column2] first.
Also try different join hints (for example, hash).

SELECT t1.[id], t1.[column2]
FROM  Table1 t1 with (nolock)
inner join Table2 t2 with (nolock)
   on t1.[id] = t2.[Table1ID]
  and t1.[column2] in (select [id] from ConvertCsvToTable('1,2,3,4,5.......10000'))

You may need to tell it to use the index on column2.
But first give it a chance to do the right thing.
With the condition in the WHERE clause, you were not giving it that chance.
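
If the optimizer does not pick the index on column2 by itself, a table hint can name it explicitly. The sketch below assumes the nonclustered index is called IX_Table1_column2 (a hypothetical name; substitute the real one). The commented-out OPTION line shows where a query-level join hint would go if you want to experiment with hash, loop, or merge joins.

```sql
-- Sketch: IX_Table1_column2 is a hypothetical index name.
SELECT t1.[id], t1.[column2]
FROM  Table1 t1 WITH (NOLOCK, INDEX(IX_Table1_column2))
INNER JOIN Table2 t2 WITH (NOLOCK)
   ON t1.[id] = t2.[Table1ID]
  AND t1.[column2] IN (SELECT [id] FROM ConvertCsvToTable('1,2,3,4,5.......10000'))
-- OPTION (HASH JOIN)   -- optional: forces hash joins for the whole query;
                        -- LOOP JOIN and MERGE JOIN are the alternatives
;
```

Hints override the optimizer's own choice, so it is worth comparing the hinted and unhinted plans on the real data before keeping one.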

If you go with #temp, then try the query below
(and declare a PK on the temp table as Rodolfo stated, +1).
This will pretty much force it to start with the small table.
The optimizer could still make a poor choice and do the join on T2 first, but I doubt it.

SELECT t1.[id], t1.[column2]
FROM #temp 
JOIN Table1 t1 with (nolock)
  on t1.[column2] = #temp.ID 
join Table2 t2 with (nolock)
   on t2.Table1ID = t1.ID
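
If the optimizer does reorder the joins and starts with Table2 anyway, the FORCE ORDER query hint tells SQL Server to keep the join order as written. A minimal sketch, assuming #temp has already been created and populated as described in the accepted answer:

```sql
-- Sketch: joins are evaluated in the written order (#temp, then Table1, then Table2).
SELECT t1.[id], t1.[column2]
FROM #temp
JOIN Table1 t1 WITH (NOLOCK)
  ON t1.[column2] = #temp.ID
JOIN Table2 t2 WITH (NOLOCK)
  ON t2.Table1ID = t1.ID
OPTION (FORCE ORDER);
```

This is a blunt instrument: it fixes the order even if the data distribution later changes, so it is probably best kept as a fallback once the unhinted query has been checked.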

License: CC-BY-SA with attribution

Not affiliated with StackOverflow