Problem

I've been tasked with estimating the marginal cost of joining an additional table in reporting queries written in SQL against a PostgreSQL database. I was given a test program to measure query performance, but it shows no statistically significant difference in run time whether or not the joined table is populated. How can I write a better test to illustrate the difference in query times between the two scenarios?

In each case, the SQL is the same, joining table A with table B. The only difference is whether or not table B contains any data.

Table A has these columns:

    Column    |            Type             
--------------+-----------------------------
 sid          | bigint                      
 cluster      | text                         
 sn           | text                        
 tag_id       | integer                      
 src_ip       | text                         
 dst_ip       | text                         
 dst_port     | integer                      
 protocol     | text                         
 src_intf     | text                         
 dst_intf     | text                         
 disp         | smallint                     
 rcvd_bytes   | bigint                       
 sent_bytes   | bigint                       
 duration     | integer                      
 count        | integer                      
 start_time   | timestamp without time zone  
 policy_id    | text                         
 src_user     | text                         
 dst_domain   | text                         
 app_id       | text                         
 signature_id | text                         
 deny_type_id | text                         
 reputation   | text                         
 wb_cat_id    | text                         
 alarm_name   | text                         
 virus        | text                         
 sender       | text                         
 recipients   | text                         
 host         | text                         
 dlp_rule_id  | text                         
 spam_type    | text                         
 spam_action  | text                         

Table B has these columns:

    Column    |            Type             
--------------+-----------------------------
 appliance_id | integer                      
 ip           | inet                         
 fqdn         | text                         
 resolve_time | timestamp without time zone  
 expire_time  | timestamp without time zone  

The two tables are joined by an IPv4 address of type inet.

Test data populates tables A and B with just over 500 rows of data each.

With table B populated, the test program that runs the query takes an average of 18.216 seconds of total time; the standard deviation across runs is 1.143 seconds.

With table B empty, the same program takes an average of 18.523 seconds of total time; the standard deviation across runs is 1.928 seconds.

The sample size in each case is six runs. I suspect I'll need a much larger sample size, but I'm not certain what an appropriate size would be.
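
For illustration, a Welch's t-test on the summary statistics above shows how the significance of the difference can be checked. This is a minimal sketch, assuming Python with SciPy is available; it is not the actual test program, and it assumes run times are roughly normal and independent.

# Welch's t-test on the reported summary statistics (n = 6 runs per case).
from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=18.216, std1=1.143, nobs1=6,  # table B populated
    mean2=18.523, std2=1.928, nobs2=6,  # table B empty
    equal_var=False,                    # Welch's variant: unequal variances
)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

A p-value well above 0.05 here would confirm that the observed 0.3-second difference is indistinguishable from run-to-run noise at this sample size.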

Solution

I wrote a simple query using a LEFT JOIN:

SELECT *
  FROM a
  LEFT JOIN b
    ON a.src_ip::inet = b.ip
   AND b.resolve_time IS NOT NULL AND b.resolve_time <= now()
   AND b.expire_time IS NOT NULL AND now() < b.expire_time;

I then ran 10 tests of 100 iterations (queries) each and timed the results of each test run.
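
A harness of that shape can be sketched as follows. This is a minimal sketch, assuming Python with psycopg2; the connection string and result handling are placeholders, not the actual test program.

# Timing harness sketch: 10 test runs of 100 query iterations each.
import time
import psycopg2

QUERY = """
SELECT *
  FROM a
  LEFT JOIN b
    ON a.src_ip::inet = b.ip
   AND b.resolve_time IS NOT NULL AND b.resolve_time <= now()
   AND b.expire_time IS NOT NULL AND now() < b.expire_time
"""

def run_test(conn, iterations=100):
    """Run the query `iterations` times and return the elapsed wall-clock time."""
    start = time.perf_counter()
    with conn.cursor() as cur:
        for _ in range(iterations):
            cur.execute(QUERY)
            cur.fetchall()  # force the result set to be transferred
    return time.perf_counter() - start

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=reporting")  # placeholder connection string
    for test in range(10):
        print(f"test {test + 1}: {run_test(conn):.3f} s")
    conn.close()

Fetching the full result set on each iteration matters; without it, the timing mostly measures statement dispatch rather than the cost of the join itself.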

Here are the results:

[Chart: Query Run Times]

By averaging the run times and taking the ratio of the average with table B populated to the average with table B empty, I computed an overhead of 21.38% for the query with table B populated over the query with it empty.
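
The arithmetic is straightforward; here is a sketch with placeholder numbers (not the measured values, which are in the chart above):

# Overhead computation sketch; run_times_* hold the per-test totals in seconds.
# The values below are placeholders, not the measured data.
run_times_populated = [22.1, 21.8, 22.4]  # ...ten values in practice
run_times_empty = [18.2, 18.0, 18.5]      # ...ten values in practice

mean_populated = sum(run_times_populated) / len(run_times_populated)
mean_empty = sum(run_times_empty) / len(run_times_empty)

overhead_pct = (mean_populated / mean_empty - 1) * 100
print(f"overhead: {overhead_pct:.2f}%")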
