Question

I have a query as below:

SELECT 
ducc.*, dl.LOCATIONID, dl.LOCATIONNAME
FROM [table1] ducc
LEFT OUTER JOIN EACH [table2] dl
ON ducc.LOCATIONID = dl.LOCATIONID
WHERE ABS(ducc.LOCATIONID % 30) = 0

It's giving me "Shuffle failed with error: Cannot shuffle more than 3.00T in a single shuffle. One of the shuffle partitions in this query exceeded 7.68G. Strategies for working around this error are available on the go/dremelfaq."

I would assume it's not able to sort and shuffle the data properly because I'm pulling two columns from [table2], so the complexity of the permutation is high.

Any workaround for this?


Solution 2

Thanks, Jordan, for the insight.

I guess case 2 was the cause of the problem:

"The distribution of join keys is highly unbalanced. That is, if one LOCATIONID accounts for a large proportion of the rows in table1. Sometimes this might be expected. Sometimes, it is because of default value. For example, if table1 has a lot of rows where the LOCATIONID is not known, so by convention, 0 is used, this would mean that a lot of data gets hashed to the same location."

Most of the values in table1.LOCATIONID were NULL. So even though all the table2.LOCATIONID values were unique, it was failing.

As soon as I joined on a column that has 99% distinct values in both table1 and table2, it worked like a charm.
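
For reference, a minimal legacy-SQL sketch of the kind of skew check that exposes this (assuming [table1] stands in for the real table name): count the rows per join key and see whether NULL or a default value dominates.

    -- Rows per join key in table1; a NULL or default key at the top of this list means skew.
    SELECT LOCATIONID, COUNT(*) AS cnt
    FROM [table1]
    GROUP EACH BY LOCATIONID
    ORDER BY cnt DESC
    LIMIT 10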

OTHER TIPS

There are a couple of possibilities:

  1. Join explosion. Are the location IDs in table2 unique? If not, you could be creating an NxM expansion where all N matching rows in table1 match with all M matching rows in table2, producing more rows in the output than in the input (a quick uniqueness check is sketched after this list).
  2. The distribution of join keys is highly unbalanced; that is, one LOCATIONID accounts for a large proportion of the rows in table1. Sometimes this might be expected. Sometimes it is because of a default value. For example, if table1 has a lot of rows where the LOCATIONID is not known and, by convention, 0 is used, a lot of data gets hashed to the same location.
  3. It is also possible that this is something that should just work. If you provide a job id of a failed job, one of the BigQuery engineers can look up the issue and see what went wrong.
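
If you suspect case 1, a minimal legacy-SQL sketch for checking whether table2 really has one row per LOCATIONID (again, [table2] stands in for the real table name):

    -- Any key returned here appears more than once in table2 and will fan out the join.
    SELECT LOCATIONID, COUNT(*) AS dup_count
    FROM [table2]
    GROUP EACH BY LOCATIONID
    HAVING dup_count > 1
    ORDER BY dup_count DESC
    LIMIT 10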

Note that for these issues, the partitioning you're doing (ABS(ducc.LOCATIONID % 30) = 0) won't necessarily help, since the values that satisfy it will all get hashed to the same location.

You have a couple of things you can try:

  1. If you've got a join explosion, you could do a GROUP EACH BY in a subselect on the right side of the join so you only get distinct values. For example:

    SELECT 
    ducc.*, dl.LOCATIONID, dl.LOCATIONNAME
    FROM [table1] ducc
    LEFT OUTER JOIN EACH 
        (SELECT LOCATIONID, MIN(LOCATIONNAME) as LOCATIONNAME 
         FROM [table2] 
         GROUP EACH BY LOCATIONID)  dl
    ON ducc.LOCATIONID = dl.LOCATIONID
    WHERE ABS(ducc.LOCATIONID % 30) = 0
    
  2. Remove the EACH qualifier. This means you won't have to do a shuffle, but it only works if table2 is small enough. You can apply the filtering to that table instead to shrink it, which may help, as in:

    SELECT 
    ducc.*, dl.LOCATIONID, dl.LOCATIONNAME
    FROM [table1] ducc
    LEFT OUTER JOIN 
        (SELECT LOCATIONID, LOCATIONNAME
         FROM [table2] 
         WHERE ABS(LOCATIONID % 30) = 0)  dl
    ON ducc.LOCATIONID = dl.LOCATIONID
    
  3. If the problem is that one of your hash buckets is too large because of a dummy value that gets matched, you can, of course, try filtering it out. If it is a legitimate value that accounts for a large proportion of the matches, you can break the query into pieces: do the part that has 'too many matches' first as a JOIN without the EACH, and the other parts as a separate query with JOIN EACH. You can concatenate the results together by specifying that you want to append the results of the first query to the second (a sketch follows this list).
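
A minimal legacy-SQL sketch of that split, assuming the overloaded key is the default value 0 and that both queries write to the same destination table (the second run with the append disposition, i.e. "Append to table" / writeDisposition=WRITE_APPEND):

    -- Query 1: only the overloaded key (hypothetically 0), joined without EACH so the
    -- big bucket is never shuffled; write the result to a destination table.
    SELECT ducc.*, dl.LOCATIONID, dl.LOCATIONNAME
    FROM [table1] ducc
    LEFT OUTER JOIN 
        (SELECT LOCATIONID, LOCATIONNAME FROM [table2] WHERE LOCATIONID = 0) dl
    ON ducc.LOCATIONID = dl.LOCATIONID
    WHERE ducc.LOCATIONID = 0

    -- Query 2: all remaining keys with JOIN EACH; run it against the same destination
    -- table with the append disposition so the two result sets end up together.
    SELECT ducc.*, dl.LOCATIONID, dl.LOCATIONNAME
    FROM [table1] ducc
    LEFT OUTER JOIN EACH [table2] dl
    ON ducc.LOCATIONID = dl.LOCATIONID
    WHERE ducc.LOCATIONID != 0 OR ducc.LOCATIONID IS NULL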

Licensed under: CC-BY-SA with attribution