Question

We're writing a wrapper for bq.py and are having problems with result sets larger than 100k rows. This seems to have worked fine in the past (we had related problems; see Google BigQuery Incomplete Query Replies on Odd Attempts). Perhaps I'm misunderstanding the limits explained on the doc page?

For instance:

#!/bin/bash

for i in $(seq 99999 100002); do
    bq query -q --nouse_cache --max_rows 99999999 "SELECT id FROM [publicdata:samples.wikipedia] LIMIT $i" > "$i.txt"
    j=$(wc -l < "$i.txt")
    echo "Limit $i Returned $j Rows"
done

Yields (note that bq's pretty-printed table adds 4 lines of formatting: the header row plus three border lines):

Limit 99999 Returned   100003 Rows
Limit 100000 Returned   100004 Rows
Limit 100001 Returned   100004 Rows
Limit 100002 Returned   100004 Rows

In our wrapper, we directly access the API:

# row_count starts at 0 and page_token at None; total_rows is seeded from
# the query job, and table_dict identifies the result table.
while row_count < total_rows:
    data = client.apiclient.tabledata().list(maxResults=total_rows - row_count,
                                             pageToken=page_token,
                                             **table_dict).execute()

    # If there are more results than will fit on a page,
    # you will receive a token for the next page
    page_token = data.get('pageToken', None)

    # How many rows are there across all pages?
    total_rows = min(total_rows, int(data['totalRows']))
    raw_page = data.get('rows', [])
    row_count += len(raw_page)

We would expect to get a token in this case, but none is returned.
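
For reference, here is a minimal, self-contained sketch of the pagination pattern we are relying on (client and table_dict are the same names as above; the page size and loop structure are illustrative, not our actual wrapper):

# Sketch: drive the loop off pageToken rather than totalRows.
rows = []
page_token = None
while True:
    kwargs = dict(table_dict)
    kwargs['maxResults'] = 10000  # illustrative page size
    if page_token is not None:
        kwargs['pageToken'] = page_token
    data = client.apiclient.tabledata().list(**kwargs).execute()
    rows.extend(data.get('rows', []))

    # The server should return a pageToken whenever more rows remain;
    # its absence is the only end-of-results signal this loop trusts.
    page_token = data.get('pageToken')
    if page_token is None:
        break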


Solution

Sorry it took me a little while to get back to you.

I was able to identify a server-side bug; you would end up seeing this with the Java client as well as the Python client. We're planning to push a fix out this coming week. Your client should start behaving correctly as soon as that happens.

BTW, I'm not sure if you knew this already, but there's a standalone Python client that you can use to access the API from Python as well. It might be a bit more convenient for you than the client that's distributed as part of bq.py. You'll find a link to it on this page: https://developers.google.com/bigquery/client-libraries
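
As a rough sketch of what that looks like (the project ID is a placeholder and an already-authorized credentials object is assumed; see the client-libraries page above for auth setup):

# Minimal sketch using the standalone Google API client for Python.
from googleapiclient.discovery import build

PROJECT_ID = 'your-project-id'  # placeholder

bigquery = build('bigquery', 'v2', credentials=credentials)
response = bigquery.jobs().query(
    projectId=PROJECT_ID,
    body={'query': 'SELECT id FROM [publicdata:samples.wikipedia] LIMIT 100000'}
).execute()

# Follow pageToken through jobs().getQueryResults() until it disappears.
# (This sketch assumes the query completed synchronously; a robust client
# would also check the jobComplete flag.)
rows = response.get('rows', [])
page_token = response.get('pageToken')
while page_token:
    page = bigquery.jobs().getQueryResults(
        projectId=PROJECT_ID,
        jobId=response['jobReference']['jobId'],
        pageToken=page_token
    ).execute()
    rows.extend(page.get('rows', []))
    page_token = page.get('pageToken')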

OTHER TIPS

I can reproduce the behavior you're seeing with the bq command-line tool. That seems like a bug; I'll see what I can do to fix it.

One thing I did notice about the data you're querying: selecting only the id field and capping the number of rows at around 100,000 produces only about 1 MB of data, so the server would likely not paginate the results. Selecting a larger amount of data forces the server to paginate, since it cannot return all the results in a single response. If you did a SELECT * for 100,000 rows of samples.wikipedia, you'd get roughly 50 MB back, which should be enough to see some pagination happen (see the rough arithmetic below).
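
The back-of-the-envelope arithmetic behind those figures, with per-row byte counts that are illustrative guesses rather than measured values:

# Rough payload estimates (per-row sizes are guesses chosen to
# illustrate the point, not measurements).
rows = 100000
id_bytes_per_row = 10     # a single small id field
star_bytes_per_row = 500  # all of samples.wikipedia's columns

print('SELECT id : ~%.0f MB' % (rows * id_bytes_per_row / 1e6))    # ~1 MB
print('SELECT *  : ~%.0f MB' % (rows * star_bytes_per_row / 1e6))  # ~50 MB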

Are you seeing too few results come back from the Python client as well, or were you just surprised that no page_token was returned for your samples.wikipedia query?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow