Question

I am having a little problem with the answer given in "Python progress bar and downloads".

If the downloaded data is gzip-encoded, the Content-Length and the total length of the data collected in the for data in response.iter_content(): loop differ. The collected data is bigger, because requests automatically decompresses gzip-encoded responses.

So the bar gets longer and longer, and once it becomes too long for a single line, it starts flooding the terminal.

A working example of the problem (the site is the first one I found on Google that has both a Content-Length header and gzip encoding):

import requests,sys

def test(link):
    print("starting")
    response = requests.get(link, stream=True)
    total_length = response.headers.get('content-length')
    if total_length is None: # no content length header
        data = response.content
    else:
        dl = 0
        data = b""
        total_length = int(total_length)
        for byte in response.iter_content():
            dl += len(byte)
            data += (byte)
            done = int(50 * dl / total_length)
            sys.stdout.write("\r[%s%s]" % ('=' * done, ' ' * (50-done)))
            sys.stdout.flush()
    print("total data size: %s,  content length: %s" % (len(data),total_length))

test("http://www.pontikis.net/")

PS: I am on Linux, but it should affect other OSes too (except Windows, because \r doesn't work on it, IIRC).

I am using requests.Session for cookie (and gzip) handling, so a solution based on urllib or another module isn't what I am looking for.


Solution

Perhaps you should try disabling gzip compression or otherwise accounting for it.

The way to turn it off for requests (when using a session as you say you are):

import requests

s = requests.Session()
del s.headers['Accept-Encoding']

The header sent will now be Accept-Encoding: identity, and the server should not attempt to use gzip compression.

If instead you are downloading a file that is itself gzip-compressed, you should not run into this problem: you will receive a Content-Type such as application/x-gzip-compressed. If it is the website that compresses its responses, you will receive a Content-Type of text/html, for example, and a Content-Encoding of gzip.

If the server always serves compressed content then you're out of luck, but no server should do that.


If you would rather use the functional API of requests:

import requests

r = requests.get('url', headers={'Accept-Encoding': None})

Setting the header value to None via the functional API (or even in a call to session.get) removes that header from the request.
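To see why the bar overruns when compression is left on, here is a small standard-library sketch. It makes no network request: the page body is made up, gzip.compress stands in for what a gzip-enabled server would put on the wire, and bar_fill is a hypothetical helper that repeats the arithmetic from the question's loop.

```python
import gzip


def bar_fill(done_bytes, total_bytes, width=50):
    """Number of '=' characters the question's loop would draw."""
    return int(width * done_bytes / total_bytes)


# Made-up, highly repetitive page body (an assumption for illustration).
body = b"<p>hello world</p>\n" * 2000
# Stand-in for the bytes a gzip-enabled server would actually send.
wire = gzip.compress(body)

content_length = len(wire)   # Content-Length counts the compressed bytes
decoded_length = len(body)   # iter_content() yields the decompressed bytes

print("on the wire:", content_length, "after decoding:", decoded_length)
print("bar fill at the end:", bar_fill(decoded_length, content_length))
```

The last number comes out far larger than the 50-character width, which is the overflow described in the question. With compression disabled, both lengths are the same and the fill stops at 50.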

Other tips

You could replace...

dl += len(byte)

...with:

dl = response.raw.tell()

From the documentation:

tell(): Obtain the number of bytes pulled over the wire so far. May differ from the amount of content returned by HTTPResponse.read if bytes are encoded on the wire (e.g., compressed).
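Put together, the tell() suggestion might look like the sketch below. The names render_bar and download, the chunk_size default, and the out parameter are my own choices for illustration, not part of requests. The session argument is assumed to be a requests.Session (or anything with a compatible get method), and the bar is clamped to its width as a safety net.

```python
import sys


def render_bar(done_bytes, total_bytes, width=50):
    """Return a fixed-width bar such as '[=====     ]'."""
    # Clamp so the bar never grows past its width.
    filled = min(width, int(width * done_bytes / total_bytes))
    return "[%s%s]" % ("=" * filled, " " * (width - filled))


def download(session, url, chunk_size=8192, out=sys.stdout):
    """Fetch url through the given session, drawing a bar from wire bytes."""
    response = session.get(url, stream=True)
    total_length = int(response.headers.get("content-length") or 0)
    if total_length == 0:  # no usable Content-Length header, so no bar
        return response.content
    chunks = []
    for chunk in response.iter_content(chunk_size=chunk_size):
        chunks.append(chunk)
        # raw.tell() counts bytes read off the wire (still compressed),
        # which is the quantity Content-Length describes.
        out.write("\r" + render_bar(response.raw.tell(), total_length))
        out.flush()
    out.write("\n")
    return b"".join(chunks)
```

Usage would be along the lines of data = download(requests.Session(), link). The returned data is still the decompressed body; only the progress calculation uses the compressed byte count.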

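Here is a simple progress bar implemented with tqdm. It counts the lines in a gzip file first, so that the bar knows its total:

import gzip
from tqdm import tqdm

def _reader_generator(reader):
    # Yield 1 MiB blocks until the reader returns an empty result.
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

def raw_newline_count_gzip(fname):
    # Count newlines in the decompressed contents of the gzip file.
    with gzip.open(fname, 'rb') as f:
        return sum(buf.count(b'\n') for buf in _reader_generator(f.read))

# fname is the path to your gzip file
num = raw_newline_count_gzip(fname)
with gzip.open(fname, 'rb') as f, tqdm(total=num) as pbar:
    for line in f:
        # do whatever you want with each line
        pbar.update(1)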

The bar looks like: 35%|███▌ | 26288/74418 [00:05<00:09, 5089.45it/s]

Licensed under: CC-BY-SA with attribution