Question

I'm writing a web crawler in C++ and want to figure out the best approach to read the response of an HTTP request.

Currently I'm using a 1 MB buffer to hold the data received by read() (the count per read() call is 4 KB). That's the maximum size of web page I'd like to crawl. However, this is rather wasteful, so I'm also thinking about other approaches, as below:

  1. Send an HTTP HEAD request first and read the Content-Length from the response headers. Allocate a char array of that size, then send an HTTP GET to retrieve the content.
    Q1: What if the response headers from the server don't include Content-Length?
    Q2: This approach doubles the network traffic. Is it worth paying such overhead?

  2. Send the HTTP GET directly and use a smaller buffer (e.g. 16 KB). Rather than waiting until all data is received, process the data each time the buffer fills, then clear the buffer to receive the rest.
    Q1: This way the crawler may need several iterations to read a large web page completely. If the processing is time-consuming and multiple web pages are being read at the same time, could the data waiting in the network exceed the system buffer and cause packet loss?

Thanks.

Solution

Currently I'm using a 1 MB buffer to hold the data received by read() (the count per read() call is 4 KB). That's the maximum size of web page I'd like to crawl. However, this is rather wasteful

It certainly is. You won't read more than a couple of KB per read operation anyway, so a huge buffer is pointless.

Send an HTTP HEAD request first and read the Content-Length from the response headers. Allocate a char array of that size, then send an HTTP GET to retrieve the content.

That's another network operation. Also wasteful.

Q1: What if the response headers from the server don't include Content-Length?

Then you don't know the length in advance. A server isn't obliged to send Content-Length at all (a response using chunked transfer encoding, for example, doesn't have one), so you can't rely on it being there; see RFC 7230 for the details.
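
For illustration, here is a minimal sketch of that check, assuming the response headers have already been read into a string. The helper name is hypothetical, and case-insensitive matching is skipped for brevity:

    #include <cstddef>
    #include <cstdlib>
    #include <optional>
    #include <string>

    // Hypothetical helper: look for Content-Length in a block of response
    // headers. Returns nothing when the server omitted it, as it may do
    // with chunked transfer encoding.
    std::optional<std::size_t> content_length(const std::string& headers) {
        const std::string key = "Content-Length:";
        std::size_t pos = headers.find(key);
        if (pos == std::string::npos) return std::nullopt;
        // strtoul skips the leading space after the colon
        return std::strtoul(headers.c_str() + pos + key.size(), nullptr, 10);
    }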

Q2: This approach doubles the network traffic.

No, it doesn't. It doubles the number of request/response pairs, which is not the same thing: the HEAD exchange carries only headers, typically a few hundred bytes, while the body travels only once, with the GET.

Is it worth paying such overhead?

No.

Send the HTTP GET directly and use a smaller buffer (e.g. 16 KB).

Definitely.

Rather than waiting until all data is received

Why not? Why not process it as you receive it? That's the best approach of all. Smallest buffer, lowest latency.
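
A minimal sketch of that streaming approach, assuming a connected TCP socket and a hypothetical process_chunk() hook into the crawler's parser:

    #include <cstddef>
    #include <sys/socket.h>
    #include <sys/types.h>

    // Hypothetical hook: the crawler's parser, fed data incrementally.
    void process_chunk(const char* data, std::size_t len);

    // Read an HTTP response from a connected socket, handing each chunk
    // to the parser as soon as it arrives. Returns false on a socket error.
    bool read_response(int sock) {
        char buf[16 * 1024];              // a small, fixed buffer is enough
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n == 0) return true;      // end of stream: peer closed
            if (n < 0)  return false;     // error (inspect errno)
            process_chunk(buf, static_cast<std::size_t>(n));  // only n bytes are valid
        }
    }

The same 16 KB buffer handles a page of any size, because the data is consumed as fast as it arrives.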

process the data each time the buffer fills, then clear the buffer to receive the rest

You never need to clear a buffer: each read() tells you how many bytes are valid, and the next read simply overwrites them.

Q1: This way the crawler may need several iterations to read a large web page completely

You always need iterations to read a web page, or anything else, from a network. In blocking mode, recv() is only guaranteed to transfer at least one byte unless end of stream is reached or an error occurs. It isn't obliged to fill the buffer, and it can't unless your socket receive buffer is also 1 MB and you've wasted enough time between reads for it to fill up. If you program properly, that won't happen.
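
To make that concrete, here is a sketch of the loop you need whenever you want an exact number of bytes (the helper name is illustrative):

    #include <cstddef>
    #include <sys/socket.h>
    #include <sys/types.h>

    // Illustrative helper: loop until exactly `len` bytes have arrived,
    // since a single recv() may return fewer. Returns false if the stream
    // ends or an error occurs first.
    bool read_exact(int sock, char* buf, std::size_t len) {
        std::size_t got = 0;
        while (got < len) {
            ssize_t n = recv(sock, buf + got, len - got, 0);
            if (n <= 0) return false;     // 0 = end of stream, <0 = error
            got += static_cast<std::size_t>(n);
        }
        return true;
    }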

If the processing is time-consuming and multiple web pages are being read at the same time, could the data waiting in the network exceed the system buffer and cause packet loss?

Not with TCP. TCP flow control shrinks the advertised receive window as your socket buffer fills, so nothing is dropped; the sender just stalls and wastes time.
