Question

We are learning to use Jsoup and URLConnection, so we are downloading pages from a website of our choosing and parsing them to answer interesting questions.

Everything works well, but every now and then I get a SocketTimeoutException. I think this is because the website disconnects my program (or times me out, or throttles me, or something).

I have implemented a random sleep of 0-30 seconds every time a new page is downloaded. I think it helps, but the exception still happens. So now I catch the exception and sleep for 15 minutes before recursively trying again.

Is there a better way to handle this? Is this the reason I am getting the exception?

Also, would it help to change IP somehow every few minutes (and is that possible in Java)? Thanks

Solution

Everything works well, but every now and then I get a SocketTimeoutException. I think this is because the website disconnects my program (or times me out, or throttles me, or something).

Connection failures over HTTP are to be expected. That's the nature of the protocol. There can be many reasons for them: your network is unstable, their network is unstable, their firewall thinks you're attacking them and blocks you, or YOUR firewall thinks you're under attack and blocks the traffic.

I have implemented a random sleep of 0-30 seconds every time a new page is downloaded. I think it helps, but the exception still happens. So now I catch the exception and sleep for 15 minutes before recursively trying again.

I'd sleep every time I successfully get a page and every time there is an error, and then retry. I wouldn't wait that long, though (15 minutes?); I'd make it 1 minute at most for both.
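
As a rough illustration of the retry idea, here is a minimal sketch that uses a bounded loop instead of recursion, so a long run of failures cannot grow the call stack. The class name RetryingFetcher, the method fetchWithRetry, and its parameters are hypothetical names chosen for this example; the delay is random and capped at a limit you pass in (for instance 60,000 ms, following the 1 minute suggestion above).

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of Jsoup or the JDK.
final class RetryingFetcher {

    /**
     * Runs the given fetch up to maxAttempts times. After each IOException
     * (SocketTimeoutException is a subclass of IOException) it sleeps for a
     * random time between 0 and maxDelayMillis before trying again.
     */
    static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts, long maxDelayMillis)
            throws Exception {
        if (maxAttempts < 1 || maxDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and maxDelayMillis >= 0");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (IOException e) {
                last = e;
                // Only wait if another attempt will follow.
                if (attempt < maxAttempts) {
                    long delay = ThreadLocalRandom.current().nextLong(maxDelayMillis + 1);
                    Thread.sleep(delay);
                }
            }
        }
        // Every attempt failed: surface the most recent network error.
        throw last;
    }
}
```

With Jsoup on the classpath, a call might look like fetchWithRetry(() -> Jsoup.connect(url).timeout(10_000).get(), 5, 60_000), where url is whatever page you are downloading. The polite pause after each successful download would stay in your own crawl loop; this helper only covers the error path.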

Is there a better way to handle this? Is this the reason I am getting the exception?

As I said, you get the exception because of the network. There's nothing you can do to prevent it; this is normal network behavior.

Also, would it help to change IP somehow every few minutes (and is that possible in Java)?

It would help if the target website logs requests and blocks an IP address after n requests. Still, you can't change it the way you want from Java. The IP address belongs to the machine (not the program), and most of the time it is assigned by somebody else, not you.

You could make the HTTP requests through proxies, so that the proxies' IP addresses are what reach the target server (and you would switch to another proxy when one gets banned). However, this will make your connection even more unstable, since you are adding one more layer to the "transaction".
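
To make the proxy idea concrete, here is a minimal sketch using only the JDK's java.net.Proxy and HttpURLConnection. The class ProxyRotator and its methods are hypothetical names for this example, and the proxy hosts you pass in are assumptions: you would need real proxies you are allowed to use. It simply hands out proxies in round-robin order; deciding when a proxy is "banned" is left to the caller.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; rotates through a fixed proxy list.
final class ProxyRotator {
    private final List<Proxy> proxies;
    private int next = 0;

    ProxyRotator(List<Proxy> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = new ArrayList<>(proxies);
    }

    // Builds an HTTP proxy entry without resolving the host name up front.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Returns the proxies one after another, wrapping around at the end.
    synchronized Proxy nextProxy() {
        Proxy p = proxies.get(next);
        next = (next + 1) % proxies.size();
        return p;
    }

    // Prepares (but does not yet open) a connection routed through the next proxy.
    HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection(nextProxy());
        conn.setConnectTimeout(10_000); // fail fast if the proxy is unreachable
        conn.setReadTimeout(30_000);    // a stalled read then raises SocketTimeoutException
        return conn;
    }
}
```

Jsoup's Connection also has a proxy(host, port) method if you prefer to stay within Jsoup. Either way, expect more timeouts rather than fewer, for the reason given above.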

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow