Can you programmatically connect to a sequence of web-pages and parse the source HTML without imposing stress on the system or raising red flags?

StackOverflow https://stackoverflow.com/questions/11008673

Question

I am working on a project in NLP requiring me to download quite a few video game reviews --- about 10,000 per website. So, I am going to write a program that goes to each URL and pulls out the review part of each page as well as some additional metadata.

I'm using Java and was planning on just opening an HttpURLConnection and reading the text through an input stream. Then, closing the connection and opening the next one.

My questions are these:

1) Let's assume this is a site with medium-to-small amounts of traffic: normally, they receive about 1000 requests per second from normal users. Is it possible that my program would cause undue stress to their system, impacting the user experience for others?

2) Could these connections made one right after another appear as some kind of malicious attack?

Am I being paranoid, or is this an issue? Is there a better way to go about getting this data? I am going to several websites so working individually with site administrators is inconvenient and probably impossible.


Solution

If you mimic a web browser and fetch pages at human speed (a person normally takes several seconds to click through to the next page, even without reading the text), then the server can't really tell what kind of client is connecting.

In other words, just throttle your slurping to one page every few seconds, and you should have no problems.
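The throttling described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name `ThrottledFetcher`, the 3-second base delay, the 2-second random jitter, the timeouts, and the User-Agent string are all hypothetical choices, not values from the answer, and the body is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Random;

final class ThrottledFetcher {
    // Assumed values for illustration; tune them for each site.
    static final long BASE_DELAY_MS = 3_000;
    static final int JITTER_MS = 2_000;
    // Hypothetical User-Agent string, not a real product.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ReviewResearchBot/0.1)";

    /** Returns a pause in [baseMs, baseMs + jitterMs] so requests are not perfectly periodic. */
    static long nextDelayMillis(long baseMs, int jitterMs, Random rng) {
        return baseMs + rng.nextInt(jitterMs + 1);
    }

    /** Downloads one page and returns its body, assuming UTF-8 encoding. */
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        Random rng = new Random();
        // Each command-line argument is treated as a URL to fetch.
        for (String pageUrl : args) {
            String html = fetch(pageUrl);
            System.out.println(pageUrl + " -> " + html.length() + " chars");
            // Pause for a few seconds before requesting the next page.
            Thread.sleep(nextDelayMillis(BASE_DELAY_MS, JITTER_MS, rng));
        }
    }
}
```

With these assumed settings, 10,000 pages at roughly one page every four seconds would take about eleven hours per site, so the run is slow for you but negligible next to a site that already serves about 1000 requests per second.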

The other concern you ought to have is legality. I assume these reviews are material that you didn't write and have no permission to create derivative works from. If you are just slurping them for personal use, then it's OK. If you are slurping them to create something (a derivative work), then you are infringing copyright.

OTHER TIPS

I believe you are misunderstanding how HTTP requests work. You ask for a page and you get it. The fact that you're reading a stream one line at a time has no bearing on the HTTP request, and the site is perfectly happy to give you one page at a time. It won't look malicious (because it's just one user reading pages, which is totally normal behavior). You're 100% OK to proceed with your plan (if it is as you described it).
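To illustrate the point about streams, here is a minimal Java sketch of reading a response body line by line. With HttpURLConnection the request is sent once, when the response stream is obtained; the read loop only consumes bytes the server has already been asked for. The class name `BodyReader` is hypothetical, the UTF-8 encoding is an assumption, and a `ByteArrayInputStream` stands in for `conn.getInputStream()` so the example runs without a network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BodyReader {
    /** Reads a response body line by line; assumes UTF-8, which a real page may not use. */
    static String readBody(InputStream in) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // Each readLine() consumes part of the one response; it sends nothing to the server.
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream returned by HttpURLConnection.getInputStream().
        InputStream fake = new ByteArrayInputStream(
                "<html>\n<p>review text</p>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readBody(fake));
    }
}
```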

Licensed under: CC-BY-SA with attribution