I'm not talking about extracting text or downloading a single web page, but I see people downloading whole websites. For example, there is a directory called "example" that isn't even linked anywhere on the site, so how do they know it's there? How do I download ALL pages of a website, and how do I protect against that?

For example, Apache has "directory listing"; how do I get a list of the directories under the root if there is already an index file?

This question is not language-specific. I would be happy with just a link that explains the techniques involved, or with a detailed answer.


Solution

OK, so to answer your questions one by one: how do you know that a 'hidden' (unlinked) directory is on the site? Well, you don't, but you can check the most common directory names and see whether they return HTTP 200 or 404. With a couple of threads you can check thousands per minute. That said, you should always consider the number of requests you are making relative to the specific website and the amount of traffic it handles, because for small to mid-sized websites this could cause connectivity issues or even a short DoS, which of course is undesirable.

You can also use search engines to find unlinked content; it may have been discovered by the search engine by accident, there might have been a link to it from another site, etc. (for instance, google site:targetsite.com will list all the indexed pages).

How you download all pages of a website has already been answered: essentially you go to the base link, parse the HTML for links, images and other content that points to on-site content, and follow them. You then deconstruct links into their directories and check for indexes, and you can also brute-force common directory and file names.
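A minimal sketch of that status-code check, assuming a hypothetical wordlist and a placeholder target (https://example.com); requests and a small thread pool stand in for whatever HTTP client you prefer:

```python
import concurrent.futures
import requests

TARGET = "https://example.com"  # assumed target, replace with the real site
WORDLIST = ["admin", "backup", "example", "old", "test"]  # hypothetical common names

def probe(name):
    """Return (name, status) for a candidate directory."""
    resp = requests.get(f"{TARGET}/{name}/", allow_redirects=False, timeout=5)
    return name, resp.status_code

# a handful of threads is enough; keep the rate low so you don't DoS small sites
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    for name, status in pool.map(probe, WORDLIST):
        if status != 404:
            print(f"possible hit: /{name}/ -> {status}")
```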

Well, you really can't protect effectively against bots unless you limit the user experience. For instance, you could limit the number of requests per minute, but if you have an AJAX site a normal user will also produce a large number of requests, so that really isn't the way to go. You can check the user agent and whitelist only 'regular' browsers, however most scraping scripts identify themselves as regular browsers, so that won't help you much either. Lastly, you can blacklist IPs, but that is not very effective: there are plenty of proxies, onion routing, and other ways to change your IP.
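To illustrate the per-minute limit mentioned above (and why it is a blunt instrument), here is a minimal sliding-window counter sketch; the window size and threshold are assumptions:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumed window
MAX_REQUESTS = 120    # assumed threshold; an AJAX-heavy page can exceed this legitimately

_hits = defaultdict(deque)

def allow_request(ip: str) -> bool:
    """Return False if this IP exceeded MAX_REQUESTS in the last WINDOW_SECONDS."""
    now = time.time()
    window = _hits[ip]
    window.append(now)
    # drop timestamps that fell out of the window
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    return len(window) <= MAX_REQUESTS
```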

You will get a directory listing only if a) it is not forbidden in the server config and b) there is no default index file (on Apache, index.html or index.php by default).

In practical terms it is a good idea not to make things easier for the scraper, so make sure your website's search function is properly sanitized, etc. (it shouldn't return all records on an empty query, and it should filter the % sign if you are using MySQL's LIKE syntax...). And of course use a CAPTCHA if appropriate, but it must be properly implemented, not a simple "what is 2 + 2" or a couple of letters in a common font on a plain background.
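A small sketch of the LIKE-wildcard point, assuming a MySQL-style driver with parameterized queries (the cursor and the products table are hypothetical):

```python
def escape_like(term: str) -> str:
    """Escape LIKE wildcards so user input can't match everything."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def search_products(cursor, term: str):
    term = term.strip()
    if not term:                      # refuse empty queries instead of dumping all rows
        return []
    pattern = f"%{escape_like(term)}%"
    cursor.execute("SELECT name FROM products WHERE name LIKE %s", (pattern,))
    return cursor.fetchall()
```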

Another protection against scraping might be using referer checks to allow access to certain parts of the website; however, it is better to simply forbid access, server side, to any parts of the website you don't want public (using .htaccess, for example).
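If you do go the referer route, the check itself is trivial (and trivially spoofed); a sketch, with the allowed hosts being an assumption:

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}   # assumed hosts of your own site

def referer_allowed(headers: dict) -> bool:
    """Allow the request only if the Referer points back to our own site."""
    referer = headers.get("Referer", "")
    host = urlparse(referer).hostname
    return host in ALLOWED_HOSTS
```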

Lastly, from my experience scrapers usually have only basic JS parsing capabilities, so implementing some kind of check in JavaScript could work; however, you would also be excluding all visitors who have JS switched off (or who use NoScript or a similar browser plugin) or who use an outdated browser.
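One common form of such a check is to let JavaScript set a cookie and serve the real content only when that cookie comes back; a minimal sketch, using Flask as an assumed framework and a hypothetical /articles route:

```python
from flask import Flask, request

app = Flask(__name__)

# tiny page whose only job is to set a cookie via JS and reload
CHALLENGE = """<html><body>
<script>document.cookie = "js_ok=1; path=/"; location.reload();</script>
<noscript>This page requires JavaScript.</noscript>
</body></html>"""

@app.route("/articles")
def articles():
    if request.cookies.get("js_ok") != "1":
        return CHALLENGE              # clients that don't run JS never get past this
    return "the actual content"       # placeholder for the protected page
```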

Other tips

To fully "download" a site you need a web crawler that, in addition to following the URLs, also saves their content. The application should be able to:

  • Parse the "root" url
  • Identify all the links to other pages in the same domain
  • Access and download those pages, and all the pages linked from those child pages
  • Remember which links have already been parsed, in order to avoid loops

A search for "web crawler" should provide you with plenty of examples.
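A bare-bones sketch of such a crawler, using only the standard library; example.com is an assumed start URL and the link extraction is deliberately simplistic:

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START = "https://example.com/"        # assumed root URL
DOMAIN = urlparse(START).netloc

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start):
    seen = set()                      # remember parsed pages to avoid loops
    queue = [start]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        # here you would also save `html` to disk
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == DOMAIN:   # stay on the same domain
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    print(f"crawled {len(crawl(START))} pages")
```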

I don't know of countermeasures you could adopt to avoid this: in most cases you WANT bots to crawl your website, since that is how search engines learn about it.

I suppose you could look at your traffic logs, and if you identify (by IP address) some repeat offenders you could blacklist them, preventing access to the server.
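A quick sketch of that log check, assuming an Apache-style access log where the client IP is the first field (the path and the "offender" threshold are assumptions):

```python
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"   # assumed log location
THRESHOLD = 10_000                         # assumed "repeat offender" cutoff

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        ip = line.split(" ", 1)[0]         # first field of the combined log format
        counts[ip] += 1

for ip, hits in counts.most_common(20):
    flag = "  <-- candidate for blacklist" if hits > THRESHOLD else ""
    print(f"{ip}: {hits}{flag}")
```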
