I'm not talking about extracting text or downloading a single web page, but I see people downloading whole websites. For example, there is a directory called "example" that isn't even linked anywhere on the site, so how do they know it's there? How do I download ALL pages of a website, and how do I protect against that?

For example, Apache has "directory listing"; how do I get a list of the directories under the root if there is already an index file?

This question is not language-specific. I would be happy with just a link that explains the techniques involved, or with a detailed answer.


Solution

OK, so to answer your questions one by one: how do you know that a 'hidden' (unlinked) directory is on the site? Well, you don't, but you can check the most common directory names and see whether they return HTTP 200 or 404. With a couple of threads you can check thousands per minute. That said, you should always consider the number of requests you are making relative to the specific website and the amount of traffic it handles, because for small to mid-sized websites this could cause connectivity issues or even a short DoS, which of course is undesirable.

You can also use search engines to find unlinked content; it may have been discovered by the search engine by accident, there might have been a link to it from another site, etc. (for instance, google site:targetsite.com will list all the indexed pages).

How you download all pages of a website has already been answered: essentially you go to the base link, parse the HTML for links, images and other content that points to on-site content, and follow them. You then deconstruct links into their directories and check for indexes, and you can also brute-force common directory and file names.
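A minimal sketch of that status-code check, assuming a hypothetical wordlist and a placeholder target (https://example.com); requests and a small thread pool stand in for whatever HTTP client you prefer:

```python
import concurrent.futures
import requests

TARGET = "https://example.com"  # assumed target, replace with the real site
WORDLIST = ["admin", "backup", "example", "old", "test"]  # hypothetical common names

def probe(name):
    """Return (name, status) for a candidate directory."""
    resp = requests.get(f"{TARGET}/{name}/", allow_redirects=False, timeout=5)
    return name, resp.status_code

# a handful of threads is enough; keep the rate low so you don't DoS small sites
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    for name, status in pool.map(probe, WORDLIST):
        if status != 404:
            print(f"possible hit: /{name}/ -> {status}")
```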

Well, you really can't protect effectively against bots unless you limit the user experience. For instance, you could limit the number of requests per minute, but if you have an AJAX site a normal user will also produce a large number of requests, so that really isn't the way to go. You can check the user agent and whitelist only 'regular' browsers, however most scraping scripts identify themselves as regular browsers, so that won't help you much either. Lastly, you can blacklist IPs, but that is not very effective: there are plenty of proxies, onion routing, and other ways to change your IP.
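To illustrate the per-minute limit mentioned above (and why it is a blunt instrument), here is a minimal sliding-window counter sketch; the window size and threshold are assumptions:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # assumed window
MAX_REQUESTS = 120    # assumed threshold; an AJAX-heavy page can exceed this legitimately

_hits = defaultdict(deque)

def allow_request(ip: str) -> bool:
    """Return False if this IP exceeded MAX_REQUESTS in the last WINDOW_SECONDS."""
    now = time.time()
    window = _hits[ip]
    window.append(now)
    # drop timestamps that fell out of the window
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    return len(window) <= MAX_REQUESTS
```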

You will get a directory listing only if a) it is not forbidden in the server config and b) there is no default index file (on Apache, index.html or index.php by default).

In practical terms it is a good idea not to make things easier for the scraper, so make sure your website's search function is properly sanitized, etc. (it shouldn't return all records on an empty query, and it should filter the % sign if you are using MySQL's LIKE syntax...). And of course use a CAPTCHA if appropriate, but it must be properly implemented, not a simple "what is 2 + 2" or a couple of letters in a common font on a plain background.
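A small sketch of the LIKE-wildcard point, assuming a MySQL-style driver with parameterized queries (the cursor and the products table are hypothetical):

```python
def escape_like(term: str) -> str:
    """Escape LIKE wildcards so user input can't match everything."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def search_products(cursor, term: str):
    term = term.strip()
    if not term:                      # refuse empty queries instead of dumping all rows
        return []
    pattern = f"%{escape_like(term)}%"
    cursor.execute("SELECT name FROM products WHERE name LIKE %s", (pattern,))
    return cursor.fetchall()
```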

Another protection against scraping might be using referer checks to allow access to certain parts of the website; however, it is better to simply forbid access, server side, to any parts of the website you don't want public (using .htaccess, for example).
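If you do go the referer route, the check itself is trivial (and trivially spoofed); a sketch, with the allowed hosts being an assumption:

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}   # assumed hosts of your own site

def referer_allowed(headers: dict) -> bool:
    """Allow the request only if the Referer points back to our own site."""
    referer = headers.get("Referer", "")
    host = urlparse(referer).hostname
    return host in ALLOWED_HOSTS
```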

Lastly, from my experience scrapers usually have only basic JS parsing capabilities, so implementing some kind of check in JavaScript could work; however, you would also be excluding all visitors who have JS switched off (or who use NoScript or a similar browser plugin) or who use an outdated browser.
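One common form of such a check is to let JavaScript set a cookie and serve the real content only when that cookie comes back; a minimal sketch, using Flask as an assumed framework and a hypothetical /articles route:

```python
from flask import Flask, request

app = Flask(__name__)

# tiny page whose only job is to set a cookie via JS and reload
CHALLENGE = """<html><body>
<script>document.cookie = "js_ok=1; path=/"; location.reload();</script>
<noscript>This page requires JavaScript.</noscript>
</body></html>"""

@app.route("/articles")
def articles():
    if request.cookies.get("js_ok") != "1":
        return CHALLENGE              # clients that don't run JS never get past this
    return "the actual content"       # placeholder for the protected page
```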

Other tips

To fully "download" a site you need a web crawler that, in addition to following the URLs, also saves their content. The application should be able to:

  • Parse the "root" url
  • Identify all the links to other pages in the same domain
  • Access and download those pages, and all the pages linked from those child pages
  • Remember which links have already been parsed, in order to avoid loops

A search for "web crawler" should provide you with plenty of examples.
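A bare-bones sketch of such a crawler, using only the standard library; example.com is an assumed start URL and the link extraction is deliberately simplistic:

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START = "https://example.com/"        # assumed root URL
DOMAIN = urlparse(START).netloc

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start):
    seen = set()                      # remember parsed pages to avoid loops
    queue = [start]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        # here you would also save `html` to disk
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == DOMAIN:   # stay on the same domain
                queue.append(absolute)
    return seen

if __name__ == "__main__":
    print(f"crawled {len(crawl(START))} pages")
```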

I don't know of countermeasures you could adopt to avoid this: in most cases you WANT bots to crawl your website, since that is how search engines learn about it.

I suppose you could look at your traffic logs, and if you identify (by IP address) some repeat offenders you could blacklist them, preventing access to the server.
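A quick sketch of that log check, assuming an Apache-style access log where the client IP is the first field (the path and the "offender" threshold are assumptions):

```python
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"   # assumed log location
THRESHOLD = 10_000                         # assumed "repeat offender" cutoff

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        ip = line.split(" ", 1)[0]         # first field of the combined log format
        counts[ip] += 1

for ip, hits in counts.most_common(20):
    flag = "  <-- candidate for blacklist" if hits > THRESHOLD else ""
    print(f"{ip}: {hits}{flag}")
```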
