Question

I am building a web crawler that fetches 1-3 pages from each domain in a list of millions of domains. I am using Python with multithreading, and I have tried httplib, httplib2, urllib, urllib2, urllib3, requests, and curl (the fastest of the bunch), as well as Twisted and Scrapy, but none of them let me use more than about 10 Mbit of bandwidth (I have a 60 Mbit connection). Throughput usually maxes out at around 100-300 threads, and beyond that I start getting failed requests. I have had the same problem with PHP/cURL. I also have a scraper that pulls from Google Plus pages using urllib3 and the threading module, and that one maxes out my 100 Mbit connection (I believe this is because it reuses an open socket to the same host, and Google's network response is fast).

Here is an example of one of my scripts, using pycurl. It reads the URLs from a CSV file.

import pycurl
import csv
import cStringIO
from threading import Thread
from Queue import Queue


def get(readq, writeq):
    # Worker: pull URLs off the shared queue and fetch them with pycurl.
    while True:
        url = readq.get()

        buf = cStringIO.StringIO()  # fresh buffer per request so responses don't pile up
        c = pycurl.Curl()
        c.setopt(pycurl.NOSIGNAL, 1)  # needed for timeouts when using libcurl from threads
        c.setopt(pycurl.TIMEOUT, 15)
        c.setopt(pycurl.FOLLOWLOCATION, 1)
        c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0')
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.setopt(pycurl.URL, url)
        try:
            c.perform()
            writeq.put(url + '  ' + str(c.getinfo(pycurl.HTTP_CODE)))
        except pycurl.error:
            writeq.put('error  ' + url)
        finally:
            c.close()
            buf.close()
readq = Queue()
writeq = Queue()

reader = csv.reader(open('alldataunq2.csv'))
ct = 0
for l in reader:
    if l[3] != '':                      # column 4 holds the domain
        readq.put('http://' + l[3])
        ct += 1
        if ct > 100000:                 # cap the run at 100k URLs
            break

for i in range(100):
    t = Thread(target=get, args=(readq, writeq))
    t.daemon = True  # workers loop forever; daemon threads let the process exit
    t.start()

while True:
    print(writeq.get())

The bottleneck is definitely network I/O, as my processor and memory are barely being used. Has anyone had success writing a similar scraper that was able to use a full 100 Mbit connection or more?

Any input on how I can increase the speed of my scraping code is greatly appreciated.

No correct solution

OTHER TIPS

There are several factors you need to keep in mind when optimizing crawling speed.

Connection locality

In order to re-use connections effectively, you need to make sure that you're reusing connections for the same website. If you wait too long to hit an earlier host a second time, the connection could time out and that's no good. Opening new sockets is a relatively expensive operation so you want to avoid it at all costs. A naive heuristic to achieve this is to sort your download targets by host and download one host at a time, but then you run into the next problem...
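As a rough sketch of the connection-reuse idea (not code from the answer; the pool sizes and timeout are arbitrary assumptions), urllib3's PoolManager keeps a keep-alive pool per host, so repeated hits on the same site reuse an open socket:

import urllib3

# One keep-alive pool per host; repeat requests to a host reuse its open socket
# instead of paying for a new TCP (and possibly TLS) handshake each time.
http = urllib3.PoolManager(num_pools=200, maxsize=2, timeout=15.0)

def fetch(url):
    return http.request('GET', url, retries=1)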

Spreading the load between hosts

Not all hosts have fat pipes, so you'll want to hit multiple hosts simultaneously—this also helps avoiding spamming a single host too much. A good strategy here is to have multiple workers, where each worker focuses on one host at a time. This way you can control the rate of downloads per host within the context of each worker, and each worker will maintain its own connection pool to reuse connections from.
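A minimal sketch of that worker layout (my own illustration, not from the answer; the worker count, delay, and the example URL list are assumptions): group the URLs by host, and let each worker take one host at a time with its own keep-alive pool and a polite delay between requests:

import time
import urllib3
from Queue import Queue
from threading import Thread
from urlparse import urlparse          # urllib.parse on Python 3
from collections import defaultdict

url_list = ['http://example.com/', 'http://example.org/a']   # stand-in for your own URL source
host_queue = Queue()                   # each item: (host, [paths for that host])

def host_worker():
    while True:
        host, paths = host_queue.get()
        pool = urllib3.HTTPConnectionPool(host, maxsize=1, timeout=15.0)
        for path in paths:
            try:
                r = pool.request('GET', path, retries=1)
                print(host + path + '  ' + str(r.status))
            except Exception:
                print('error  ' + host + path)
            time.sleep(0.5)            # per-host rate limit lives inside the worker
        pool.close()

by_host = defaultdict(list)
for url in url_list:
    p = urlparse(url)
    by_host[p.netloc].append(p.path or '/')
for item in by_host.items():
    host_queue.put(item)

for _ in range(100):                   # a fixed worker count spreads load across hosts
    Thread(target=host_worker).start()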

Worker specialization

One way to ruin your throughput is to mix your data processing routines (parse the HTML, extract links, whatever) with the fetching routines. A good strategy here is to do the minimal amount of processing work in the fetching workers, and simply save the data for a separate set of workers to pick up later and process (maybe on another machine, even).
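For instance (a sketch with made-up worker counts and a toy "parser", just to show the split), the fetch threads only download and enqueue the raw body, and a separate, smaller set of threads does the processing:

import urllib2
from Queue import Queue
from threading import Thread

url_queue = Queue()                    # URLs waiting to be fetched
raw_pages = Queue(maxsize=1000)        # fetched-but-unprocessed bodies

def fetch_worker():
    # Network-bound: download and hand off immediately, no parsing here.
    while True:
        url = url_queue.get()
        try:
            body = urllib2.urlopen(url, timeout=15).read()
            raw_pages.put((url, body))
        except Exception:
            pass

def parse_worker():
    # CPU-bound: all HTML/link extraction happens here (or on another machine
    # reading from a shared store), so it never stalls the fetchers.
    while True:
        url, body = raw_pages.get()
        links = [part for part in body.split('"') if part.startswith('http')]
        # ... persist `links` for later crawling ...

for _ in range(100):
    Thread(target=fetch_worker).start()
for _ in range(4):
    Thread(target=parse_worker).start()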

Keeping these things in mind, you should be able to squeeze more out of your connection. An unrelated suggestion: consider using wget; you'd be surprised how effective it is at simple crawls (it can even read from a giant manifest file).
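For example (the file name and directory are just placeholders), wget can crawl straight from a URL list:

wget --input-file=urls.txt --tries=1 --timeout=15 --directory-prefix=pages/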

I don't think you can expect to get anywhere near your internet connection's max throughput when doing web scraping.

Scraping (and web browsing in general) involves making a lot of small requests. A good deal of that time is spent in connection setup and teardown, and in waiting for the remote end to begin delivering your content. I'd guess the time spent actively downloading content is probably around 50%. If you were downloading a bunch of big files, you'd see better average throughput.

Try scrapy with scrapy-redis.

You will have to tune the settings: CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP. Also make sure you have DOWNLOAD_DELAY = 0 and AUTOTHROTTLE_ENABLED = False.
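A sketch of what that might look like in settings.py (the scheduler and Redis lines are the usual scrapy-redis hookup; the concurrency numbers are arbitrary starting points, not values from the answer):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"   # let scrapy-redis schedule requests
REDIS_URL = "redis://localhost:6379"             # assumed local Redis instance
CONCURRENT_REQUESTS = 500
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 0                   # 0 = the per-domain limit applies instead
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_ENABLED = False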

Licensed under: CC-BY-SA with attribution