Question

Is there a way to spoof a web request from C# code so it doesn't look like a bot or spam hitting the site? I am trying to web scrape my website, but keep getting blocked after a certain amount of calls. I want to act like a real browser. I am using this code, from HTML Agility Pack.

var web = new HtmlWeb();
web.UserAgent =
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";

Solution 2

Use a regular browser and Fiddler (if the developer tools are not up to scratch) and take a look at the request and response headers.

Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference).

In regards to "getting blocked after a certain amount of calls" - throttle your calls. Only make one call every x seconds. Behave nicely to the site and it will behave nicely to you.

Chances are good that they simply look at the number of calls from your IP address per second and if it passes a threshold, the IP address gets blocked.
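For example, here is a minimal throttling sketch using the HtmlWeb from the question; the urlsToScrape list and the 2-second pause are placeholder assumptions, so tune them to whatever the site tolerates:

// Requires: using System; using System.Threading; using HtmlAgilityPack;
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";

foreach (var url in urlsToScrape) // urlsToScrape is an assumed IEnumerable<string>
{
    var doc = web.Load(url);
    // ... parse doc here ...

    // Space the calls out so the per-IP request rate stays under the site's threshold.
    Thread.Sleep(TimeSpan.FromSeconds(2));
}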

Other tips

I do way too much web scraping, so here are the options. First, I have a default list of headers I add, as all of these are expected from a browser:

        wc.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
        wc.Headers[HttpRequestHeader.ContentType] = "application/x-www-form-urlencoded";
        wc.Headers[HttpRequestHeader.Accept] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip,deflate,sdch";
        wc.Headers[HttpRequestHeader.AcceptLanguage] = "en-GB,en-US;q=0.8,en;q=0.6";
        wc.Headers[HttpRequestHeader.AcceptCharset] = "ISO-8859-1,utf-8;q=0.7,*;q=0.3";

(wc is my WebClient.)
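If you stay with HTML Agility Pack's HtmlWeb (as in the question) instead of a WebClient, roughly the same headers can be attached through its PreRequest hook. This is just a sketch of that idea, not code from the answer:

// Requires: using System.Net; using HtmlAgilityPack;
var web = new HtmlWeb();
web.PreRequest = request =>
{
    // Restricted headers (User-Agent, Accept, ...) are set via properties on HttpWebRequest.
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11";
    request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    request.Headers["Accept-Language"] = "en-GB,en-US;q=0.8,en;q=0.6";
    // Let the framework negotiate and decode gzip/deflate rather than hand-rolling Accept-Encoding.
    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    return true; // returning true lets the request go ahead
};
var doc = web.Load("https://example.com/"); // placeholder URL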

To help further, here is my WebClient subclass that stores cookies, which also makes a massive difference:

public class CookieWebClient : WebClient
{
    public CookieContainer m_container = new CookieContainer();
    public WebProxy proxy = null;

    protected override WebRequest GetWebRequest(Uri address)
    {
        try
        {
            ServicePointManager.DefaultConnectionLimit = 1000000;
            WebRequest request = base.GetWebRequest(address);
            request.Proxy = proxy;

            // Only HTTP(S) requests expose pipelining, keep-alive and cookies,
            // so check the cast before touching those members.
            HttpWebRequest webRequest = request as HttpWebRequest;
            if (webRequest != null)
            {
                webRequest.Pipelined = true;
                webRequest.KeepAlive = true;
                webRequest.CookieContainer = m_container;
            }

            return request;
        }
        catch
        {
            return null;
        }
    }
}

Here is how I usually use it. Add a static copy to the base site class that holds all of your parsing functions:

    protected static CookieWebClient wc = new CookieWebClient();

And call it as such:

public HtmlDocument Download(string url)
{
    HtmlDocument hdoc = new HtmlDocument();

    // Adjust the parser's element flags so <option> and <select> contents are kept.
    HtmlNode.ElementsFlags.Remove("option");
    HtmlNode.ElementsFlags.Remove("select");

    Stream read = null;
    try
    {
        read = wc.OpenRead(url);
    }
    catch (ArgumentException)
    {
        // Retry with the URL encoded if the raw string is rejected.
        read = wc.OpenRead(HttpHelper.HTTPEncode(url));
    }

    hdoc.Load(read, true);
    return hdoc;
}
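A quick usage sketch from inside the same site class; the URL and the XPath are placeholders, assuming the page has a title element:

var doc = Download("https://example.com/somepage");
var title = doc.DocumentNode.SelectSingleNode("//title");
Console.WriteLine(title != null ? title.InnerText : "no title found");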

The other main reason you may be crashing out is that the server closes the connection because it has been open for too long. You can prove this by adding a try/catch around the download part, as above, and if it fails, resetting the WebClient and trying the download again:

HtmlDocument d = new HtmlDocument();
try
{
    d = this.Download(prp.PropertyUrl);
}
catch (WebException e)
{
    this.Msg(Site.ErrorSeverity.Severe, "Error connecting to " + this.URL + " : Resubmitting..");
    wc = new CookieWebClient();
    d = this.Download(prp.PropertyUrl);
}

This saves my ass all the time. Even if it was the server rejecting you, this can re-jig the lot: cookies are cleared and you're free to roam again. If worst truly comes to worst, add proxy support and apply a new proxy every 50 or so requests.
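A rough sketch of that last idea, rotating the proxy on the CookieWebClient above; the proxyAddresses list is a hypothetical placeholder you would fill with your own proxies:

// Requires: using System.Net;
private static readonly string[] proxyAddresses = { /* "http://proxy1.example:8080", ... */ };
private static int requestCount = 0;

// Call this before each Download(); every 50 requests it swaps in a fresh
// CookieWebClient (clearing cookies) pointed at the next proxy in the list.
private static void RotateProxyIfNeeded()
{
    requestCount++;
    if (proxyAddresses.Length > 0 && requestCount % 50 == 0)
    {
        var next = proxyAddresses[(requestCount / 50) % proxyAddresses.Length];
        wc = new CookieWebClient { proxy = new WebProxy(next) };
    }
}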

That should be more than enough for you to kick your own and any other site's arse.


Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow