Question

I am trying to use WebClient to download a file from the web in a WinForms application. However, I only want to download HTML files; any other type should be ignored.

I checked WebResponse.ContentType, but its value is always null.

Does anyone have any idea what could be the cause?

Solution

Given your update, you can do this by changing the .Method in GetWebRequest:

using System;
using System.Net;
static class Program
{
    static void Main()
    {
        using (MyClient client = new MyClient())
        {
            client.HeadOnly = true;
            string uri = "http://www.google.com";
            byte[] body = client.DownloadData(uri); // note should be 0-length
            string type = client.ResponseHeaders["content-type"];
            client.HeadOnly = false;
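            // check 'tis not binary... we'll use text/, but could
            // check for text/html; the header may be missing, so guard against null
            if (type != null && type.StartsWith("text/", StringComparison.OrdinalIgnoreCase))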
            {
                string text = client.DownloadString(uri);
                Console.WriteLine(text);
            }
        }
    }

}

class MyClient : WebClient
{
    public bool HeadOnly { get; set; }
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest req = base.GetWebRequest(address);
        if (HeadOnly && req.Method == "GET")
        {
            req.Method = "HEAD";
        }
        return req;
    }
}

Alternatively, you can check the header when overriding GetWebResponse(), perhaps throwing an exception if it isn't what you wanted:

protected override WebResponse GetWebResponse(WebRequest request)
{
    WebResponse resp = base.GetWebResponse(request);
    string type = resp.Headers["content-type"];
    // do something with type
    return resp;
}

OTHER TIPS

I'm not sure of the cause, but perhaps you hadn't downloaded anything yet. This is the lazy way to get the content type of a remote file/page (I haven't checked whether this is efficient on the wire; for all I know, it may download huge chunks of content):

        Stream connection = null;   // assigned only if OpenRead succeeds
        string contentType = null;  // stays null if the request fails
        using (WebClient wc = new WebClient())
        {
            try
            {
                connection = wc.OpenRead(current.Url);
                contentType = wc.ResponseHeaders["content-type"];
            }
            catch (Exception)
            {
                // 404 or what have you
            }
            finally
            {
                if (connection != null) connection.Close();
            }
        }

WebResponse is an abstract class, and the ContentType property is implemented in inheriting classes. For instance, HttpWebResponse overrides this property to return the Content-Type header. I'm not sure which WebResponse implementation WebClient uses. If you ONLY want HTML files, your best bet is to use the HttpWebRequest object directly.

You could issue the first request with the HEAD verb and check the Content-Type response header. [edit] It looks like you'll have to use HttpWebRequest for this, though.
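
As a rough illustration of the HEAD idea, here is a minimal sketch that uses HttpWebRequest directly. The class and method names (ContentTypeProbe, GetContentType, IsHtml) are made up for this example, and it assumes the URL uses http or https, since the cast to HttpWebRequest fails for other schemes:

```csharp
using System;
using System.Net;

static class ContentTypeProbe
{
    // Hypothetical helper: sends a HEAD request and returns the Content-Type
    // reported by the server, or null if the request fails.
    public static string GetContentType(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD"; // ask for headers only, no body

        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                return response.ContentType; // may include a charset parameter
            }
        }
        catch (WebException)
        {
            return null; // 404, DNS failure, timeout, ...
        }
    }

    // Hypothetical helper: true when the header value starts with text/html.
    // A prefix match is used because servers often append "; charset=...".
    public static bool IsHtml(string contentType)
    {
        return contentType != null &&
               contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }
}
```

A caller would then do something like ContentTypeProbe.IsHtml(ContentTypeProbe.GetContentType(url)) before starting the real download. Keep in mind that some servers answer HEAD differently from GET, so this is a hint rather than a guarantee.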

Your question is a bit confusing: if you're using an instance of the Net.WebClient class, the Net.WebResponse doesn't enter into the equation (apart from the fact that it's indeed an abstract class, and you'd be using a concrete implementation such as HttpWebResponse, as pointed out in another response).

Anyway, when using WebClient, you can achieve what you want by doing something like this:

Dim wc As New Net.WebClient()
Dim LocalFile As String = IO.Path.Combine(Environment.GetEnvironmentVariable("TEMP"), Guid.NewGuid.ToString)
wc.DownloadFile("http://example.com/somefile", LocalFile)
If wc.ResponseHeaders("Content-Type") IsNot Nothing AndAlso Not wc.ResponseHeaders("Content-Type").StartsWith("text/html", StringComparison.OrdinalIgnoreCase) Then
    IO.File.Delete(LocalFile)
Else
    '//Process the file
End If

Note that you do have to check for the existence of the Content-Type header, as the server is not guaranteed to return it (although most modern HTTP servers will always include it). If no Content-Type header is present, you can fall back to another HTML detection method: for example, open the file, read the first 1K characters or so into a string, and see if that contains the substring <html>.
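
The fallback described above could look roughly like the following sketch. HtmlSniffer, LooksLikeHtml and FileLooksLikeHtml are hypothetical names, and the check looks for "<html" without the closing bracket, on the assumption that the tag may carry attributes such as lang:

```csharp
using System;
using System.IO;

static class HtmlSniffer
{
    // Hypothetical fallback: true if the first 1024 characters contain "<html"
    // (case-insensitive). This is a heuristic, not a real HTML parser.
    public static bool LooksLikeHtml(TextReader reader)
    {
        char[] buffer = new char[1024];
        int read = reader.ReadBlock(buffer, 0, buffer.Length);
        string head = new string(buffer, 0, read);
        return head.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Convenience wrapper for a file that has already been downloaded.
    public static bool FileLooksLikeHtml(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            return LooksLikeHtml(reader);
        }
    }
}
```

With the example above, HtmlSniffer.FileLooksLikeHtml(LocalFile) would be called only when the Content-Type header is missing. HTML fragments without an html element will be missed by this check.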

Also note that this is a bit wasteful, as you'll always transfer the full file, prior to deciding whether you want it or not. To work around that, switching to the Net.HttpWebRequest/Response classes might help, but whether the extra code is worth it depends on your application...
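
For comparison, here is a sketch of what the HttpWebRequest/HttpWebResponse route might look like with a single GET request: the headers are inspected first and the body is read only if it is HTML. HtmlOnlyDownloader, DownloadIfHtml and IsHtmlContentType are names invented for this example; it assumes an http or https URL and, for simplicity, decodes the body with StreamReader's default encoding rather than the charset from the header:

```csharp
using System;
using System.IO;
using System.Net;

static class HtmlOnlyDownloader
{
    // Hypothetical helper: prefix match so "text/html; charset=utf-8" also passes.
    public static bool IsHtmlContentType(string contentType)
    {
        return contentType != null &&
               contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }

    // Hypothetical helper: returns the page text, or null if the response
    // is not HTML. WebException is left for the caller to handle.
    public static string DownloadIfHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // GetResponse returns once the headers have arrived; the body is
        // read only when the response stream is consumed below.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            if (!IsHtmlContentType(response.ContentType))
            {
                request.Abort(); // give up on the request without reading the body
                return null;
            }

            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Whether this saves much bandwidth depends on how much of the body the server has already sent by the time the request is aborted.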

I apologize for not being very clear. I wrote a wrapper class that extends WebClient. In this wrapper class, I added a cookie container and exposed the timeout property of the WebRequest.

I was using DownloadDataAsync() from this wrapper class, and I wasn't able to retrieve the content type from the WebResponse of this wrapper class. My main intention is to intercept the response and determine whether it is text/html. If it isn't, I will abort the request.

I managed to obtain the content-type after overriding WebClient.GetWebResponse(WebRequest, IAsyncResult) method.

The following is a sample of my wrapper class:

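using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public class MyWebClient : WebClient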
{
    private CookieContainer _cookieContainer;
    private string _userAgent;
    private int _timeout;
    private WebResponse _response;
    private WebRequest _request;

    public MyWebClient()
    {
        this._cookieContainer = new CookieContainer();
        this.SetTimeout(60 * 1000);
    }

    public MyWebClient SetTimeout(int timeout)
    {
        this._timeout = timeout; // applied to each request in GetWebRequest
        return this;
    }

    public WebResponse Response
    {
        get { return this._response; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        if (request.GetType() == typeof(HttpWebRequest))
        {
            ((HttpWebRequest)request).CookieContainer = this._cookieContainer;
            ((HttpWebRequest)request).UserAgent = this._userAgent;
            ((HttpWebRequest)request).Timeout = this._timeout;
        }

        this._request = request;
        return request;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        this._response = base.GetWebResponse(request);
        return this._response;
    }

    protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
    {
        this._response = base.GetWebResponse(request, result);
        return this._response;
    }

    public MyWebClient ServerCertValidation(bool validate)
    {
        if (!validate) ServicePointManager.ServerCertificateValidationCallback += delegate(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors) { return true; };
        return this;
    }
}

Here is a method using TCP, which HTTP is built on top of. It only checks whether the server accepts a connection, not what content type it serves. It returns when connected or after the timeout (in milliseconds), so the timeout value may need to be changed depending on your situation:

var result = false;
try {
    using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)) {
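        // BeginConnect expects a host name, not a full URI
        var asyncResult = socket.BeginConnect(yourUri.Host, 80, null, null);
        // The wait handle is also signaled when the connect fails, so check Connected too
        result = asyncResult.AsyncWaitHandle.WaitOne(100, true) && socket.Connected;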
        socket.Close();
    }
}
catch { }
return result;
Licensed under: CC-BY-SA with attribution