How can I check whether System.Net.WebClient.DownloadData is downloading a binary file?

https://stackoverflow.com/questions/153451

03-07-2019

Question

I am trying to use WebClient to download a file from the web in a WinForms application. However, I only want to download HTML files; any other type should be ignored.

I checked WebResponse.ContentType, but its value is always null.

Does anyone have any idea what the cause might be?

Solution

Given your update, you can do this by changing the .Method in GetWebRequest:

using System;
using System.Net;
static class Program
{
    static void Main()
    {
        using (MyClient client = new MyClient())
        {
            client.HeadOnly = true;
            string uri = "http://www.google.com";
            byte[] body = client.DownloadData(uri); // note should be 0-length
            string type = client.ResponseHeaders["content-type"];
            client.HeadOnly = false;
            // check 'tis not binary... we'll use text/, but could
            // check for text/html
            if (type != null && type.StartsWith(@"text/")) // header may be absent
            {
                string text = client.DownloadString(uri);
                Console.WriteLine(text);
            }
        }
    }

}

class MyClient : WebClient
{
    public bool HeadOnly { get; set; }
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest req = base.GetWebRequest(address);
        if (HeadOnly && req.Method == "GET")
        {
            req.Method = "HEAD";
        }
        return req;
    }
}

Alternatively, you can check the header when overriding GetWebResponse(), perhaps throwing an exception if it is not what you wanted:

protected override WebResponse GetWebResponse(WebRequest request)
{
    WebResponse resp = base.GetWebResponse(request);
    string type = resp.Headers["content-type"];
    // do something with type
    return resp;
}
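
As one possible way to fill in the "do something with type" step, the sketch below compares only the media type portion of the header, since servers often append parameters such as "; charset=utf-8". The class and method names (ContentTypeCheck, IsHtml) are hypothetical and not part of WebClient:

```csharp
using System;

static class ContentTypeCheck
{
    // Returns true when the Content-Type header value denotes text/html.
    // Anything after the first ';' (for example a charset parameter) is ignored.
    public static bool IsHtml(string contentType)
    {
        // A missing or empty header is treated as "not HTML" (an assumption).
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return false;
        }

        int semicolon = contentType.IndexOf(';');
        string mediaType = semicolon >= 0
            ? contentType.Substring(0, semicolon)
            : contentType;

        // Media types are case-insensitive, so compare accordingly.
        return mediaType.Trim().Equals("text/html", StringComparison.OrdinalIgnoreCase);
    }
}
```

Inside the override you could then call ContentTypeCheck.IsHtml(type) and throw, or abort the request, when it returns false.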

Other suggestions

I'm not sure, but perhaps you haven't downloaded anything yet. This is a lazy way to get the content type of a remote file or page (I haven't checked whether it is efficient on the wire; for all I know, it may download huge chunks of content):

        Stream connection = null; // assigned once the request succeeds
        WebClient wc = new WebClient();
        string contentType = null; // stays null if the request fails
        try
        {
            connection = wc.OpenRead(current.Url);
            contentType = wc.ResponseHeaders["content-type"];
        }
        catch (Exception)
        {
            // 404 or what have you
        }
        finally
        {
            if (connection != null)
            {
                connection.Close();
            }
        }

WebResponse is an abstract class, and the ContentType property is defined in the classes that inherit from it. For example, HttpWebResponse overrides this property to return the content-type header. I'm not sure which WebResponse implementation WebClient uses. If you ONLY want HTML files, you are best off using the HttpWebRequest object directly.

Could you issue the first request with the HEAD verb and check the content-type response header? [edit] It looks like you would need to use HttpWebRequest for this, though.
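
A minimal sketch of that idea, assuming HttpWebRequest is used directly. The class and method names (HeadProbe, CreateHeadRequest, GetContentType) are hypothetical, and error handling is left out, so GetResponse may throw a WebException on failures such as a 404:

```csharp
using System;
using System.Net;

static class HeadProbe
{
    // Builds (but does not send) a HEAD request for the given URL.
    public static HttpWebRequest CreateHeadRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        return request;
    }

    // Sends the HEAD request and returns the Content-Type header value.
    // The result may be null or empty if the server did not send the header.
    public static string GetContentType(string url)
    {
        HttpWebRequest request = CreateHeadRequest(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return response.ContentType;
        }
    }
}
```

A caller could check the result of HeadProbe.GetContentType(url) first and only issue the full GET when it indicates HTML. Not every server handles HEAD the same way as GET, so treat the result as a hint.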

Your question is a bit confusing: if you are using an instance of the Net.WebClient class, Net.WebResponse doesn't enter into the equation (apart from the fact that it is indeed an abstract class, and you would be using a concrete implementation such as HttpWebResponse, as pointed out in another answer).

In any case, when using WebClient, you can get what you want by doing something like this:

Dim wc As New Net.WebClient()
Dim LocalFile As String = IO.Path.Combine(Environment.GetEnvironmentVariable("TEMP"), Guid.NewGuid.ToString)
wc.DownloadFile("http://example.com/somefile", LocalFile)
If Not wc.ResponseHeaders("Content-Type") Is Nothing AndAlso wc.ResponseHeaders("Content-Type") <> "text/html" Then
    IO.File.Delete(LocalFile)
Else
    '//Process the file
End If

Note that you have to check for the existence of the Content-Type header, since the server is not guaranteed to return it (although most modern HTTP servers will always include it). If no Content-Type header is present, you can fall back to another HTML detection method, for example opening the file, reading the first 1K characters or so into a string, and seeing whether it contains the substring <html>.
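
The fallback described above could be sketched roughly as follows. The names (HtmlSniffer, LooksLikeHtml) are hypothetical, the text is assumed to be UTF-8 unless a byte order mark says otherwise, and the check looks for "<html" without the closing bracket so that tags with attributes also match:

```csharp
using System;
using System.IO;
using System.Text;

static class HtmlSniffer
{
    // Reads up to maxChars characters from the stream and reports whether
    // that prefix contains an opening "<html" tag (case-insensitive).
    public static bool LooksLikeHtml(Stream stream, int maxChars = 1024)
    {
        char[] buffer = new char[maxChars];
        int total = 0;

        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, true))
        {
            int read;
            while (total < maxChars &&
                   (read = reader.Read(buffer, total, maxChars - total)) > 0)
            {
                total += read;
            }
        }

        string prefix = new string(buffer, 0, total);
        return prefix.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

For a downloaded file you might call it as HtmlSniffer.LooksLikeHtml(File.OpenRead(path)). This is only a heuristic: HTML fragments without an html element, or pages with a very long preamble, would be missed.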

Note also that this is somewhat wasteful, since you will always transfer the complete file before deciding whether you want it or not. To work around that, switching to the Net.HttpWebRequest/Response classes might help, but whether the extra code is worth it depends on your application...

My apologies for not being very clear. I wrote a wrapper class that extends WebClient. In this wrapper class, I added a cookie container and exposed the timeout property for the WebRequest.

I was using DownloadDataAsync() from this wrapper class and was unable to retrieve the content type from the WebResponse of this wrapper class. My main intention is to intercept the response and determine whether it is of a text/html nature. If it is not, I will abort the request.

I managed to get the content type after overriding the WebClient.GetWebResponse(WebRequest, IAsyncResult) method.

Below is a sample of my wrapper class:

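using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public class MyWebClient : WebClient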
{
    private CookieContainer _cookieContainer;
    private string _userAgent;
    private int _timeout;
    private WebRequest _request;
    private WebResponse _response;

    public MyWebClient()
    {
        this._cookieContainer = new CookieContainer();
        this.SetTimeout(60 * 1000);
    }

    public MyWebClient SetTimeout(int timeout)
    {
        this._timeout = timeout; // applied to each request in GetWebRequest
        return this;
    }

    public WebResponse Response
    {
        get { return this._response; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        if (request.GetType() == typeof(HttpWebRequest))
        {
            ((HttpWebRequest)request).CookieContainer = this._cookieContainer;
            ((HttpWebRequest)request).UserAgent = this._userAgent;
            ((HttpWebRequest)request).Timeout = this._timeout;
        }

        this._request = request;
        return request;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        this._response = base.GetWebResponse(request);
        return this._response;
    }

    protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
    {
        this._response = base.GetWebResponse(request, result);
        return this._response;
    }

    public MyWebClient ServerCertValidation(bool validate)
    {
        if (!validate) ServicePointManager.ServerCertificateValidationCallback += delegate(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors) { return true; };
        return this;
    }
}

Here is a method that uses TCP, on which HTTP is built. It returns when connected or after the timeout (in milliseconds), so you may need to adjust the value depending on your situation:

var result = false;
try {
    using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)) {
        // BeginConnect expects a host name, not the full URI string
        var asyncResult = socket.BeginConnect(yourUri.Host, 80, null, null);
        result = asyncResult.AsyncWaitHandle.WaitOne(100, true);
        socket.Close();
    }
}
catch { }
return result;

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow