How do you parse an HTML string for image tags to get at the SRC information?

https://stackoverflow.com/questions/138839

02-07-2019

Question

Currently I use the .NET WebBrowser.Document.Images() to do this. It requires the WebBrowser control to load the document. It's messy and takes up resources.

According to this question, XPath is better than a regex at this.

Does anyone know how to do this in C#?

Solution

If your input string is valid XHTML, you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.

Otherwise you can try this function, which returns all image links from htmlSource:

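// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    // Group 1 captures the src value, whether it is quoted or unquoted.
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+)";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        // new Uri(href) throws on relative paths such as "images/a.png",
        // so accept relative and absolute values and skip anything unparsable.
        Uri link;
        if (Uri.TryCreate(href, UriKind.RelativeOrAbsolute, out link))
        {
            links.Add(link);
        }
    }
    return links;
}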

And you can use it like this:

// Requires: using System; using System.Collections.Generic; using System.IO; using System.Net;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
// Dispose the response as well as the reader so the connection is released.
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.OK)
    {
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
        }
    }
}

Other tips

The big issue with any HTML parsing is the "well formed" part. You've seen the crap HTML out there. How much of it is really well formed? I needed to do something similar: parse out all the links in a document and (in my case) update them with a rewritten link. I found the Html Agility Pack over on CodePlex. It rocks (and handles malformed HTML).

Here's a snippet for iterating over the links in a document:

// Requires the Html Agility Pack: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(@"C:\Sample.HTM");
// Select the <a> elements that carry an href attribute.
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");

// SelectNodes returns null when nothing matches, so run only if there are links.
if (linkNodes != null)
{
    foreach (HtmlNode linkNode in linkNodes)
    {
        HtmlAttribute attrib = linkNode.Attributes["href"];
        // Do whatever else you need here
    }
}
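
Since the question is about images rather than links, the same approach can be pointed at img elements. This is only a minimal sketch, assuming the Html Agility Pack package is referenced; the class name HapImgSrcSketch and the method name ExtractImageSources are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HapImgSrcSketch
{
    // Hypothetical helper: returns the raw src value of every <img> that has one.
    public static List<string> ExtractImageSources(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parses from a string instead of a file

        List<string> sources = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection imgNodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (imgNodes == null)
        {
            return sources;
        }

        foreach (HtmlNode img in imgNodes)
        {
            sources.Add(img.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

The values come back exactly as written in the markup, so relative paths stay relative; resolving them against the page address is left to the caller.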

Original blog post

If all you need is images, I would just use a regular expression. Something like this should do the trick:

Regex rg = new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);
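
To show how that pattern might be applied, here is a rough sketch that collects capture group 1 from every match. The class name ImgSrcRegexSketch and the method name ExtractImageSources are invented for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcRegexSketch
{
    // Same pattern as above: lazily skip ahead to src="..." after "<img"
    // and capture the double-quoted value in group 1.
    private static readonly Regex ImgSrc =
        new Regex(@"<img.*?src=""(.*?)""", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured src values in document order.
    public static List<string> ExtractImageSources(string html)
    {
        List<string> sources = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }
}
```

Keep in mind that this pattern only sees double-quoted values, and because "." does not match a newline by default, a tag whose src sits on a later line than "<img" is missed.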

If it's valid XHTML, you could do this:

XmlDocument doc = new XmlDocument();
doc.LoadXml(html);
XmlNodeList results = doc.SelectNodes("//img/@src");
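
One catch with the XPath route: real XHTML usually declares xmlns="http://www.w3.org/1999/xhtml" on the root element, and in that case a plain //img/@src selects nothing, because the elements live in that namespace. A possible way around it is sketched below; the class name XhtmlImgSrcSketch, the method name ExtractImageSources, and the prefix "x" are arbitrary choices for this example.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlImgSrcSketch
{
    // Hypothetical helper: returns every img src value from well-formed XHTML.
    public static List<string> ExtractImageSources(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // The prefix "x" only has to be bound to the XHTML namespace URI.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", "http://www.w3.org/1999/xhtml");

        List<string> sources = new List<string>();
        // The union matches img elements with and without the XHTML namespace.
        foreach (XmlNode attr in doc.SelectNodes("//img/@src | //x:img/@src", ns))
        {
            sources.Add(attr.Value);
        }
        return sources;
    }
}
```

Anything that is not well-formed XML makes LoadXml throw, so this only suits input you know to be clean.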

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow