Question

I wrote this piece of code to get the HTML from a URL entered by the user. I used HtmlAgilityPack because I want to work with only specific parts of the page (body, title, etc.). I succeeded in "downloading" the data from the website, but I guess my XPath expression is somehow incorrect. Here is the relevant code:

Dim htmlWeb As String = URL ' inserted by the user
Dim htmlDoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
htmlDoc.LoadHtml(htmlWeb)
Dim htmlText As String
htmlDoc.OptionFixNestedTags = True
Dim myBR As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("...")
htmlText = myBR("...").InnerText

    For Each Match_Positive_Word As Match In Regex.Matches(htmlText, Positive_Words)
        Positive_Counter = Positive_Counter + 1
    Next

    For Each Match_Negative_Word As Match In Regex.Matches(htmlText, Negative_Words)
        Negative_Counter = Negative_Counter + 1
    Next

Questions:

  • What do I need to write inside the brackets to get, for instance, the data inside the body tag?
  • Is the way I connect to the HTML correct? Is there a better or more efficient way to do this?

EDIT

When I call htmlDoc.Load(htmlWeb), it gives me the error: URI formats are not supported. But when I call LoadHtml, it seems to work. The main problem is in the line htmlText = myBR.InnerText, which returns the error: Object reference not set to an instance of an object. Here is what I wrote:

Dim htmlWeb As String = URL
Dim htmlDoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
    htmlDoc.LoadHtml(htmlWeb)
Dim htmlText As String
    htmlDoc.OptionFixNestedTags = True
Dim myBR As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//body")
    htmlText = myBR().InnerText

Is there anything I need to add in the brackets of myBR? I tried .InnerHtml and it doesn't work either.


Solution

Use the Load() method if you have the URL, and use LoadHtml() if you already have the HTML of the page. It appears that you need the first method in this case:

htmlDoc.Load(htmlWeb)

As for the example you asked about: to get the body tag, you can use the simple XPath expression //body
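
To make the XPath part concrete, here is a minimal sketch that selects the title and body of a page. It assumes the HtmlAgilityPack package is referenced; the inline sampleHtml string and the module name are made up for illustration, so nothing is downloaded.

```vbnet
Imports System
Imports HtmlAgilityPack

Module XPathSketch
    Sub Main()
        ' Hypothetical inline page, used so the sketch needs no network access.
        Dim sampleHtml As String =
            "<html><head><title>Demo page</title></head>" &
            "<body><p>Hello <b>world</b></p></body></html>"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(sampleHtml) ' LoadHtml parses markup that is already in memory

        ' SelectSingleNode returns the first node matching the XPath expression.
        Dim titleNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//title")
        Dim bodyNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//body")

        ' It returns Nothing when no node matches, so check before reading properties.
        If titleNode IsNot Nothing Then Console.WriteLine(titleNode.InnerText)
        If bodyNode IsNot Nothing Then Console.WriteLine(bodyNode.InnerText)
    End Sub
End Module
```

The same pattern should work for other elements, for example //h1 or //div[@id='content'], where the id value is only a placeholder.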

UPDATE:

I missed the fact that HAP's HtmlDocument, unlike XDocument, can't load directly from a URL. Given a string, its Load() method expects a path to a file on the local machine. To load an HtmlDocument from a URL, you need to use HtmlWeb's Load() method instead. Try it this way:

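Dim web As New HtmlWeb()
Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load(URL) ' downloads and parses the page
htmlDoc.OptionFixNestedTags = True
Dim htmlText As String = String.Empty
Dim myBR As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//body")
' SelectSingleNode returns Nothing when no node matches
If myBR IsNot Nothing Then
    htmlText = myBR.InnerText ' InnerText is a property, so no parentheses
End If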
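
As far as I can tell, this also explains the "Object reference not set to an instance of an object" error from the question. LoadHtml() treats its argument as markup, so passing it a URL parses the URL text itself. The resulting document has no body element, and SelectSingleNode("//body") returns Nothing. A minimal sketch of that pitfall, with a placeholder URL and a made-up module name:

```vbnet
Imports System
Imports HtmlAgilityPack

Module LoadHtmlPitfall
    Sub Main()
        ' Placeholder address; LoadHtml never contacts it.
        Dim url As String = "http://example.com/"

        Dim doc As New HtmlDocument()
        doc.LoadHtml(url) ' parses the URL text itself; nothing is downloaded

        Dim bodyNode As HtmlNode = doc.DocumentNode.SelectSingleNode("//body")
        If bodyNode Is Nothing Then
            ' Reading bodyNode.InnerText here would throw a NullReferenceException.
            Console.WriteLine("No <body> found: the parsed text was not a web page.")
        Else
            Console.WriteLine(bodyNode.InnerText)
        End If
    End Sub
End Module
```

Checking for Nothing before reading InnerText keeps the failure visible instead of crashing.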

Licensed under: CC-BY-SA with attribution