HTML <p> nodes InnerText including anchor text in CsQuery

https://stackoverflow.com/questions/22315019

csquery

12-06-2023

Question

I'm parsing some WordPress blog articles using CsQuery to do some text clustering analysis on them. I'd like to extract the text from the pertinent <p> node.

var content = dom["div.entry-content>p"];
if (content.Length == 1)
{
    System.Diagnostics.Debug.WriteLine(content[0].InnerHTML);
    System.Diagnostics.Debug.WriteLine(content[0].InnerText);
}

In one of the posts the InnerHTML looks like this:

An MIT Europe project that attempts to <a title="Wired News: Gizmo Puts Cards 
on the Table" href="http://www.wired.com/news/technology/0,1282,61265,00.html?
tw=rss.TEK">connect two loved ones seperated by distance</a> through the use 
of two tables, a bunch of RFID tags and a couple of projectors.

and the corresponding InnerText looks like this:

An MIT Europe project that attempts to through the use of two tables, a bunch of RFID tags and a couple of projectors.

That is, the inner text is missing the anchor text. I could parse the HTML myself, but I am hoping there is a way to have CsQuery give me:

An MIT Europe project that attempts to connect two loved ones seperated by distance through the use of two tables, a bunch of RFID tags and a couple of projectors.

(with the anchor text, "connect two loved ones seperated by distance", included). How can I get this?

Solution

   string result = dom["div.entry-content>p"].Text();

The Text() method returns the combined text of the selected <p> element and everything below it, so the anchor text is included.
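For context, here is a minimal, self-contained sketch of how that one-liner might be used. The class and method names (ParagraphText, Extract) are made up for illustration, and the sample HTML in the usage comment is an assumption modelled on the snippet in the question. Text() concatenates the text of every matched element, so this relies on the selector matching a single <p>, as the question's Length check does.

```csharp
using System;
using CsQuery;

// Hypothetical helper, not part of CsQuery itself.
public static class ParagraphText
{
    // Returns the text of div.entry-content > p, including the text of
    // nested elements such as <a>.
    public static string Extract(string html)
    {
        CQ dom = CQ.Create(html);                  // parse the HTML string
        return dom["div.entry-content>p"].Text();  // text of the <p> and its descendants
    }
}

// Usage (sample markup is an assumption):
// string html = "<div class=\"entry-content\"><p>attempts to " +
//               "<a href=\"#\">connect two loved ones</a> through tables</p></div>";
// Console.WriteLine(ParagraphText.Extract(html));
```

Line breaks inside the source markup are kept in the returned string, so you may want to normalise whitespace before feeding the text to a clustering step.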

Other tips

Try using HtmlAgilityPack:

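using System;
using HAP = HtmlAgilityPack;
...
var doc = new HAP.HtmlDocument();
doc.LoadHtml("Your html");
// Replace "node xPath" with the XPath of the element you want.
var node = doc.DocumentNode.SelectSingleNode(@"node xPath");
// InnerText is a property, not a method.
Console.WriteLine(node.InnerText);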

The XPath is the path to the node on the page.

For example, in Google Chrome, press F12, select your node, then right-click it and select "Copy XPath".

The XPath of this topic's header is: //*[@id="question-header"]/h1/a
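Applied to the markup in the question, the HtmlAgilityPack approach could look roughly like the sketch below. The XPath expression is an assumption about the page structure (a <p> directly inside a div whose class attribute contains entry-content), and the class and method names are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class HapParagraphText
{
    // Returns the text of the first matching <p>, or null if nothing matches.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumed XPath: roughly equivalent to the CSS selector div.entry-content>p
        HtmlNode node = doc.DocumentNode.SelectSingleNode(
            "//div[contains(@class,'entry-content')]/p");

        // InnerText includes the text of child elements such as <a>;
        // DeEntitize turns entities like &amp; back into plain characters.
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText);
    }
}
```

SelectSingleNode returns null when nothing matches, hence the null check before reading InnerText.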

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow