I have an ASP.NET MVC 4 project in which I am trying to parse an HTML document with HtmlAgilityPack. I have the following HTML:

<td class="pl22">
  <p class='pb10 pt10 t_grey'>Experience:</p>
  <p class='bold'>any</p>
</td>
<td class='pb10 pl20'>
  <p class='t_grey pb10 pt10'>Education:</p>
  <p class='bold'>any</p>
</td>
<td class='pb10 pl20'>
  <p class='pb10 pt10 t_grey'>Schedule:</p>
  <p class='bold'>part-time</p>
  <p class='text_12'>2/2 (day/night)</p>
</td>

I need to get these values:

  1. "any" after "Experience:"
  2. "any" after "Education:"
  3. "part-time", "2/2 (day/night)" after "Schedule:"

All I could come up with is this:

HtmlNode experience = hd.DocumentNode.SelectSingleNode("//td[@class='pl22']//p[@class='bold']");

But it returns a different element, one located at the top of the page. The labels Experience, Education and Schedule are static, while the values (any, any, part-time, 2/2 (day/night)) are dynamic. Can anybody help me?


Solution

Below is an alternative that keys on the label text (Experience, Education and Schedule) instead of the node classes:

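// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
private static List<string> GetValues(HtmlDocument doc, string header) {
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes(string.Format("//p[contains(text(), '{0}')]/following-sibling::p", header));
    return nodes == null ? new List<string>() : nodes.Select(x => x.InnerText).ToList();
}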

You can call that method like this:

var experiences = GetValues(doc, "Experience");
var educations = GetValues(doc, "Education");
var schedules = GetValues(doc, "Schedule");

experiences.ForEach(Console.WriteLine);
educations.ForEach(Console.WriteLine);
schedules.ForEach(Console.WriteLine);
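
For anyone who wants to try the label-based lookup end to end, here is a minimal self-contained sketch. It assumes the HtmlAgilityPack NuGet package is referenced. The names LabelLookup and ValuesAfterLabel are made up for illustration and are not part of the library. It also assumes the label contains no single quote, because the label is pasted directly into the XPath string.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LabelLookup
{
    // Hypothetical helper: returns the trimmed text of every <p> that follows,
    // under the same parent, the <p> whose text contains the given label.
    public static List<string> ValuesAfterLabel(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumption: label has no single quote, since it is embedded in an XPath literal.
        var xpath = string.Format(
            "//p[contains(text(), '{0}')]/following-sibling::p", label);

        // SelectNodes returns null rather than an empty collection on no match.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

With the HTML from the question, LabelLookup.ValuesAfterLabel(html, "Schedule") should return the two entries "part-time" and "2/2 (day/night)", and an unknown label should return an empty list instead of throwing.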

Other tips

You could do something like this if you want to keep the class-based XPath:

var html = "<td class='pl22'><p class='pb10 pt10 t_grey'>Experience:</p><p class='bold'>any</p></td><td class='pb10 pl20'><p class='t_grey pb10 pt10'>Education:</p><p class='bold'>any</p></td><td class='pb10 pl20'><p class='pb10 pt10 t_grey'>Schedule:</p><p class='bold'>part-time</p><p class='text_12'>2/2 (day/night)</p></td> ";

var doc = new HtmlDocument
{
     OptionDefaultStreamEncoding = Encoding.UTF8
};

doc.LoadHtml(html);

var part1 = doc.DocumentNode.SelectSingleNode("//td[@class='pl22']/p[@class='bold']");
var part2 = doc.DocumentNode.SelectNodes("//td[@class='pb10 pl20']/p[@class='bold']");

foreach (var item in part2)
{
    Console.WriteLine(item.InnerText);
}

var part3 = doc.DocumentNode.SelectSingleNode("//td[@class='pb10 pl20']/p[@class='text_12']");

Console.WriteLine(part1.InnerText);            
Console.WriteLine(part3.InnerText);

Output:

any
part-time
any
2/2 (day/night)
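
One caveat with the class-based XPath above: @class='pb10 pl20' compares the whole attribute string, so it stops matching if the classes appear in a different order or an extra class is added. The sample HTML already lists 'pb10 pt10 t_grey' and 't_grey pb10 pt10' in different orders. A common XPath 1.0 idiom matches a single class token instead. The sketch below assumes HtmlAgilityPack is referenced. ClassXPath, HasClass and TextOf are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ClassXPath
{
    // Builds an XPath 1.0 predicate that is true when the class attribute
    // contains `name` as a whole space-separated token (so "pl20" does not match "pl22" or "pl200").
    public static string HasClass(string name)
    {
        return string.Format(
            "contains(concat(' ', normalize-space(@class), ' '), ' {0} ')", name);
    }

    // Returns the trimmed text of each <p> carrying pClass that is a direct
    // child of a <td> carrying tdClass, regardless of class order.
    public static List<string> TextOf(string html, string tdClass, string pClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var xpath = "//td[" + HasClass(tdClass) + "]/p[" + HasClass(pClass) + "]";

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null
            ? new List<string>()
            : nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

For the sample HTML, ClassXPath.TextOf(html, "pl20", "bold") should give "any" and "part-time", the same nodes as part2, without depending on the exact class string.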
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow