Question

Below is an excerpt of code used to screen-scrape an economic calendar. The HTML page that it parses using XPath has this row as the first row in a table. (I pasted only this row instead of the entire HTML page.)

<tr class="calendar_row newday singleevent" data-eventid="42064"> <td class="date"><div class="date">Sun<div>Dec 23</div></div></td> <td class="time">All Day</td> <td class="currency">JPY</td> <td class="impact"> <div title="Non-Economic" class="holiday"></div> </td> <td class="event"><div>Bank Holiday</div></td> <td class="detail"><a class="calendar_detail level1" data-level="1"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td> </tr>

This code selects the first tr row using XPath:

var doc = new HtmlDocument();
doc.Load(new StringReader(html));
var rows = doc.DocumentNode.SelectNodes("//tr[@class=\"calendar_row\"]");
var rowHtml = rows[0].InnerHtml;

The problem is that rowHtml returns this:

<td class="date"></td> <td class="time">All Day</td> <td class="currency">EUR</td> <td class="impact">  <div title="Non-Economic" class="holiday"></div>  </td> <td class="event"> <div>French Bank Holiday</div> </td> <td class="detail"><a class="calendar_detail level2" data-level="2"></a></td> <td class="actual"> </td> <td class="forecast"></td> <td class="previous"></td> <td class="graph"></td>

As you can see, the contents of the td column for the date have vanished. Why?

I've tried many things and am stumped as to why it drops the contents of that column. The other columns have content that it keeps. So what's wrong with the date column?

Is there a setting or property somewhere that causes or prevents dropping contents?

Even if you don't know what's wrong, suggestions for how to investigate further would be welcome.

Solution

As @AlexeiLevenkov mentioned, you must be selecting a different row from the one you want. You've pruned away too much of the essential problem in an effort to simplify, but it's still clear what's wrong.

Consider that your input document might basically look like this:

<?xml version="1.0" encoding="UTF-8"?>
<table>
  <tr class="calendar_row" data-eventid="12345">
    <td>This IS NOT the tr you're looking for</td>
  </tr>
  <tr class="calendar_row newday singleevent" data-eventid="42064">
    <td>This IS the tr you're looking for</td>
  </tr>
</table>

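The test @class="calendar_row" is an exact string comparison against the whole attribute value, so it won't match the tr you show (its class is "calendar_row newday singleevent"), but it will match the first row here. That is consistent with your output: the returned HTML contains EUR and "French Bank Holiday", not the JPY "Bank Holiday" row you pasted.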

You could change your test to be contains(@class,'calendar_row') instead, but that would match both rows. You're going to have to identify some content or attribute that's unique to the row you desire. Perhaps the @data-eventid attribute would work -- can't tell without seeing your whole input file.
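
The difference between these tests can be checked against the two-row sample above. The following is a minimal sketch, not the asker's actual setup: it assumes the sample is well-formed XML and uses the built-in System.Xml.Linq and System.Xml.XPath APIs rather than HtmlAgilityPack, and the class name CalendarRowDemo, the Sample string and the EventIds helper are hypothetical names made up for illustration. The XPath 1.0 expressions themselves should carry over to SelectNodes, but that is worth confirming against the real page.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class CalendarRowDemo
{
    // Hypothetical sample modelled on the two-row document in the answer.
    public const string Sample = @"<table>
  <tr class='calendar_row' data-eventid='12345'><td>other row</td></tr>
  <tr class='calendar_row newday singleevent' data-eventid='42064'><td>wanted row</td></tr>
</table>";

    // Returns the data-eventid of every tr element matched by the XPath.
    public static string[] EventIds(string xpath)
    {
        var doc = XDocument.Parse(Sample);
        return doc.XPathSelectElements(xpath)
                  .Select(tr => (string)tr.Attribute("data-eventid"))
                  .ToArray();
    }

    public static void Main()
    {
        // Exact comparison: only the row whose class is exactly "calendar_row".
        Console.WriteLine(string.Join(",", EventIds("//tr[@class='calendar_row']")));

        // Substring test: matches both rows.
        Console.WriteLine(string.Join(",", EventIds("//tr[contains(@class,'calendar_row')]")));

        // Whole-token test for one class name among several.
        Console.WriteLine(string.Join(",", EventIds(
            "//tr[contains(concat(' ', normalize-space(@class), ' '), ' newday ')]")));

        // Unique attribute: selects exactly the desired row.
        Console.WriteLine(string.Join(",", EventIds("//tr[@data-eventid='42064']")));
    }
}
```

On this sample the four queries select 12345, then 12345 and 42064, then 42064, then 42064. To investigate the real page, printing rows[0].OuterHtml instead of InnerHtml would show the tr element's own class and data-eventid attributes, which makes it obvious which row was actually selected.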

Licensed under: CC-BY-SA with attribution