Question
I am currently using Scrapy for Python and trying to retrieve information from a website whose source code looks similar to this:
<tr>
<th scope="row">Date</th>
<td>10/17/2001</td>
<td></td>
</tr>
<tr>
<th scope="row">Title</th>
<td>Harry Potter</td>
<td></td>
</tr>
<tr>
<th scope="row">Author</th>
<td>J.K. Rowling</td>
<td></td>
</tr>
Harry Potter is the text string I wish to retrieve. However, I cannot use a plain XPath expression, since this snippet occurs multiple times throughout the page (only the text between the th/td tags differs, as seen in the code above). All of the th tags carry a scope="row" attribute.
Additionally, I cannot scrape just the x-th instance of the tag, because each webpage I am scraping has a variable number of instances.
Is there a way to obtain text (such as Harry Potter) that follows a particular string of text (such as Title) in Scrapy?
Solution
You might want to try:
//tr/th[@scope="row"][.="Title"]/following-sibling::td[1]/text()
OTHER TIPS
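Give it a try. In general, it is worth learning XPath. Note that this expression selects the text of every td cell in every row, not only the title:
//tr/td/text()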