Scrape specific text within html tags that occur after (outside) of other tags

https://stackoverflow.com/questions/21507880

06-10-2022
Question

I am currently using Scrapy for Python and trying to retrieve information from a website with source code similar to this:

    <tr>
    <th scope="row">Date</th>
    <td>10/17/2001</td>
    <td></td>
    </tr>
    <tr>
    <th scope="row">Title</th>
    <td>Harry Potter</td>
    <td></td>
    </tr>
    <tr>
    <th scope="row">Author</th>
    <td>J.K. Rowling</td>
    <td></td>
    </tr>

Harry Potter is the text string I wish to retrieve. However, I cannot use a plain XPath on the tag names alone, since this row structure is repeated throughout the page (only with different text between the th/td tags, as seen in the code above). All of the th tags carry a scope="row" attribute.

Additionally, I cannot scrape just the x-th instance of the tag because each webpage I am scraping has a variable number of instances.

Is there a way to obtain text (such as Harry Potter) that follows a particular string of text (such as Title) in Scrapy?

Solution

You might want to try matching the th by its text and then taking the first td that follows it in the same row:

    //tr/th[@scope="row"][.="Title"]/following-sibling::td[1]/text()

OTHER TIPS

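Give this a try. In general, it is worth learning XPath. Note that this expression selects the text of every td in every row, not only the one in the Title row, so the result still has to be filtered:

    //tr/td/text()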

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow