Scrapy: XPath of a link in a table

https://stackoverflow.com/questions/21857725

13-10-2022

Question

I would like to extract some book links from this table using Scrapy.

The table looks like this:

<table id="table_text">
<tbody>
<tr>
<td>15/02/2014</td>
<td><a href="/book_1.html">Book 1</a></td>
<td>The Author</td>
<td> <a href="/tag1">tag1</a>  <a href="/tag2">tag2</a> </td>
<td>Genre</td>
</tr>

and the extracted link should be:

/book_1.html

The selector that I used is:

def parse(self, response):
    hxs = Selector(response)
    links = hxs.xpath('//table[@id="table_text"]//tr//td[2]//a//@href')

but print links shows an empty output: []

What is wrong with the XPath that I used?

Solution

With the information you gave, your XPath works fine. It could be simplified to

//table[@id="table_text"]//tr/td[2]/a/@href

but your version returns the right node.

When Scrapy behaves unexpectedly, always check that the HTML it receives is the HTML you expect. What you see in a browser and what Scrapy downloads may differ, because Scrapy does not execute JavaScript (and some browsers try to sanitize HTML).

That's why you should check that the content of response.body is what you expect. If it's not, you'll need to find a workaround :)

Licensed under: CC-BY-SA with attribution
