Scraperwiki scrape query: using lxml to extract links
19-06-2021
Question
This is probably a trivial question, but I hope someone can help with a problem I've hit using lxml in a scraper I'm trying to build.
https://scraperwiki.com/scrapers/thisisscraper/
I'm working line by line through tutorial 3 and have got as far as trying to extract the next-page link. I can use cssselect to identify the link, but I can't work out how to isolate just the href attribute rather than the whole anchor tag.
Can anyone help?
def scrape_and_look_for_next_link(url):
    html = scraperwiki.scrape(url)
    print html
    root = lxml.html.fromstring(html)  # turn the HTML into an lxml object
    scrape_page(root)
    next_link = root.cssselect('ol.pagination li a')[-1]
    attribute = lxml.html.tostring(next_link)
    attribute = lxml.html.fromstring(attribute)
    # works up until this point
    attribute = attribute.xpath('/@href')
    attribute = lxml.etree.tostring(attribute)
    print attribute
Solution
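CSS selectors can select elements that have an href attribute, e.g. with a[href],
but they cannot extract the attribute value by themselves.
The object returned by cssselect is already an lxml element, so there is no need
to serialize it with tostring and parse it again.
Once you have the element from cssselect, you can call next_link.get('href')
to get the value of the attribute.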
OTHER TIPS
link = link.attrib['href']
should also work, where link is the element returned by cssselect.
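The two suggestions differ when the attribute is missing: attrib behaves like a dictionary and raises KeyError, while get returns None (or a default you pass in). A small sketch of that difference follows; the fragment and the helper name href_or_none are hypothetical and used only for illustration.

```python
import lxml.html

# Hypothetical fragment: one anchor with an href, one without.
root = lxml.html.fromstring(
    '<p><a href="/page/2">next</a><a name="top">top</a></p>'
)
with_href, without_href = list(root.iter('a'))


def href_or_none(link):
    """Dictionary-style lookup that tolerates a missing href."""
    try:
        return link.attrib['href']
    except KeyError:
        # attrib raises KeyError when the attribute is absent.
        return None


print(href_or_none(with_href))
print(href_or_none(without_href))
print(without_href.get('href', '#'))  # get() accepts a fallback value
```

For a pagination link that may be absent on the last page, get('href') is probably the more forgiving choice.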
Licensed under: CC-BY-SA with attribution