Question

I'm working on a Python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:

<div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
    ...
</div>

So I just need everything from the first href except the page number (that is, 'webpage-category/page/'), and I have the following working code:

pages = [l['href'] for link in soup.find_all('div', class_='pagination')
     for l in link.find_all('a') if not re.search('pageSub', l['href'])]

s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])

The problem is that building this whole list is wasteful, since I only need the first href. I think a generator would be the answer, but I couldn't get it to work. Could you help me make this code more concise?

Solution

What about this:

from bs4 import BeautifulSoup

html = """ <div class='pagination'>
    <a href='webpage-category/page/1'>1</a>
    <a href='webpage-category/page/2'>2</a>
</div>"""

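soup = BeautifulSoup(html, 'html.parser')

# Take the href of the first <a> inside the pagination div.
link = soup.find('div', {'class': 'pagination'}).find('a')['href']

# Drop the last path segment (the page number).
print('/'.join(link.split('/')[:-1]))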

prints:

webpage-category/page
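
For comparison, here is a minimal sketch of how splitting on '/' differs from the digit-stripping approach in the question when the category name itself contains digits. The href 'top-10-cars/page/3' and both helper names are made up for illustration.

```python
def strip_digits(href):
    # Approach from the question: remove every digit anywhere in the string.
    return ''.join(ch for ch in href if not ch.isdigit())


def strip_last_segment(href):
    # Approach from this answer: drop only the final path segment.
    return href.rsplit('/', 1)[0]


href = 'top-10-cars/page/3'  # hypothetical href, not from the original page
print(strip_digits(href))        # top--cars/page/
print(strip_last_segment(href))  # top-10-cars/page
```

Dropping only the last segment leaves digits in the rest of the path untouched, which is probably what you want for a base URL.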

Just FYI, regarding the code you provided: you can use next() with a generator expression instead of a list comprehension, so iteration stops at the first match:

s = next(l['href'] for link in soup.find_all('div', class_='pagination')
         for l in link.find_all('a') if not re.search('pageSub', l['href']))
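
To see that next() really stops at the first match, here is a small stand-alone sketch that uses a plain list of hrefs instead of a parsed page. The tracked() helper and the sample hrefs are assumptions for illustration. It also passes a default to next(), so an empty result gives None instead of raising StopIteration.

```python
import re


def tracked(items, seen):
    """Yield items one by one, recording each item that is actually consumed."""
    for item in items:
        seen.append(item)
        yield item


# Hypothetical hrefs standing in for the links found on the page.
hrefs = ['pageSub/1', 'webpage-category/page/1', 'webpage-category/page/2']
seen = []

# The generator expression is evaluated lazily; next() pulls items
# only until the first one passes the filter.
first = next((h for h in tracked(hrefs, seen) if not re.search('pageSub', h)), None)
print(first)  # webpage-category/page/1
print(seen)   # ['pageSub/1', 'webpage-category/page/1']

# With no matching items, the default (None) is returned.
missing = next((h for h in [] if not re.search('pageSub', h)), None)
print(missing)  # None
```

The third href is never examined, which is exactly the saving over building the full list first.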

Update (using the website link provided):

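import urllib.request
from bs4 import BeautifulSoup


url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')

# Use the second pagination block on the page (index 1).
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')

# Take the first numbered link other than page 1 and drop its trailing page number.
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))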

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow