Question

http://www.snapdeal.com/

I was trying to scrape all the links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.

under "See All categories" Tab you will find all major product categories. If you hover the mouse over any category it will expand the categories. I want those links from each major categories.

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l

But this gave me a different result from what I expected (I turned off JavaScript, looked at the page source, and the output matched that source).

I just want to find all the sub-links under each major category. Any suggestions would be appreciated.

Solution

This is happening because you are letting BeautifulSoup choose its own best parser, and you may not have lxml installed.

The best option is to use html.parser to parse the page.

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()

page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print l

This worked for me. Make sure the dependencies are installed.

Other tips

I think you should try another library, such as Selenium; it provides a web driver for you, and that is the advantage of this library. I myself couldn't handle JavaScript with bs4.
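
For illustration, here is a minimal sketch of that approach. It assumes an older Selenium release (with the find_elements_by_* API) and Firefox with its driver installed; it is my sketch of the suggestion, not code from the original answer.

from selenium import webdriver

# A real browser engine executes the JavaScript that builds the menu
driver = webdriver.Firefox()
driver.get('http://www.snapdeal.com/')

# After rendering, collect the href of every anchor on the page
for a in driver.find_elements_by_tag_name('a'):
    print a.get_attribute('href')

driver.quit()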

"Categories Menu" is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest). To examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can inspect the XHRs a website makes under the Network tab of the aforementioned tools.
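
Once you have identified the right request, you can call it directly. A sketch of the idea, where the endpoint below is a placeholder I made up, not Snapdeal's real XHR URL; substitute whatever you find in the Network tab:

import json
import urllib2

# Placeholder endpoint -- replace with the XHR URL from the Network tab
xhr_url = 'http://www.snapdeal.com/path/to/categories/xhr'
raw = urllib2.urlopen(xhr_url).read()

# XHR endpoints often return JSON, which is easier to walk than HTML
data = json.loads(raw)
print data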

Use a web-scraping tool such as Scrapy or mechanize. In mechanize, to get all the links on the Snapdeal homepage:

from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.text
    print link.url
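
And for completeness, a minimal Scrapy spider for the same job might look like the following (the spider name and item shape are my own choices, not from the original answer; run it with scrapy runspider):

import scrapy

class SnapdealLinkSpider(scrapy.Spider):
    name = 'snapdeal_links'
    start_urls = ['http://www.snapdeal.com/']

    def parse(self, response):
        # Yield the href of every anchor on the homepage
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}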

I had been looking for a way to scrape links from web pages that are only fully rendered in an actual browser, but wanted it to run with a headless browser.

I was able to achieve this using PhantomJS, Selenium, and Beautiful Soup:

#!/usr/bin/python

import bs4
from selenium import webdriver

# PhantomJS is headless, so the page's JavaScript still gets executed
driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)

# Hand the fully rendered HTML over to Beautiful Soup
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')

links = [a.attrs.get('href') for a in soup.find_all('a')]
for path in links:
    print path

driver.quit()

The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.

Python 2

This is inspired by this answer.

from bs4 import BeautifulSoup
import urllib2

url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()

page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print l

Python 3

from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

# To open HTTPS URLs; note that a bare SSLContext skips certificate verification
gcontext = ssl.SSLContext()

# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()

page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print(l)
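
One caveat: an SSLContext created with no arguments does not verify certificates, which is why it opens HTTPS URLs that would otherwise fail. If you want verification kept, ssl.create_default_context() works as a drop-in for sites with valid certificates:

gcontext = ssl.create_default_context()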

Other Languages

For other languages, please see this answer.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow