Question

http://www.snapdeal.com/

I was trying to scrape all the links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.

under "See All categories" Tab you will find all major product categories. If you hover the mouse over any category it will expand the categories. I want those links from each major categories.

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l

But this gave me a different result from what I expected (I turned off JavaScript, looked at the page source, and the output matched that source).

I just want to find all the sub-links under each major category. Any suggestions would be appreciated.

Solution

This is happening because you are letting BeautifulSoup choose its own best parser, and you may not have lxml installed.

The best option is to use html.parser to parse the page.

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()

page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print l

This worked for me. Make sure the dependencies are installed.

Other tips

I think you should try another library, such as Selenium; it provides a web driver for you, and that is the advantage of this library. I myself couldn't handle JavaScript with bs4.
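
For illustration, here is a minimal sketch of that approach. It assumes an older Selenium release (with the find_elements_by_* API) and Firefox with its driver installed; it is my sketch of the suggestion, not code from the original answer.

from selenium import webdriver

# A real browser engine executes the JavaScript that builds the menu
driver = webdriver.Firefox()
driver.get('http://www.snapdeal.com/')

# After rendering, collect the href of every anchor on the page
for a in driver.find_elements_by_tag_name('a'):
    print a.get_attribute('href')

driver.quit()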

"Categories Menu" is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest). To examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can inspect the XHRs a website makes under the Network tab of the aforementioned tools.
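
Once you have identified the right request, you can call it directly. A sketch of the idea, where the endpoint below is a placeholder I made up, not Snapdeal's real XHR URL; substitute whatever you find in the Network tab:

import json
import urllib2

# Placeholder endpoint -- replace with the XHR URL from the Network tab
xhr_url = 'http://www.snapdeal.com/path/to/categories/xhr'
raw = urllib2.urlopen(xhr_url).read()

# XHR endpoints often return JSON, which is easier to walk than HTML
data = json.loads(raw)
print data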

Use a web-scraping tool such as Scrapy or mechanize. In mechanize, to get all the links on the Snapdeal homepage:

from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.text
    print link.url
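
And for completeness, a minimal Scrapy spider for the same job might look like the following (the spider name and item shape are my own choices, not from the original answer; run it with scrapy runspider):

import scrapy

class SnapdealLinkSpider(scrapy.Spider):
    name = 'snapdeal_links'
    start_urls = ['http://www.snapdeal.com/']

    def parse(self, response):
        # Yield the href of every anchor on the homepage
        for href in response.css('a::attr(href)').extract():
            yield {'link': href}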

I had been looking for a way to scrape links from web pages that are only fully rendered in an actual browser, but wanted it to run with a headless browser.

I was able to achieve this using PhantomJS, Selenium, and Beautiful Soup:

#!/usr/bin/python

import bs4
from selenium import webdriver

# PhantomJS is headless, so the page's JavaScript still gets executed
driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)

# Hand the fully rendered HTML over to Beautiful Soup
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')

links = [a.attrs.get('href') for a in soup.find_all('a')]
for path in links:
    print path

driver.quit()

The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.

Python 2

This is inspired by this answer.

from bs4 import BeautifulSoup
import urllib2

url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()

page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print l

Python 3

from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

# To open HTTPS URLs; note that a bare SSLContext skips certificate verification
gcontext = ssl.SSLContext()

# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()

page = BeautifulSoup(data, 'html.parser')

for link in page.findAll('a'):
    l = link.get('href')
    print(l)
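
One caveat: an SSLContext created with no arguments does not verify certificates, which is why it opens HTTPS URLs that would otherwise fail. If you want verification kept, ssl.create_default_context() works as a drop-in for sites with valid certificates:

gcontext = ssl.create_default_context()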

Other Languages

For other languages, please see this answer.

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow