Extracting data from Web

https://stackoverflow.com/questions/12332847

30-06-2021

Question

A real newbie question: I'm working on a small Python script for home use that will collect data about a specific air ticket.

I want to extract the data from Skyscanner (using BeautifulSoup and urllib). Example:

http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html

I'm interested in all the data stored in this kind of element, especially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png

Since these values are not present in the HTML source, can I still extract them?

Solution

I believe the problem is that these values are rendered by JavaScript code, which your browser runs and urllib doesn't. You should use a library that can execute JavaScript.
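To make the point concrete, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for what urllib would download (it is not Skyscanner's actual markup): the price container is empty in the source, and only a browser that runs the script would fill it in.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP client would receive it.
# The <div> is empty; the script would only fill it inside a browser.
STATIC_HTML = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "123";
  </script>
</body></html>
"""


class PriceFinder(HTMLParser):
    """Collects any text found inside <div id="price">."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.price_text += data


finder = PriceFinder()
finder.feed(STATIC_HTML)
# The parser never executes the script, so the container stays empty.
print(repr(finder.price_text))  # prints ''
```

Any parser that only reads the downloaded source, BeautifulSoup included, sees the same empty element.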

I just googled "crawler python javascript" and found some Stack Overflow questions and answers that recommend using Selenium or WebKit. You can use those libraries through Scrapy. Here are two snippets:

Rendered/interactive javascript with gtk/webkit/jswebkit

Rendered Javascript Crawler With Scrapy and Selenium RC

OTHER TIPS

I have been working on this exact issue. I was introduced to BeautifulSoup first and later learned about Scrapy. BeautifulSoup is very easy to use, especially if you're new to this. Scrapy apparently has more features, but I believe you can accomplish what you need with BeautifulSoup.

I had the same problem of not being able to get at information that a website loads through JavaScript, and thankfully Selenium was the savior.

A great introduction to Selenium can be found here.

Install: pip install selenium

Below is a simple class I put together. You can save it as a .py file and import it into your project. If you call the method retrieve_source_code(domain) on an instance and pass it the URL you are trying to parse, it returns the source code of the loaded page, which you can then pass to BeautifulSoup to find the information you're looking for!

Ex:

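airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'

scraper = SeleniumWebScraper()  # the method needs an instance, not the class itself
soup = BeautifulSoup(scraper.retrieve_source_code(airfare_url), 'html.parser')
scraper.close()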

Now you can parse soup like you normally would with BeautifulSoup.

I hope that helps you!

from selenium import webdriver

class SeleniumWebScraper():

    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # Implicit wait: element lookups poll for up to 10 seconds before failing.
        # This does not make driver.get() or page_source wait for AJAX content.
        self.driver.implicitly_wait(10)  # Seconds

    def close(self):
        # quit() ends the whole browser session; close() only closes the current window.
        self.driver.quit()
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1

    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0

    def retrieve_source_code(self, domain):
        if self.is_browser_closed:
            # Reopen the browser and restore the settings applied in __init__.
            self.driver = webdriver.Firefox()
            self.driver.implicitly_wait(10)
            self.is_browser_closed = 0
        # driver.get() navigates to the page at the given URL.
        #  WebDriver waits until the page has loaded (that is, the "onload" event has fired)
        #  before returning control to your script.
        # It's worth noting that if the page uses a lot of AJAX on load,
        #  WebDriver may not know when it has completely loaded.
        self.driver.get(domain)

        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
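Because of the AJAX caveat in the comments above, reading page_source right after driver.get() may return the page before the prices have arrived. Selenium ships its own explicit-wait helpers (WebDriverWait together with expected_conditions) for this. The dependency-free sketch below only illustrates the idea by polling page_source for a marker string. The function name wait_for_marker and the marker value are hypothetical, not part of the class above.

```python
import time


def wait_for_marker(driver, marker, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until `marker` appears or `timeout` seconds pass.

    `driver` can be any object with a `page_source` attribute, such as a
    Selenium WebDriver. Returns the page source once the marker is found,
    otherwise raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "marker %r not found within %s seconds" % (marker, timeout)
            )
        time.sleep(poll_interval)
```

With the class above, this could be called as wait_for_marker(scraper.driver, 'some-text-near-the-price') after retrieve_source_code(), where the marker is whatever text or attribute you see next to the price in the rendered page.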

You don't even need BeautifulSoup to extract the data if the response body is JSON.

Just do this (after import json) and the response is converted to a dictionary, which is very easy to handle.

text = json.loads("Your text of the main response content")

You can now print any key-value pair from the dictionary. Give it a try. It is super easy.
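As a rough illustration of that approach, the sketch below parses a made-up JSON string. The keys (quotes, carrier, price) and values are invented for the example; a real response will have its own structure, and this only works when the data actually arrives as JSON rather than HTML.

```python
import json

# Hypothetical response body; a real endpoint's keys and structure will differ.
response_text = (
    '{"quotes": ['
    '{"carrier": "Example Air", "price": 123.45}, '
    '{"carrier": "Sample Jet", "price": 98.0}'
    ']}'
)

# json.loads turns the JSON object into a regular Python dictionary.
data = json.loads(response_text)

# Pick the entry with the lowest price from the list of quotes.
cheapest = min(data["quotes"], key=lambda quote: quote["price"])
print(cheapest["carrier"], cheapest["price"])  # prints: Sample Jet 98.0
```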

Licensed under: CC-BY-SA with attribution
