Question

I'm just getting into scraping with ScraperWiki in Python. I've already figured out how to scrape tables from a page, run the scraper every month, and save the results on top of each other. Pretty cool.

Now I want to scrape this page with information on Android versions and run the script monthly. In particular, I want the table for the version, codename, API and distribution. It's not easy.

The table is loaded into a wrapper div. Is there any way to scrape this information? I can't find a solution.

Plan B is to scrape the visualisation. All I ultimately need is the codename and the percentage, so that would be sufficient. This information is in the page's HTML, inside a Google Chart script.

Google Chart API script

But I can't find this information with my 'souped' HTML. I have a public scraper over here. You can edit it to make it work.

Can anyone explain how I can approach this problem? A working scraper with comments on what's going on would be awesome.


Solution

This is a difficult case because, as kisamoto mentioned, the data is inside the embedded JavaScript and not in a separate JSON file as you would expect. It is possible with BeautifulSoup, but it involves some ugly string processing:

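import json

# `soup` is the BeautifulSoup object for the downloaded page.
# The <script> holding the data follows the last <p style="clear:both">.
last_paragraph = soup.find_all('p', style='clear:both')[-1]
script_tag = last_paragraph.next_sibling.next_sibling
script_text = script_tag.text

# Collect the lines up to (not including) the SCREEN_DATA definition.
lines = script_text.split('\n')
data_text = ''
for line in lines:
    if 'SCREEN_DATA' in line:
        break
    data_text = data_text + line

data_text = data_text.replace('var VERSION_DATA =', '')
# Delete the semicolon at the end
data_text = data_text.strip().rstrip(';')

data = json.loads(data_text)
data = data[0]
print(data['data'])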

Output:

[{u'perc': u'0.1', u'api': 4, u'name': u'Donut'}, ... ]
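The string handling above can be made a little less fragile. The following is only a sketch, assuming the JavaScript literal is valid JSON (which the json.loads call above implies). The helper name extract_js_json and the sample script are hypothetical; the sample only mimics the shape of the output shown above and is not the real page content.

```python
import json
import re


def extract_js_json(script_text, var_name):
    """Return the JSON value assigned to `var_name` in a block of JavaScript."""
    # Locate "var NAME =" and stop right where the value starts.
    match = re.search(r'var\s+' + re.escape(var_name) + r'\s*=\s*', script_text)
    if match is None:
        raise ValueError('variable %s not found' % var_name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the semicolon, the next variable, ...).
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Hypothetical sample, shaped like the output above; not the real page content.
sample_script = """
var VERSION_DATA =
[
  {
    "data": [
      {"api": 4, "name": "Donut", "perc": "0.1"}
    ]
  }
];

var SCREEN_DATA = [];
"""

version_data = extract_js_json(sample_script, 'VERSION_DATA')
rows = version_data[0]['data']
# Keep only what the question asks for: codename -> percentage.
codename_share = {row['name']: float(row['perc']) for row in rows}
```

Because raw_decode stops at the end of the first JSON value, there is no need to cut the text at SCREEN_DATA or strip the trailing semicolon by hand.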

OTHER TIPS

Because the data is stored in and rendered by JavaScript, a plain Python scraper cannot execute that code, so it never sees the visualisation or the table.
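To illustrate the point, here is a small sketch using only the standard library. The HTML snippet is a hypothetical stand-in for the downloaded page (the div id is made up): the wrapper div is empty because nothing has run the script, yet the script text itself is still there to be read.

```python
from html.parser import HTMLParser


class StaticPageInspector(HTMLParser):
    """Count <table> tags and collect the text of inline <script> tags."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.table_count += 1
        elif tag == 'script':
            self._in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical stand-in for the downloaded HTML: an empty wrapper div that
# JavaScript would fill in, plus the script that holds the data.
static_html = """
<div id="version-chart"></div>
<script>
var VERSION_DATA = [{"data": [{"api": 4, "name": "Donut", "perc": "0.1"}]}];
</script>
"""

inspector = StaticPageInspector()
inspector.feed(static_html)
inspector.close()
```

After parsing, inspector.table_count is 0 while inspector.scripts holds the script source, which is why the options are either to parse the script text or to let a real browser render the page.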

ScraperWiki is great. However, I've always found that if you're scraping a single page each month, a Python script plus cron works much better. And if you need the JavaScript to be executed, Selenium with its Python driver is a much more powerful solution.

Once you have Selenium installed, you can do roughly the following (in pseudocode):

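#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
try:
    # Load the page; the browser runs the JavaScript and fills in the DOM.
    browser.get("http://developer.android.com/about/dashboards/index.html")
    # Find the table
    table = browser.find_element(By.XPATH, "/html/body/div[3]/div[2]/div/div/div[2]/div/div/table")
    # Do something with the table element, e.g. print each row's text
    for row in table.find_elements(By.TAG_NAME, "tr"):
        print(row.text)
    # Save the data
finally:
    # Shut the browser down even if something above fails
    browser.quit()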
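The "Save the data" step is left open above. One possible sketch, assuming you want a growing history of monthly snapshots in a local SQLite file: the function name save_snapshot, the table name, the column layout, and the sample date are all hypothetical choices, not anything the page or Selenium prescribes.

```python
import datetime
import sqlite3


def save_snapshot(conn, rows, scraped_on=None):
    """Append one dated row per codename; returns the number of rows written."""
    scraped_on = scraped_on or datetime.date.today().isoformat()
    conn.execute(
        'CREATE TABLE IF NOT EXISTS android_versions '
        '(scraped_on TEXT, codename TEXT, percentage REAL, '
        'PRIMARY KEY (scraped_on, codename))'
    )
    # INSERT OR REPLACE keeps a rerun on the same day from duplicating rows.
    conn.executemany(
        'INSERT OR REPLACE INTO android_versions VALUES (?, ?, ?)',
        [(scraped_on, name, float(perc)) for name, perc in rows],
    )
    conn.commit()
    return len(rows)


# Hypothetical usage with an in-memory database and an arbitrary date; a cron
# job would pass sqlite3.connect('/path/to/history.db') instead.
conn = sqlite3.connect(':memory:')
written = save_snapshot(conn, [('Donut', '0.1')], scraped_on='2013-01-01')
stored = conn.execute('SELECT * FROM android_versions').fetchall()
```

Keying on the scrape date and codename means each monthly run adds a new snapshot while an accidental second run on the same day simply overwrites the first.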

Then just have a cron job running the script on the first day of the month like so:

0 0 1 * * /path/to/python_script.py
Licensed under: CC-BY-SA with attribution