Scrape a Google Chart script with Scraperwiki (Python)

Question 1

This is a difficult case because, as kisamoto mentioned, the data is inside the embedded JavaScript and not in a separate JSON file as you would expect. It is possible with BeautifulSoup, but it involves some ugly string processing:

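import json

# Assumes `soup` is a BeautifulSoup object built from the dashboards page.
# The <script> tag with the data follows the last <p style="clear:both">.
last_paragraph = soup.find_all('p', style='clear:both')[-1]
script_tag = last_paragraph.next_sibling.next_sibling
script_text = script_tag.text

# Collect every line up to the next variable (SCREEN_DATA)
lines = script_text.split('\n')
data_text = ''
for line in lines:
    if 'SCREEN_DATA' in line:
        break
    data_text = data_text + line

data_text = data_text.replace('var VERSION_DATA =', '')
# delete the semicolon (and any surrounding whitespace) at the end
data_text = data_text.strip().rstrip(';')

data = json.loads(data_text)
data = data[0]
print(data['data'])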

Output:

[{u'perc': u'0.1', u'api': 4, u'name': u'Donut'}, ... ]
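
The loop above relies on SCREEN_DATA being the next variable after VERSION_DATA. As a possible alternative, here is a minimal sketch that pulls a named variable out of the script text with a regular expression. The helper name `extract_js_array` and the sample script are made up for illustration, and the approach assumes the assigned value is valid JSON and that no string inside it contains `];`.

```python
import json
import re


def extract_js_array(script_text, var_name):
    """Return the JSON array assigned to `var_name` in a block of JavaScript.

    Hypothetical helper: assumes the statement looks like `var NAME = [...];`
    and that the array literal is valid JSON.
    """
    pattern = r'var\s+' + re.escape(var_name) + r'\s*=\s*(\[.*?\])\s*;'
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        raise ValueError('variable %s not found' % var_name)
    return json.loads(match.group(1))


# Made-up sample shaped like the output shown above, not the real page source
sample = '''
var VERSION_DATA =
[
  {
    "data": [
      {"api": 4, "name": "Donut", "perc": "0.1"}
    ]
  }
];
var SCREEN_DATA = [];
'''

versions = extract_js_array(sample, 'VERSION_DATA')
print(versions[0]['data'])
```

With the real page you would pass `script_tag.text` instead of `sample`.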

Question 2

As this is stored and rendered in JavaScript, a plain Python scraper cannot execute the code, so it never sees the rendered visualisation or table.

ScraperWiki is great. However, I've always found that if you're only scraping a single page each month, a Python script plus cron works much better. And if you need the JavaScript to be executed, Selenium with its Python driver is a much more powerful solution.

Once you have Selenium installed, you can do roughly the following (a rough sketch):

#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
try:
    # Load the page; the browser runs the JavaScript and builds the DOM for you.
    browser.get("http://developer.android.com/about/dashboards/index.html")
    # Find the table (this absolute XPath breaks if the page layout changes)
    table = browser.find_element(By.XPATH, "/html/body/div[3]/div[2]/div/div/div[2]/div/div/table")
    # Do something with the table element
    # Save the data
finally:
    # Shut the browser down even if a step above fails
    browser.quit()
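
The "save the data" step is left open above. One possibility, sketched here for Python 3 with only the standard library, is to serialise the rows as CSV. The helper name `rows_to_csv` is hypothetical, and the sample row and its `name`/`api`/`perc` fields are assumed from the output shown under Question 1. Rows read from the table cells would need to be turned into dicts of the same shape first.

```python
import csv
import io


def rows_to_csv(rows, fieldnames=('name', 'api', 'perc')):
    """Serialise a list of dicts to CSV text with a header row.

    Hypothetical helper; the default field names are an assumption
    based on the sample output, not something the page guarantees.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames),
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample row for illustration
rows = [{'name': 'Donut', 'api': 4, 'perc': '0.1'}]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing `csv_text` to a file whose name includes the current date would give you one snapshot per monthly run.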

Then just have a cron job running the script on the first day of the month like so:

0 0 1 * * /path/to/python_script.py