Question

Google's finance API is incomplete -- many of the figures on a page such as:

http://www.google.com/finance?fstype=ii&q=NYSE:GE

are not available via the API.

I need this data to rank companies on Canadian stock exchanges according to Greenblatt's formula, which can be found by searching Google for "greenblatt index scans".

My question: what is the most intelligent/clean/efficient way of accessing and processing the data on these webpages? Is the tedious approach really necessary in this case, and if so, what is the best way of going about it? I'm currently learning Python for projects related to this one.


Solution

You could try asking Google to provide the missing APIs. Otherwise, you're stuck with screen scraping, which is never fun, is prone to breaking without notice, and is likely in violation of Google's terms of service.

But if you still want to write a screen scraper, it's hard to beat a combination of mechanize and BeautifulSoup. BeautifulSoup is an HTML parser, and mechanize is a programmatic web browser for Python that will let you log in, store cookies, and generally navigate around like any other web browser.
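To make that concrete, here is a minimal sketch of the mechanize + BeautifulSoup combination. Everything about the page layout is an assumption: the table id "fs-table", the column headers, and the sample figures are made up for illustration and are not Google's actual markup, so you would need to inspect the real page and adjust. The helper names (fetch_html, to_number, parse_statement) are hypothetical too. The parsing step runs against an inline sample, so it works without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a financial-statement page.
# The id "fs-table" and the figures are invented for this sketch.
SAMPLE_HTML = """
<table id="fs-table">
  <tr><th>Item</th><th>2009</th><th>2008</th></tr>
  <tr><td>Revenue</td><td>1,200.50</td><td>1,100.00</td></tr>
  <tr><td>Operating Income</td><td>(35.20)</td><td>210.75</td></tr>
  <tr><td>Dividends</td><td>-</td><td>5.00</td></tr>
</table>
"""


def fetch_html(url):
    """Download a page with mechanize (not called in this sketch)."""
    # Imported here so the parsing code below runs without mechanize installed.
    import mechanize
    browser = mechanize.Browser()
    return browser.open(url).read()


def to_number(text):
    """Convert a cell such as '1,200.50', '(35.20)' or '-' to a float or None."""
    text = text.strip().replace(",", "")
    if text in ("", "-"):
        return None  # treat blanks and dashes as "not reported"
    if text.startswith("(") and text.endswith(")"):
        return -float(text[1:-1])  # accounting-style negative
    return float(text)


def parse_statement(html, table_id="fs-table"):
    """Return {row label: {period: value}} for the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError("table %r not found; the page layout may have changed" % table_id)
    rows = table.find_all("tr")
    # The first row holds the period headers; skip its label column.
    periods = [th.get_text(strip=True) for th in rows[0].find_all("th")][1:]
    figures = {}
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not cells:
            continue
        figures[cells[0]] = dict(zip(periods, (to_number(c) for c in cells[1:])))
    return figures


statement = parse_statement(SAMPLE_HTML)
print(statement["Revenue"])
```

For a live page you would pass the result of fetch_html(url) to parse_statement instead of SAMPLE_HTML. Raising an error when the table is missing is deliberate: scrapers break without notice, so it is better to fail loudly than to rank companies on silently empty data.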

OTHER TIPS

BeautifulSoup would be the preferred method of HTML parsing with Python.

Have you looked into options besides Google (e.g. Yahoo Finance API)?

Scraping web pages always sucks, but I would recommend converting them to XML (via tidy or some other HTML-to-XML program) and then using XPath to walk the nodes that you are interested in.
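A rough sketch of the XPath step, assuming the page has already been converted to well-formed XHTML (for example by tidy). The sample document, the table id "fs-table", and the function name extract_rows are all made up for illustration. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath; a library such as lxml would be needed for full XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML, standing in for the output of an HTML-to-XML converter.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="fs-table">
      <tr><td>Revenue</td><td>1200.50</td></tr>
      <tr><td>Operating Income</td><td>210.75</td></tr>
    </table>
  </body>
</html>"""

# XHTML elements live in this namespace, so every path step needs the prefix.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_rows(xml_text, table_id="fs-table"):
    """Return {label: value} for each two-cell row of the chosen table."""
    root = ET.fromstring(xml_text)
    # Select the rows of the table whose id attribute matches.
    path = ".//x:table[@id='%s']/x:tr" % table_id
    result = {}
    for row in root.findall(path, NS):
        cells = [(td.text or "").strip() for td in row.findall("x:td", NS)]
        if len(cells) >= 2:
            result[cells[0]] = float(cells[1])
    return result


rows = extract_rows(SAMPLE_XHTML)
print(rows)
```

The namespace mapping matters: without it, a path like ".//table" would match nothing in namespaced XHTML.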

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow