Question

I am trying to extract numeric data from a website. I tried using a simple web scraper to retrieve the data:

 from mechanize import Browser
 from bs4 import BeautifulSoup

 mech = Browser()
 url = "http://www.oanda.com/currency/live-exchange-rates/"
 page = mech.open(url)
 html = page.read()
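 # Name the parser explicitly so bs4 does not have to guess one
 soup = BeautifulSoup(html, "html.parser")

 # Look up the element that should hold the EUR/USD value
 data1 = soup.find(id='EUR_USD-b-int')

 print(data1)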

This kind of approach would normally return the matching line of HTML, including the contents of the element I am trying to extract. Here, however, it returns everything except the contents, which is the part I need. I have tried .contents, which returns [], and .child, which returns None. Does anyone know another method that could work? I have looked through the Beautiful Soup documentation but can't find a solution.


Solution

The value on this page is not in the static HTML. It is filled in by JavaScript, which makes a request to:

GET http://www.oanda.com/lfr/rates_lrrr?tstamp=1392757175089&lrrr_inverts=1
Referer: http://www.oanda.com/currency/live-exchange-rates/

(Be aware that I was blocked four times just looking at this; they are extremely block-happy, because they sell this data commercially as a subscription service.)
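As a rough sketch, the request above could be reproduced in Python 3 along these lines. The function name build_rates_request is made up for illustration, and treating tstamp as the current time in milliseconds is only a guess based on the 13-digit value shown above. The request is only constructed here, not sent, given how readily the site blocks clients.

```python
import time
import urllib.request

RATES_URL = "http://www.oanda.com/lfr/rates_lrrr"
REFERER = "http://www.oanda.com/currency/live-exchange-rates/"


def build_rates_request(now_ms=None):
    """Build (but do not send) the GET request the page's JavaScript makes.

    Assumption: tstamp is the current Unix time in milliseconds.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    url = "{}?tstamp={}&lrrr_inverts=1".format(RATES_URL, now_ms)
    # The Referer header mirrors the one seen in the browser request.
    return urllib.request.Request(url, headers={"Referer": REFERER})


# To actually fetch the (still encrypted) response, one would call:
#   raw = urllib.request.urlopen(build_rates_request()).read()
```

The body that comes back is not plain text, so it is of no use until it has been decrypted.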

The request is made and the response is parsed in http://www.oanda.com/jslib/wl/lrrr/liverates.js. The response is "encrypted" with RC4 (http://en.wikipedia.org/wiki/RC4).
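For reference, RC4 itself is short enough to write out. The following is a plain Python 3 sketch of the standard algorithm, not a port of the site's JavaScript. How the response bytes are encoded on the wire (hex, base64, or something else) is not covered here and would have to be checked against liverates.js.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Standard RC4. Encryption and decryption are the same operation."""
    # Key-scheduling algorithm: permute 0..255 according to the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation: XOR each input byte with the keystream.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


# Well-known test vector from the RC4 Wikipedia article:
#   rc4(b"Key", b"Plaintext").hex() == "bbf316e8d940af0ad3"
```

Because RC4 is symmetric, calling rc4 again with the same key on the ciphertext gives back the original bytes.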

The RC4 decrypt method comes from http://www.oanda.com/wandacache/rc4-ea63ca8c97e3cbcd75f72603d4e99df48eb46f66.js. This file appears to be refreshed often, so you'll need to grab the latest link from the homepage and extract the var key=<value> from it to fully decrypt the value.
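Finding the current script and pulling the key out of it could be sketched with two regular expressions. Both patterns are assumptions: the first assumes the file name stays of the form rc4-<hex digits>.js under /wandacache/, as in the link above, and the second assumes the key is assigned as a quoted string literal. The helper names are made up for illustration.

```python
import re

# Assumption: the script path keeps the "rc4-<hex>.js" shape seen above.
RC4_SCRIPT_RE = re.compile(r"/wandacache/rc4-[0-9a-f]+\.js")
# Assumption: the key is a single- or double-quoted string literal.
KEY_RE = re.compile(r"var\s+key\s*=\s*([\"'])(.*?)\1")


def find_rc4_script_path(page_html):
    """Return the path of the current rc4-*.js file, or None if absent."""
    match = RC4_SCRIPT_RE.search(page_html)
    return match.group(0) if match else None


def extract_key(js_source):
    """Return the value assigned to `var key`, or None if not found."""
    match = KEY_RE.search(js_source)
    return match.group(2) if match else None
```

If either function returns None, the page layout has probably changed and the patterns need to be adjusted by hand.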

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow