Scraping simple javascript page

https://stackoverflow.com/questions/10044256

29-05-2021

Question

I would like to scrape the data from this website ( http://www.oddsportal.com/matches/soccer ) to get a plain text file with the match info and the odds, like this:

00:30   Criciuma - Atletico-PR                    1:2   2.70    3.24    2.41    
10:45   Vier-und Marschlande - Concordia Hamburg  0:0   4.00    3.53    1.68    
10:45   Germania Schnelsen - ASV Bergedorf 85     2:3   1.95    3.37    3.23    
10:45   Barmbecker SG - Altona                    0:2   3.67    3.37    1.82

I used to do this with w3m, but the site now seems to render the page with JavaScript instead of serving static HTML, so w3m no longer works. All the data is contained in a single div. This is one entry:

<tr xeid="862487"><td class="table-time datet t1333724400-1-1-0-0 ">17:00</td><td class="name table-participant" colspan="2"><a href="/soccer/italy/serie-b-2011-2012/brescia-marmi-lanza-verona-862487/">Brescia - Verona</a></td><td class="odds-nowrp" xoid="40456791" xodd="xzc0fxzxa">-</td><td class="odds-nowrp" xoid="40456793" xodd="cz0ofxz9c">-</td><td class="odds-nowrp" xoid="40456792" xodd="cz9xfcztx">-</td><td class="center info-value">17</td></tr>

What can I do?

Solution

The easiest way (though maybe not the best) is to drive a real browser with Selenium or Watir. In Ruby I would do:

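require 'watir'  # the watir-webdriver gem was merged into watir
require 'csv'

# Open a real browser so the page's JavaScript runs and fills in the table
browser = Watir::Browser.new
browser.goto 'http://www.oddsportal.com/matches/soccer/'

# Write one CSV line per matching table row, one field per cell
CSV.open('out.csv', 'w') do |out|
  browser.trs(class: /deactivate/).each do |tr|
    out << tr.tds.map(&:text)
  end
end

browser.close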
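Since the question asks for aligned plain text rather than CSV, the cell texts collected above could be laid out in fixed-width columns. This is only a minimal sketch: `format_row` is a hypothetical helper, and the column order (time, match, score, then the odds) and the widths are assumptions read off the sample output in the question, not something the page guarantees.

```ruby
# Hypothetical helper: lay out one row of cell strings as a fixed-width line.
# Assumed column order: time, match, score, then any number of odds values.
# Widths (8, 42, 6, 8) are taken from the sample output in the question.
def format_row(cells, match_width: 42)
  time, match, score, *odds = cells.map { |c| c.to_s.strip }
  line = format("%-8s%-#{match_width}s%-6s", time, match, score)
  line += odds.map { |o| format('%-8s', o) }.join
  line.rstrip # drop the padding after the last column
end

puts format_row(['00:30', 'Criciuma - Atletico-PR', '1:2', '2.70', '3.24', '2.41'])
```

Inside the loop above, writing `format_row(tr.tds.map(&:text))` to a text file instead of appending to the CSV would give the plain-text layout, provided the cells really arrive in the assumed order.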

OTHER TIPS

If they use JavaScript to fetch data from a service and render it inside the div, w3m will not show the populated div, because w3m does not support JavaScript.
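One way to confirm that this is what is happening is to count the match rows in the HTML as the server delivers it, before any script runs. The sketch below is an assumption-laden check, not a definitive test: `static_row_count` and `fetch_raw` are hypothetical helper names, rows are recognised only by the `xeid` attribute seen in the question's sample entry, and `Net::HTTP.get` does not follow redirects.

```ruby
require 'net/http'
require 'uri'

# Count match rows present in the HTML as served, with no JavaScript executed.
# A row is recognised by the xeid attribute from the question's sample entry.
def static_row_count(html)
  html.scan(/<tr\s[^>]*\bxeid="\d+"/).length
end

# Fetch the raw HTML roughly the way a non-JavaScript client such as w3m sees it.
def fetch_raw(url)
  Net::HTTP.get(URI(url))
end

# Example (performs a network request, so it is left commented out):
# puts static_row_count(fetch_raw('http://www.oddsportal.com/matches/soccer/'))
```

A count of zero for a page that shows rows in a normal browser suggests the table is filled in by JavaScript after loading.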

You have two choices:

1. Reverse-engineer their JavaScript to find out where the data comes from, and see whether you can query that data source directly to get the XML or JSON they use to update the div. Then you can skip the HTML scraping entirely. They may not want you doing that, however, and may have secured the data source to prevent it.
2. Use a browser that executes JavaScript before you start scraping, so that the div is already populated with the data. w3m-js might do this for you; otherwise you will need some other browser that runs JavaScript (lynx, for one, does not). This question seems to be related.
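The first option could look roughly like the sketch below. Everything specific in it is hypothetical: `FEED_URL` is a placeholder rather than a real endpoint, and the JSON keys (`time`, `home`, `away`, `odds`) are an invented shape for illustration. The real request and format have to be read from the browser's network inspector, and the `xodd` values in the question's sample row suggest the odds may arrive encoded rather than as plain numbers.

```ruby
require 'json'
require 'net/http'
require 'uri'

# HYPOTHETICAL placeholder, not a real API. Replace it with the request the
# page is actually seen making in the browser's network inspector.
FEED_URL = 'http://example.com/hypothetical/matches.json'

# Turn the feed's JSON text into rows of plain values.
# The keys used here are an assumed shape, chosen only for illustration.
def rows_from_feed(json_text)
  JSON.parse(json_text).map do |m|
    [m['time'], "#{m['home']} - #{m['away']}", *m['odds']]
  end
end

# Download the feed body as a string (no redirects followed, no auth handled).
def fetch_feed(url = FEED_URL)
  Net::HTTP.get(URI(url))
end

# Example with inline data, so no network access is needed:
sample = '[{"time":"17:00","home":"Brescia","away":"Verona","odds":[2.1,3.0,3.5]}]'
rows_from_feed(sample).each { |row| puts row.join("\t") }
```

Keeping the parsing separate from the download means the parsing can be checked against a saved response before any live requests are made.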

ETA: Maybe PhantomJS would help here?

Licensed under: CC-BY-SA with attribution
