Question

I want to scrape a table from a website. The table's HTML looks like this:

<table class="table table-hover data-table sort display">
        <thead>
          <tr>
            <th class="Column1">
            </th>
            <th class="Column2">
            </th>
          </tr>
        </thead>
        <tbody>
          <tr ng-repeat="item in filteredList | orderBy:columnToOrder:reverse">
            <td>{{item.Col1}}</td>
            <td>{{item.Col2}}</td>
          </tr>
        </tbody>
</table>

It seems that this website is built with a JavaScript framework that retrieves the table content from the backend through web services.

The problem is: how can the table data be scraped when the cells do not contain the actual values? In the code above, the content is enclosed in {{ }}. Does this make the website unscrapable? Is there any solution? Thank you.

I am using python and beautifulsoup4.

Solution

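When the content is generated by JavaScript, BeautifulSoup on its own is usually not the right tool, because it only parses the HTML it is given and does not execute scripts. I use Selenium. Try this and see whether the HTML you get back is scrapable: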

import time

from selenium import webdriver

url = "https://example.com/"  # placeholder: replace with the page you want to scrape

driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the JavaScript to load the data

# now print the rendered HTML
print(driver.page_source)

At this point, you can use BeautifulSoup to scrape the data out of driver.page_source. Note: you will need to install selenium and Firefox.
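
As a rough sketch of that last step, the snippet below pulls the cell text out of the rendered table with BeautifulSoup. The `rendered_html` string is only a stand-in for driver.page_source, and its cell values are invented for illustration; the `extract_rows` helper is a hypothetical name, and the `data-table` class is taken from the table in the question.

```python
from bs4 import BeautifulSoup


def extract_rows(html):
    """Return the text of every body cell in the data table, one list per row."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any one of the table's space-separated classes
    table = soup.find("table", class_="data-table")
    if table is None or table.tbody is None:
        return []
    rows = []
    for tr in table.tbody.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


# Stand-in for driver.page_source; the values are made up for this example.
rendered_html = """
<table class="table table-hover data-table sort display">
  <tbody>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </tbody>
</table>
"""

print(extract_rows(rendered_html))
```

With a live browser session you would call extract_rows(driver.page_source) instead. If the cells still show {{item.Col1}}, the page probably had not finished rendering when the source was read.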

OTHER TIPS

You could try using import.io (https://import.io) - our connectors, extractors and crawlers all support getting data from pages that are rendered with JavaScript. Without a specific URL I can't verify that yours will work for certain, but I don't see why it wouldn't (it looks like it is being rendered by AngularJS, which should be fine).

p.s. if you hadn't figured it out, I work at import.io - drop me a line if you have specific questions.

What you could do is open the site in Chrome, open the developer tools and go to the 'Network' tab. Tick 'Preserve log' at the top, then reload the site so that all the requests show up in the log. You will now see where the data for 'filteredList' on your page comes from, so your scraper can request that source directly. The data is most likely in JSON format, which can be picked up and fiddled with to your heart's content.
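
A minimal sketch of that approach, using only the Python standard library. It assumes the endpoint returns a JSON array of objects with `Col1` and `Col2` keys (guessed from the {{item.Col1}} template in the question); the real URL and payload shape have to be read off the Network tab. `fetch_json` and `rows_from_items` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_json(endpoint_url):
    """Download and decode a JSON document from the endpoint found in the Network tab."""
    request = urllib.request.Request(endpoint_url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def rows_from_items(items):
    """Turn a list of item dicts into (Col1, Col2) tuples, skipping incomplete items."""
    return [
        (item["Col1"], item["Col2"])
        for item in items
        if "Col1" in item and "Col2" in item
    ]


# Offline demonstration with an invented payload; a real run would use
# rows_from_items(fetch_json(endpoint_url)) with the URL seen in the Network tab.
sample_payload = '[{"Col1": "alpha", "Col2": 1}, {"Col1": "beta", "Col2": 2}]'
print(rows_from_items(json.loads(sample_payload)))
```

Requesting the endpoint directly avoids starting a browser at all, though some services also expect cookies or extra headers that the Network tab will show.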

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow