Question

I am trying to automate data extraction from a website and I really don't know where to start. One of our suppliers gives us access to some equipment logging data through a "Business Objects 11" online application. If you are not familiar with this app, think of it as a web-based report generator. The problem is that I am monitoring a lot of equipment, and the supplier has only created a request that extracts one log at a time, taking an equipment number, a start date, and an end date.

To make matters worse, we can only export to the binary Excel format, since the CSV export is broken and they refuse to fix it, so we are limited by Excel's 65,536-row limit (which amounts to 3-4 days of data recording in my case). I can't create a new request myself, as only the supplier has the necessary admin rights.

What do you think would be the most elegant way of running a lot of requests (around 800) through a web GUI? I suppose I could hard-code mouse positions, click events, and keystrokes with delays between them, but there has to be a better way.

I have read about AutoHotkey and AutoIt scripting, but they seem limited in what they can do on the web. Also, I am stuck with IE6, but if you know a way that involves another browser, I am still very interested in your answer.

(Once I have the log files locally, extracting the data is not a problem.)


Solution

There are some things you might try. If the site is plain HTML and the reports can be requested with a simple GET or POST, then the urllib/urllib2 and cookielib Python modules should be enough to fetch an Excel document.
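For example, here is a minimal sketch, assuming the report can be fetched with a single POST; the URL and form-field names are hypothetical and would have to be taken from the actual request the browser sends (e.g. captured with an HTTP proxy):

import cookielib
import urllib
import urllib2

# Hypothetical report URL -- inspect the real request your browser
# sends and substitute the actual URL and field names.
REPORT_URL = 'http://supplier.example.com/BusinessObjects/report'

# A cookie jar preserves the session cookie across requests.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

def fetch_log(equipment_id, start_date, end_date, out_path):
    # Encode the report parameters as a POST body.
    params = urllib.urlencode({
        'equipment': equipment_id,
        'start': start_date,
        'end': end_date,
        'format': 'xls',
    })
    response = opener.open(REPORT_URL, params)
    out = open(out_path, 'wb')
    out.write(response.read())
    out.close()

# One call per log -- loop this over all ~800 equipment/date combinations.
fetch_log('EQ-0042', '2009-01-01', '2009-01-04', 'EQ-0042.xls')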

Then you can use xlrd to extract the data from the Excel files.
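A minimal xlrd sketch, assuming the first row of the report is a header row (adjust to the real layout):

import xlrd

# Open the workbook downloaded above.
book = xlrd.open_workbook('EQ-0042.xls')
sheet = book.sheet_by_index(0)

# Skip the assumed header row and walk the data rows.
for row_idx in range(1, sheet.nrows):
    print sheet.row_values(row_idx)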

Also, take a look at PAMIE (http://pamie.sourceforge.net/), a Python module that automates Internet Explorer. I have never tried it myself, but it looks promising and easy to use.
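I can't vouch for PAMIE's exact method names (they changed between versions), but the underlying approach it wraps is driving IE through its COM automation interface, which you can also do directly with pywin32. A minimal sketch, with hypothetical element IDs:

import time
import win32com.client

# Start an Internet Explorer instance through COM -- this is the
# mechanism PAMIE builds on.
ie = win32com.client.Dispatch('InternetExplorer.Application')
ie.Visible = True
ie.Navigate('http://supplier.example.com/BusinessObjects/')

# Wait for the page to finish loading (4 == READYSTATE_COMPLETE).
while ie.Busy or ie.ReadyState != 4:
    time.sleep(0.5)

# Fill in the report form through the HTML DOM; the element IDs
# here are hypothetical.
doc = ie.Document
doc.getElementById('equipment').value = 'EQ-0042'
doc.getElementById('start').value = '2009-01-01'
doc.getElementById('end').value = '2009-01-04'
doc.getElementById('runReport').click()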

OTHER TIPS

Normally, I would suggest not using IE (or any browser) at all. Remember, a web browser is just a program that makes HTTP requests and displays the results in meaningful ways. There are other ways to make the same HTTP requests and process the responses; almost every modern language has this built into its API somewhere. This approach is called screen scraping or web scraping.

But to complete this suggestion I need to know more about your programming environment: in which programming language do you envision writing this script?

A typical example in C#, where you just get the HTML result as a string, would look like this:

string html = new System.Net.WebClient().DownloadString("http://example.com");

You then parse the string to find the fields you need and send the next request. The WebClient class also has a .DownloadFile() method that you might find useful for retrieving the Excel files.

Since you can use .NET, you should consider using the Windows Forms WebBrowser control. You can automate it to navigate to the site, press buttons, etc. Once the report page is loaded, you can use code to navigate the HTML DOM to find the data you want - no regular expressions involved.

I did something like this years ago, to extract auction data from eBay.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow