Question

I want to write a little program in C# that presents some data from a website in different ways: with system tray notifications, different views, and so on...

The data I need is shown in the browser as normal text and can be copied and pasted into an editor. When I use tools like wget I am able to download the HTML source of the website, but I noticed that the data I need is generated by JavaScript (and AJAX?) and is not in the source.

Is there a way to download the real, rendered content of a website from a script, the command line, C#, Java, or similar? Some kind of JavaScript interpreter that resolves the data and gives me the website as text output?

Any other ideas how I could extract the data?

thanks

Edit 2:

Problem solved. See answer.


Solution 2

At last... I made a PhantomJS script that does exactly what I need.

It logs in to a site and then executes the JavaScript that reveals the content.

Additionally, I added a command that takes a screenshot of the website, to make debugging easier.

Thanks to RolandKrüger and remy, who helped me get to a solution.

You may have to change the script a little bit, but I think it can help ;)

var page = require('webpage').create();

// Forward console messages from inside the page to the PhantomJS console.
page.onConsoleMessage = function(msg) {
    console.log(msg);
};

page.open("http://www.somewebsite.com", function(status) {
    if (status === "success") {
        // Fill in the login form and submit it.
        page.evaluate(function() {
            document.querySelector("input[name='MAIL_ADDRESS']").value = "mymail@gmail.com";
            document.querySelector("input[name='PASSWORD']").value = "mypassword";
            document.getElementsByName("LOGIN_FORM_SUBMIT")[0].click();
            console.log("Login submitted!");
        });
        // Give the site a few seconds to log in and render, then take a
        // screenshot (for debugging) and extract the data as text.
        window.setTimeout(function() {
            page.render('screenshot.png');
            var ua = page.evaluate(function() {
                return document.getElementById('AnElementIdOnMyWebsite').innerText;
            });
            console.log(ua);
            phantom.exit();
        }, 5000);
    }
});
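The fixed 5-second timeout is fragile: if the login takes longer, the element will not exist yet, and if it is faster, you wait needlessly. A small polling helper can wait for a condition instead; this is a plain-JavaScript sketch (the names `waitFor`, `check`, `onReady`, and `onTimeout` are my own, not part of the PhantomJS API):

```javascript
// Poll `check` every `interval` ms; call `onReady` once it returns true,
// or `onTimeout` if `maxWait` ms pass without it becoming true.
function waitFor(check, onReady, onTimeout, interval, maxWait) {
    var waited = 0;
    (function poll() {
        if (check()) {
            onReady();
        } else if (waited >= maxWait) {
            onTimeout();
        } else {
            waited += interval;
            setTimeout(poll, interval);
        }
    })();
}
```

Inside the PhantomJS script, the check could be something like `page.evaluate(function () { return !!document.getElementById('AnElementIdOnMyWebsite'); })`, and `onReady` would then take the screenshot and extract the text.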

OTHER TIPS

WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab shows all the information about every request and response.

There you can filter the requests down to XHR — these are the requests made by JavaScript code.

Tip: the log is cleared every time you load a page; the black-dot button next to the clear button will preserve the log across page loads.

After analyzing the requests and responses you can simulate those requests from your web crawler and extract the valuable data. In many cases this is easier than parsing the HTML, because that data contains no presentation logic and is formatted to be accessed by JavaScript code.
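For example, if the Network tab shows the page fetching its data from some endpoint whose response is JSON, you can skip HTML parsing entirely and read the fields directly. A plain-JavaScript sketch, assuming a made-up response shape (in a real crawler the body would come from an HTTP request rather than a literal):

```javascript
// Hypothetical JSON body, standing in for what the XHR endpoint returns.
var responseBody = JSON.stringify({
    items: [
        { name: "first",  value: 42 },
        { name: "second", value: 7 }
    ]
});

// Parse the response and pull out just the data we care about --
// there is no presentation markup to strip away.
function extractValues(body) {
    var data = JSON.parse(body);
    return data.items.map(function (item) {
        return item.name + "=" + item.value;
    });
}

console.log(extractValues(responseBody)); // [ 'first=42', 'second=7' ]
```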

Firefox has a similar extension called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of the WebKit tools.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow