Question

There is a reports website whose content I want to parse in C#. I tried downloading the HTML with WebClient, but I don't get the complete source, since most of it is generated by JavaScript when the page is loaded in a browser.

I tried using WebBrowser but couldn't get it to work in a console app, even after using Application.Run() and SetApartmentState(ApartmentState.STA).

Is there another way to access this generated html? I also took a look into mshtml but couldn't figure it out.

Thanks

Solution

The JavaScript is executed by the browser, not by the server. If your console app receives the JS source, the download is working as expected. What you actually need is for your console app to execute the downloaded JS, so that the markup it generates exists.

OTHER TIPS

You can use a headless browser - XBrowser may serve.

If not, try HtmlUnit as described in this blog post.

Just a comment here. There shouldn't be any difference between an HTTP request made from C# code and one made by a browser. If the target page is not generating the correct markup because it can't make heads or tails of the type of browser it thinks it's serving, then maybe all you have to do is set the user agent like so:

((HttpWebRequest)myWebClientRequest).UserAgent = "<a valid user agent>";

For example, my current user agent is:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1

Maybe once you do that the page will work correctly. There may be other factors at work here, such as the referrer and so on, but I would try this first and see if it works.
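As a minimal sketch of the same idea using WebClient directly: the helper below sets the User-Agent header through the client's Headers collection. The class name UserAgentClient and the method Create are hypothetical names chosen for illustration. Note that this only changes what the server sees; it will not run any JavaScript on the downloaded page.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration; not part of any library.
static class UserAgentClient
{
    // The user agent string quoted in the answer above.
    public const string FirefoxUserAgent =
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1";

    // Returns a WebClient whose next request carries the given User-Agent header.
    public static WebClient Create(string userAgent)
    {
        var client = new WebClient();
        client.Headers[HttpRequestHeader.UserAgent] = userAgent;
        return client;
    }
}
```

A caller could then write something like `string html = UserAgentClient.Create(UserAgentClient.FirefoxUserAgent).DownloadString(url);` where `url` is the address of the reports page.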

Your best bet is to abandon the console app route and build a Windows Forms application. In that case the WebBrowser control will work without any extra setup, such as a manually created STA thread.
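A rough sketch of what that Windows Forms route could look like, assuming a project that references System.Windows.Forms. The URL is a placeholder, and reading the root element's OuterHtml is one way to get the live DOM rather than the originally downloaded source. Scripts that keep modifying the page after the load event may not have finished when DocumentCompleted fires, so a delay or a check for a known element might still be needed.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread] // WebBrowser requires a single-threaded apartment
    static void Main()
    {
        var browser = new WebBrowser { Dock = DockStyle.Fill, ScriptErrorsSuppressed = true };
        var form = new Form { Text = "Report loader" };
        form.Controls.Add(browser);

        browser.DocumentCompleted += (sender, e) =>
        {
            // The event also fires for frames; wait until the whole page is done.
            if (browser.ReadyState != WebBrowserReadyState.Complete) return;

            // Serialize the current DOM, including script-generated markup.
            HtmlElementCollection roots = browser.Document.GetElementsByTagName("html");
            if (roots.Count == 0) return;
            string renderedHtml = roots[0].OuterHtml;

            // Placeholder: hand renderedHtml to your parser here.
            Console.WriteLine(renderedHtml.Length);
        };

        // Placeholder URL; replace with the reports page.
        form.Load += (sender, e) => browser.Navigate("http://example.com/reports");
        Application.Run(form);
    }
}
```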

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow