Question

I'm trying to use a headless browser for crawling, in order to add SEO features to an open source project I'm developing.

The project sample site is deployed via Azure Websites.

I have tried several approaches, including Selenium for .NET (PhantomJSDriver, HTMLUnitDriver, ...) and even the standalone PhantomJS .exe file.

I'm using a headless browser because the site is built on DurandalJS, so the crawler has to execute scripts and wait for a condition to become true before it can return the generated HTML. For this reason I can't use things like the WebClient/WebResponse classes or HtmlAgilityPack, which work just fine for non-JavaScript sites.
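
To make the "wait for a condition" requirement concrete, here is a minimal polling sketch in JavaScript. The names `pollStep` and `waitFor`, their parameters, and the timings in the usage comment are assumptions for illustration only, not part of the project. It relies only on `setInterval`/`clearInterval` and `Date.now()`.

```javascript
// Hypothetical helper: decide the outcome of one polling step.
// Returns 'ready', 'timeout', or 'pending'. Pure, so it is easy to test.
function pollStep(check, startedAt, now, timeoutMs) {
    var ok = false;
    try {
        ok = !!check();
    } catch (e) {
        ok = false; // a throwing check counts as "not ready yet"
    }
    if (ok) {
        return 'ready';
    }
    return (now - startedAt >= timeoutMs) ? 'timeout' : 'pending';
}

// Hypothetical helper: call `check` every `intervalMs` milliseconds until it
// returns true (then call onReady) or `timeoutMs` has elapsed (then call onTimeout).
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
    var startedAt = Date.now();
    var timer = setInterval(function () {
        var state = pollStep(check, startedAt, Date.now(), timeoutMs);
        if (state === 'pending') {
            return;
        }
        clearInterval(timer);
        if (state === 'ready') {
            onReady();
        } else {
            onTimeout();
        }
    }, intervalMs);
}

// Possible use inside a PhantomJS script (assumes `page` is already open and
// that the site adds a #compositionComplete element when rendering is done):
//
// waitFor(function () {
//     return page.evaluate(function () {
//         return document.querySelector('#compositionComplete') !== null;
//     });
// }, doRender, function () { phantom.exit(1); }, 10000, 100);
```

Keeping the decision logic in `pollStep` separate from the timer means the ready/timeout behaviour can be checked without a browser.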

All of the above methods work in my local devbox environment, but the problem appears when I upload the site to Azure Websites. With standalone PhantomJS, the site freezes when I access the URL endpoint and after a while returns an HTTP 502 error. With Selenium WebDriver I get:

OpenQA.Selenium.WebDriverException: Unexpected error. System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it 127.0.0.1:XXXX

I think the problem is with running .exe files in Azure and not with the code. I know it's possible to run .exe files in Azure Cloud Services via web/worker roles, but I need to stay on Azure Websites to keep things simple.

Is it possible to run a headless browser in Azure Websites? Does anyone have experience with this kind of situation?

My code for the standalone PhantomJS solution is:

//ASP MVC ActionResult

public ActionResult GetHTML(string url)
{
    string appRoot = Server.MapPath("~/");

    var startInfo = new ProcessStartInfo
    {
        Arguments = String.Format("{0} {1}", Path.Combine(appRoot, "Scripts\\seo\\renderHTML.js"), url),
        FileName = Path.Combine(appRoot, "bin\\phantomjs.exe"),
        UseShellExecute = false,
        CreateNoWindow = true,
        RedirectStandardOutput = true,
        RedirectStandardError = true,
        RedirectStandardInput = true,
        StandardOutputEncoding = System.Text.Encoding.UTF8
    };
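    // Dispose the process handle once the output has been read.
    using (var p = new Process { StartInfo = startInfo })
    {
        p.Start();
        // stderr is redirected, so drain it asynchronously; otherwise a full
        // stderr buffer could block phantomjs.exe while we wait on stdout.
        p.BeginErrorReadLine();
        string output = p.StandardOutput.ReadToEnd();
        p.WaitForExit();
        ViewData["result"] = output;
    }
    return View();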
}

// PhantomJS script

var resourceWait = 300,
    maxRenderWait = 10000;

var page = require('webpage').create(),
    system = require('system'),
    count = 0,
    forcedRenderTimeout,
    renderTimeout;

page.viewportSize = { width: 1280, height: 1024 };

function doRender() {
    console.log(page.content);
    phantom.exit();
}

page.onResourceRequested = function (req) {
    count += 1;
    //console.log('> ' + req.id + ' - ' + req.url);
    clearTimeout(renderTimeout);
};

page.onResourceReceived = function (res) {
    if (!res.stage || res.stage === 'end') {
        count -= 1;
        //console.log(res.id + ' ' + res.status + ' - ' + res.url);
        if (count === 0) {
            renderTimeout = setTimeout(doRender, resourceWait);
        }
    }
};

page.open(system.args[1], function (status) {
    if (status !== "success") {
        //console.log('Unable to load url');
        phantom.exit();
    } else {
        forcedRenderTimeout = setTimeout(function () {
            //console.log(count);
            doRender();
        }, maxRenderWait);
    }
});

and for the Selenium option

public ActionResult GetHTML(string url)
{
    using (IWebDriver driver = new PhantomJSDriver())
    {
        driver.Navigate().GoToUrl(url);

        WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));

        IWebElement myDynamicElement = wait.Until<IWebElement>((d) =>
        {
            return d.FindElement(By.CssSelector("#compositionComplete"));
        });

        var content = driver.PageSource;

        driver.Quit();

        return Content(content);
    }                      
}
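
As a side note on the PhantomJS script above, its idle detection (count in-flight requests and render once the count has dropped back to zero) can be isolated in a small object, which makes it testable outside PhantomJS. This is only a sketch: `createIdleTracker` and its method names are made up for illustration, and clamping the counter at zero is an assumption that guards against an 'end' event arriving without a matching request.

```javascript
// Hypothetical helper mirroring the request counting in the PhantomJS script.
function createIdleTracker() {
    var pending = 0;
    return {
        // Call from page.onResourceRequested.
        requested: function () {
            pending += 1;
        },
        // Call from page.onResourceReceived. Only a response without a stage,
        // or with stage 'end', completes a request. Returns true when no
        // requests are left in flight.
        received: function (res) {
            if (!res.stage || res.stage === 'end') {
                pending = Math.max(0, pending - 1); // assumption: never go negative
            }
            return pending === 0;
        },
        // True when no requests are currently in flight.
        isIdle: function () {
            return pending === 0;
        }
    };
}
```

In the script, `page.onResourceReceived` could then schedule `doRender` with `setTimeout(doRender, resourceWait)` whenever `received(res)` returns true, while `page.onResourceRequested` keeps calling `clearTimeout(renderTimeout)` as before.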

Thanks!!


Solution

You cannot execute .exe files in the shared website environment. You either have to use Cloud Services (web/worker roles) or set up a proper (Azure) virtual machine.

The free shared website service is really basic and won't cut it when you need more advanced functionality.

See this question and its accepted answer for a more detailed explanation: Can we run windowservice or EXE in Azure website or in Virtual Machine?

OTHER TIPS

I am not sure about the shared and basic website environments, but I can successfully run ffmpeg.exe from a standard website environment. Even so, PhantomJS and even chromedriver itself do not work there. However, I was able to run the Firefox driver successfully. To do that, I copied the latest Firefox directory from my local machine to the website, and the code below worked well.

var binary = new FirefoxBinary("/websitefolder/blabla/firefox.exe");
var driver = new FirefoxDriver(binary, new FirefoxProfile());
driver.Navigate().GoToUrl("http://www.google.com");
Licensed under: CC-BY-SA with attribution