Mismatched results between browser and code when retrieving a page with QNetworkRequest and QNetworkAccessManager

StackOverflow https://stackoverflow.com/questions/12287310

Question

I'm writing a simple web spider. The idea is to fetch a page programmatically using QNetworkAccessManager, QNetworkReply and QNetworkRequest, and in general everything works fine.

The problem I encounter is that, for some pages, I get different results when fetching programmatically than when visiting the page "manually" with a browser. I always get syntactically correct HTML pages, but they look to me like some sort of "spider protection" response. The pages I'm referring to AREN'T POST pages; my tests are with very simple URLs, sometimes with parameters (e.g. www.sample.com/index.php?param=something), sometimes even plain page.html URLs.

The pseudocode is as follows:

QNetworkRequest req;
req.setUrl(QUrl(myurl));
// req.setRawHeader(...);  // I did try this one with no success
QNetworkAccessManager man;           // must outlive the reply
QNetworkReply *rep = man.get(req);   // get() returns a pointer
// finished() and error() slot connections go here

. . .

void replyFinished()
{
    QNetworkReply *rep = qobject_cast<QNetworkReply *>(sender());
    if (rep->error() == QNetworkReply::NoError)
    {
        // read data from the QNetworkReply here
        QByteArray bytes = rep->readAll();
        QString stringa = QString::fromUtf8(bytes);
        qDebug() << stringa;
    }
    rep->deleteLater();  // the reply is not deleted automatically
}

In the finished() slot I print the data from the QNetworkReply, and sometimes it does not match what a simple "View Source" in the browser shows when I visit the same URL by hand.

Sometimes I get a custom "Not found" page, sometimes weirder pages with login forms or other unexpected content. Maybe it's some kind of spider protection? Can anyone help?

Was it helpful?

Solution

There are four main methods websites use to protect themselves from web spiders:

  • Web browser identification - the website uses the request headers to tell browsers and web crawlers apart. You write that you tried raw headers: are you sure you provide the same headers and values your browser does?
  • Session data/cookies - closely related to the previous point. The login forms suggest that the website expects information the browser would normally send, such as a session cookie.
  • JavaScript writing the actual HTML into the document - are you comparing against the raw source of the website in your browser (View → Source), or against the rendered HTML in a tool like Firebug?
  • JavaScript redirecting - the browser downloads a page whose JavaScript redirects it to the page with the actual content.

As far as the first two options go, you should use a TCP/IP sniffer like SmartSniff to check whether the data sent by your browser is equal to the data sent by your program. If it is equal, you are probably hitting some sort of JavaScript barrier. If so, you might try a JavaScript-capable browsing engine like QWebPage. I don't know whether it executes the page's JavaScript when not attached to any QWebView, though; a hidden view might be necessary.
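A sketch of that approach, assuming the Qt 4 QtWebKit module and assuming QWebPage does run scripts without a visible view (as noted above, that may need to be verified; if not, attach the page to a hidden QWebView). The class name and URL are placeholders:

```cpp
#include <QApplication>
#include <QWebPage>
#include <QWebFrame>
#include <QUrl>
#include <QDebug>

// Load a page through QtWebKit so its JavaScript can run,
// then read the resulting (post-script) HTML.
class PageGrabber : public QObject
{
    Q_OBJECT
public:
    explicit PageGrabber(QObject *parent = 0) : QObject(parent)
    {
        connect(&m_page, SIGNAL(loadFinished(bool)),
                this, SLOT(onLoadFinished(bool)));
    }
    void fetch(const QUrl &url) { m_page.mainFrame()->load(url); }

private slots:
    void onLoadFinished(bool ok)
    {
        if (ok)
            qDebug() << m_page.mainFrame()->toHtml(); // HTML after JS ran
        QApplication::quit();
    }

private:
    QWebPage m_page;
};

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);  // QtWebKit needs a QApplication
    PageGrabber grabber;
    grabber.fetch(QUrl("http://www.sample.com/index.php?param=something"));
    return app.exec();
}

#include "main.moc"
```

Comparing QWebFrame::toHtml() against the plain QNetworkReply payload makes it easy to see whether JavaScript is rewriting or redirecting the page.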

If I find myself needing to impersonate a browser to some remote service, I usually just write a Firefox plugin (in JavaScript); that usually eliminates all of the above problems ;)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow