Question

I have been running my simple dom script on a variety of pages for weeks, and never have I come across any issues. Now, today, when I try:

$html = file_get_html('http://www.sony.co.za/product/dsc-wx10');

I get:

( ! ) Warning: file_get_contents(http://www.sony.co.za/product/dsc-wx10) 
[function.file-get-contents]: failed to open stream: HTTP request failed!
 in C:\XXXXXXX\simplephpdom\simple_html_dom.php on line 70

What could cause the call above to fail when the following calls work?

 $html = file_get_html('http://www.google.com');
 $html = file_get_html('http://www.whatever.com');

I can access the Sony page in my browser, and as far as I understand, the code above connects to port 80 just as my browser does, so I find it hard to believe I'm being blocked. Also, this URL has failed from the very first attempt.

Any ideas what could be causing this?


Solution

The site seems to delay requests containing the PHP user agent forever. Sounds like a really, really lame attempt to stop crawlers.

The solution is simple: use cURL to send the request and specify a "normal" user agent.


Update: Apparently it also blocks empty/missing user agents:

> nc www.sony.co.za 80
nc: using stream socket
GET / HTTP/1.0
Host: www.sony.co.za
User-Agent: Mozilla Firefox

HTTP/1.0 301 Moved Permanently
...

vs

> nc www.sony.co.za 80
nc: using stream socket
GET / HTTP/1.0
Host: www.sony.co.za
[no response]
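
Because the server simply never answers, it may also be worth giving the request a timeout so the script fails fast instead of hanging. A minimal sketch, assuming PHP's built-in http stream wrapper; the helper name make_http_context and the 10-second value are made up for illustration, and the network call is left commented out so the snippet runs offline:

```php
<?php
// Hypothetical helper: build an HTTP stream context that sends a user agent
// and gives up after $timeoutSeconds instead of waiting indefinitely.
function make_http_context($userAgent, $timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,       // sent as the User-Agent header
            'timeout'    => $timeoutSeconds,  // read timeout in seconds
        ),
    ));
}

$context = make_http_context('Mozilla Firefox', 10);

// Inspect what was configured.
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";

// Actual request (needs network access):
// $body = file_get_contents('http://www.sony.co.za/product/dsc-wx10', false, $context);
// if ($body === false) { /* request failed or timed out */ }
```

With a context like this, file_get_contents returns false once the timeout expires rather than blocking.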

OTHER TIPS

I can see you are using simple_html_dom ( http://simplehtmldom.sourceforge.net/ ). Instead of file_get_html, you can fetch the page with cURL and pass the result to str_get_html:

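include 'simple_html_dom.php';

$url = 'http://www.sony.co.za/product/dsc-wx10';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);    // seconds to wait for the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 60);          // seconds allowed for the whole request
// Send a browser-like user agent; the site does not answer requests without one.
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9");
$exec = curl_exec($ch);
// curl_exec() returns false on failure; stop before handing that to the parser.
if ($exec === false) {
    die('cURL error: ' . curl_error($ch));
}
$html = str_get_html($exec);
var_dump($html);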

You need to set the user-agent (header), then it works:

$options = array(
    'http' => array(
        'user_agent' => 'Mozilla Firefox'
    )
);
$context = stream_context_create($options);
$url = 'http://www.sony.co.za/product/dsc-wx10';
// The second argument is $use_include_path (a boolean); the context goes third.
$str = file_get_contents($url, false, $context);
$html = str_get_html($str);

Here Simple HTML DOM makes you do the work for it (loading the string from the remote server). In general I'd suggest using DOMDocument instead of that "simple" HTML DOM library, because it's better integrated and more powerful (and actually works):

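$options = array(
    'http' => array(
        'user_agent' => 'Mozilla Firefox'
    )
);
$context = stream_context_create($options);
// Make libxml (and therefore DOMDocument) use this context for its HTTP requests.
libxml_set_streams_context($context);
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$url = 'http://www.sony.co.za/product/dsc-wx10';
// loadHTMLFile() is an instance method; calling it statically fails on PHP 8.
$doc = new DOMDocument();
$doc->loadHTMLFile($url);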
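
Once the document is loaded, DOMXPath can be used to pull elements out of it. A rough sketch, run against a small inline string so it works without network access; the markup, class name and link paths below are invented for illustration and are not taken from the Sony page:

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$markup = '<html><body>'
        . '<h1 class="title">DSC-WX10</h1>'
        . '<a href="/specs">Specs</a>'
        . '<a href="/support">Support</a>'
        . '</body></html>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text content of the first <h1> element.
$heading = $xpath->query('//h1')->item(0)->textContent;

// href attribute of every link that has one.
$links = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = $a->getAttribute('href');
}

echo $heading, "\n";
echo implode(', ', $links), "\n";
```

The same queries should work on a document loaded with loadHTMLFile, since both methods populate the same DOMDocument object.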
Licensed under: CC-BY-SA with attribution