Question

I'm struggling with a PHP script that automatically collects data from a web server. The files in question contain meteo data and are updated every 10 minutes. Weirdly enough, the 'file modified' date on the web server doesn't change.

A simple fopen('http://...') call tries to get the freshest version of the last file in this directory every hour. But I regularly end up with a version up to 4 hours old. This happens on a Linux server which (as my system administrator has assured me) doesn't use a proxy server of any kind.

Does PHP implement its own caching mechanism? Or what else could be interfering here?

(My current workaround is to grab the file via exec('wget --nocache...') which works.)

Solution

Since you're getting the file via HTTP, I'm assuming that PHP will be honouring any cache headers the server is responding with.

A very simple and dirty way to avoid that is to append a random GET parameter to each request.
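
A minimal sketch of that idea, assuming the server ignores query parameters it doesn't know about. The helper name addCacheBuster, the parameter name nocache and the example.com URL are made up for illustration and are not part of the original question:

```php
<?php
// Hypothetical helper: append a random query parameter so that anything
// caching by full URL sees every request as a different resource.
function addCacheBuster(string $url, string $param = 'nocache'): string
{
    // Use '&' if the URL already carries a query string, '?' otherwise.
    // (URLs containing a #fragment are not handled in this sketch.)
    $separator = (strpos($url, '?') === false) ? '?' : '&';

    // 8 random bytes, hex-encoded, are unique enough for this purpose.
    return $url . $separator . $param . '=' . bin2hex(random_bytes(8));
}

// Usage with a placeholder URL:
// $data = file_get_contents(addCacheBuster('http://example.com/data/latest.dat'));
```

This only sidesteps caches keyed on the URL; it doesn't explain where the stale copy comes from.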

OTHER TIPS

The question concerned observed caching of content accessed via fopen('http://...'), and the poster wondered whether PHP implements its own caching mechanism. The other answers included some speculation, but surely the easiest way to find out is to look at the source code or, perhaps easier, to trace the system calls and see what is going on. This is trivial to do on Debian systems as follows:

$ echo "Hello World" > /var/www/xx.txt
$ strace -tt -o /tmp/strace  \
> php -r 'echo file_get_contents("http://localhost/xx.txt");'
Hello World

I've included the relevant extract of the strace log below. What it shows is that the PHP RTS simply connects to localhost:80, sends a "GET /xx.txt", and gets a response comprising headers and file content, which it then echoes to STDOUT.

Absolutely no client-side caching occurs within the PHP RTS, and since this is doing direct HTTP socket dialogue, it is hard to envision where caching could occur on the client. We are left with the possibility of server-side or intermediate proxy caching. (Note I default to an expires of Access + 7 days on txt files).
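
To narrow down which of the two it is, it may help to look at the caching-related headers the server (or whatever sits in front of it) sends back. The following is only a sketch: the function name cacheHeaders and the list of header names are my own choices, the URL is a placeholder, and X-Cache in particular is a non-standard header that only some proxies add.

```php
<?php
// Hypothetical helper: pick the caching-related entries out of a list of
// raw response header lines such as the one returned by get_headers().
function cacheHeaders(array $headerLines): array
{
    $wanted = ['cache-control', 'expires', 'age', 'last-modified',
               'etag', 'via', 'x-cache'];
    $found = [];
    foreach ($headerLines as $line) {
        // Split "Name: value"; the status line has no colon and is skipped.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue;
        }
        $name = strtolower(trim($parts[0]));
        if (in_array($name, $wanted, true)) {
            $found[$name] = trim($parts[1]);
        }
    }
    return $found;
}

// Usage with a placeholder URL (get_headers() performs a real request):
// $lines = get_headers('http://example.com/data/latest.dat');
// if ($lines !== false) {
//     print_r(cacheHeaders($lines));
// }
```

An Age or Via header in the result would point at an intermediary; a far-future Expires value would point at the origin server's configuration.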

Logfile Extract

00:15:41.887904 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
00:15:41.888029 fcntl(3, F_GETFL)       = 0x2 (flags O_RDWR)
00:15:41.888148 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
00:15:41.888265 connect(3, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
00:15:41.888487 poll([{fd=3, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 60000) = 1 ([{fd=3, revents=POLLOUT}])
00:15:41.888651 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
00:15:41.888838 fcntl(3, F_SETFL, O_RDWR) = 0
00:15:41.888975 sendto(3, "GET /xx.txt HTTP/1.0\r\n", 22, MSG_DONTWAIT, NULL, 0) = 22
00:15:41.889172 sendto(3, "Host: localhost\r\n", 17, MSG_DONTWAIT, NULL, 0) = 17
00:15:41.889307 sendto(3, "\r\n", 2, MSG_DONTWAIT, NULL, 0) = 2
00:15:41.889437 poll([{fd=3, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 0) = 0 (Timeout)
00:15:41.889544 poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 1 ([{fd=3, revents=POLLIN}])
00:15:41.891066 recvfrom(3, "HTTP/1.1 200 OK\r\nDate: Wed, 15 F"..., 8192, MSG_DONTWAIT, NULL, NULL) = 285
00:15:41.891235 poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 1 ([{fd=3, revents=POLLIN}])
00:15:41.908909 recvfrom(3, "", 8192, MSG_DONTWAIT, NULL, NULL) = 0
00:15:41.909016 poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 60000) = 1 ([{fd=3, revents=POLLIN}])
00:15:41.909108 recvfrom(3, "", 8192, MSG_DONTWAIT, NULL, NULL) = 0
00:15:41.909198 close(3)                = 0
00:15:41.909323 write(1, "Hello World\n", 12) = 12
00:15:41.909532 munmap(0x7ff3866c9000, 528384) = 0
00:15:41.909600 close(2)                = 0
00:15:41.909648 close(1)                = 0

So if I'm understanding you correctly, part of the problem might be that the *.dat file always has a timestamp of 1:00 AM? Do you have control of the server containing the data (http://www.iac.ethz.ch/php/chn_meteo_roof/)? If so, you should try to find out why the data always has the same timestamp. I have to believe it is being set intentionally: the OS will update the timestamp when the file is modified unless you go out of your way to prevent it. If you can't figure out why it is being set to 1:00 AM, you could at least run a "touch" command on the file, which will update its modified timestamp.

This is all, of course, assuming you have some access to the server providing the files.

Why not try using cURL? I think it is a more appropriate tool for this.
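
A rough sketch of what that could look like, assuming the cURL extension is installed. The function name fetchFresh and the URL are placeholders; the two request headers are the directives wget sends with --no-cache, and whether a given cache honours them is up to that cache.

```php
<?php
// Hypothetical helper: fetch a URL with cURL while asking caches along
// the way not to serve a stored copy.
// Returns the response body as a string, or false on failure.
function fetchFresh(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 30,    // give up after 30 seconds
        CURLOPT_FRESH_CONNECT  => true,  // don't reuse an existing connection
        CURLOPT_HTTPHEADER     => [
            'Cache-Control: no-cache',
            'Pragma: no-cache',
        ],
    ]);
    return curl_exec($ch);
}

// Usage with a placeholder URL:
// $data = fetchFresh('http://example.com/data/latest.dat');
// if ($data === false) { /* handle the error */ }
```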

Maybe this can resolve your problem (as far as I know, POST requests can't be cached):

// Send the request as a POST with an empty body.
$opts = array('http' =>
  array(
    'method'  => 'POST',
    'content' => ''
  )
);
$context  = stream_context_create($opts);
$resource = fopen('http://example.com/your-url', 'r', false, $context);

/* or you can use file_get_contents to retrieve the whole file
   $fileContent = file_get_contents('http://example.com/your-url', false, $context);
*/
Licensed under: CC-BY-SA with attribution