The issue is actually with symfony/browser-kit and symfony/dom-crawler. The BrowserKit Client does not examine the HTML meta tags to determine the charset; it only looks at the Content-Type header. When the response body is handed over to the DomCrawler, it is treated as the default charset, ISO-8859-1. After examining the meta tags, that decision should be reverted and the DOMDocument rebuilt, but that never happens.
The easy workaround is to wrap $crawler->text() with utf8_decode():
$text = utf8_decode($crawler->text());
This works if the input is UTF-8. I suppose for other encodings something similar can be achieved with iconv() or so. However, you have to remember to do that every time you call text().
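For a non-UTF-8 page, the iconv() variant could look roughly like the sketch below. It assumes the crawler decoded the raw bytes as ISO-8859-1 and that you know the page's real charset. The helper name fixMisdecodedText() is made up for illustration and is not part of Goutte or Symfony. The sample simulates the garbling with an ISO-8859-2 string instead of fetching a page.

```php
<?php
// Hypothetical helper (not part of Goutte/Symfony): repairs text whose bytes
// were wrongly interpreted as ISO-8859-1 and then emitted as UTF-8.
function fixMisdecodedText(string $garbled, string $actualCharset): string
{
    // Step 1: undo the wrong ISO-8859-1 interpretation to get the raw bytes back.
    $rawBytes = iconv('UTF-8', 'ISO-8859-1', $garbled);
    if ($rawBytes === false) {
        throw new RuntimeException('Input does not look like mis-decoded ISO-8859-1 text.');
    }

    // Step 2: decode the raw bytes using the charset the page really uses.
    $fixed = iconv($actualCharset, 'UTF-8', $rawBytes);
    if ($fixed === false) {
        throw new RuntimeException("Bytes are not valid $actualCharset.");
    }

    return $fixed;
}

// Simulate the problem: an ISO-8859-2 page read as if it were ISO-8859-1.
$original  = 'Zażółć gęślą jaźń';
$pageBytes = iconv('UTF-8', 'ISO-8859-2', $original);   // what the server sent
$garbled   = iconv('ISO-8859-1', 'UTF-8', $pageBytes);  // what text() would return

echo fixMisdecodedText($garbled, 'ISO-8859-2'), PHP_EOL;
```

With a real crawler you would pass $crawler->text() as the first argument. For a UTF-8 page, step 1 alone is what utf8_decode() does.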
A more generic approach is to make the DomCrawler believe that it is dealing with UTF-8. To that end I've come up with a Guzzle plugin that overwrites (or adds) the charset in the Content-Type response header. You can find it at https://gist.github.com/pschultz/6554265. Usage is like this:
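<?php
use Goutte\Client;

// ForceCharsetPlugin is the class from the gist linked above; make sure it is
// loaded (autoloader or require) before this point.
$plugin = new ForceCharsetPlugin();
$plugin->setForcedCharset('utf-8');

$client = new Client();
// Register the plugin on the Guzzle client that Goutte uses internally.
$client->getClient()->addSubscriber($plugin);

// $url is the address of the page you want to scrape.
$crawler = $client->request('GET', $url);
echo $crawler->text();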
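To illustrate what "overwrites (or adds) the charset" means for the header value itself, here is a minimal sketch of that string rewrite. This is not the code from the gist. forceCharset() is a hypothetical helper, and falling back to text/html for an empty header is an assumption made for the example.

```php
<?php
// Hypothetical helper, NOT taken from the gist: returns a Content-Type value
// whose charset parameter is replaced by (or extended with) $charset.
function forceCharset(string $contentType, string $charset): string
{
    // Remove an existing charset parameter, quoted or unquoted.
    $stripped = preg_replace('/;\s*charset\s*=\s*("[^"]*"|[^;\s]*)/i', '', $contentType);

    // Tidy up leftover whitespace and dangling semicolons.
    $stripped = rtrim(trim((string) $stripped), "; \t");

    // Assumption for this sketch: treat a missing media type as HTML.
    if ($stripped === '') {
        $stripped = 'text/html';
    }

    return $stripped . '; charset=' . $charset;
}

echo forceCharset('text/html; charset=ISO-8859-1', 'utf-8'), PHP_EOL;
echo forceCharset('text/html', 'utf-8'), PHP_EOL;
```

A plugin would apply a rewrite like this to the response's Content-Type header before Goutte hands the body to the crawler.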