Question

I'm trying to get a short extract from Wikipedia articles. When I open the following URL in my browser: http://en.wikipedia.org//w/api.php?action=query&prop=extracts&format=txt&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=Greek%20language

I get the following result in my browser:

Array
(
[query] => Array
    (
        [pages] => Array
            (
                [11887] => Array
                    (
                        [pageid] => 11887
                        [ns] => 0
                        [title] => Greek language
                        [extract] => Greek (Modern Greek: ελληνικά [eliniˈka] "Greek" and ελληνική γλώσσα [eliniˈci ˈɣlosa] ( ) "Greek language") is an independent branch of the Indo-European family of languages. Native to the southern Balkans, western Asia Minor, Greece, the Aegean Islands, and Cyprus it has the longest documented history of any Indo-European language, spanning 34 centuries of written records. 
                    )

            )

    )

)

Which is great.

The problem is that when I fetch the same URL server-side in PHP with cURL, the non-ASCII letters show up as gibberish. Here's how I'm trying to do that:

$url = 'http://en.wikipedia.org//w/api.php?action=query&prop=extracts&format=txt&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=Greek%20language';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_USERAGENT, "TestScript"); 
$c = curl_exec($ch);
echo $c;

which gives me the following result:

Array ( [query] => Array ( [pages] => Array ( [11887] => Array ( [pageid] => 11887 [ns] => 0 [title] => Greek language [extract] => Greek (Modern Greek: ελληνικά [eliniˈka] "Greek" and ελληνική γλώσσα [eliniˈci ˈɣlosa] ( ) "Greek language") is an independent branch of the Indo-European family of languages. Native to the southern Balkans, western Asia Minor, Greece, the Aegean Islands, and Cyprus it has the longest documented history of any Indo-European language, spanning 34 centuries of written records. ) ) ) )

But the foreign words are gibberish. I get the same result with other articles about foreign languages. How can I receive and present the foreign letters correctly?


Solution

You need to send a Content-Type header that declares the charset:

<?php
header('Content-Type: text/html;charset=utf-8'); //<--- Add this

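That is because the response is encoded as UTF-8, so you need to explicitly set your header to reflect that charset. Without it, the browser may fall back to a different default encoding and render the multi-byte characters as gibberish. Note that header() must be called before any output is sent.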
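Putting the pieces together, a minimal sketch of the whole script might look like the following. The helper names fetch_body and render_utf8_page are hypothetical, introduced only for illustration. Only the header() line comes from the answer itself, and the URL is the one from the question. The sketch also assumes you want to keep the line breaks of the API output, so it escapes the text and wraps it in a pre element.

```php
<?php
// Hypothetical helper names (fetch_body, render_utf8_page); only the header()
// call is taken from the answer, the rest is an illustrative sketch.

// Fetch a URL with cURL; returns the raw response bytes, or null on failure.
function fetch_body(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, 'TestScript'); // same user agent as in the question
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Wrap UTF-8 text in a minimal HTML page that also declares its charset.
function render_utf8_page(string $text): string
{
    // Escape HTML special characters; multi-byte UTF-8 characters pass through unchanged.
    $safe = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head>"
         . "<body><pre>{$safe}</pre></body></html>";
}

// Only perform the request when served over HTTP, not when run from the command line.
if (PHP_SAPI !== 'cli') {
    // Must be sent before any output, otherwise PHP cannot set the header.
    header('Content-Type: text/html; charset=utf-8');

    $url = 'http://en.wikipedia.org//w/api.php?action=query&prop=extracts&format=txt'
         . '&exsentences=2&exlimit=10&exintro=&explaintext=&iwurl=&titles=Greek%20language';
    $body = fetch_body($url);
    echo render_utf8_page($body ?? 'Request failed');
}
```

cURL returns the bytes exactly as the server sent them, so no conversion is needed on the PHP side. The header (and, as a fallback, the meta tag) only tells the browser how to decode those bytes.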

Licensed under: CC-BY-SA with attribution