Question

I have bunch of files that are not in UTF-8 encoding and I'm converting a site to UTF-8 encoding.

I'm using simple script for files that I want to save in utf-8, but the files are saved in old encoding:

header('Content-type: text/html; charset=utf-8');
mb_internal_encoding('UTF-8');
$fpath="folder";
$d=dir($fpath);
while (False !== ($a = $d->read()))
 {

 if ($a != '.' and $a != '..')
  {

  $npath=$fpath.'/'.$a;

  $data=file_get_contents($npath);

  file_put_contents('tempfolder/'.$a, $data);

  }

 }

How can I save files in utf-8 encoding?

Was it helpful?

Solution

file_get_contents / file_put_contents will not magically convert encoding.

You have to convert the string explicitly; for example with iconv() or mb_convert_encoding().

Try this:

$data = file_get_contents($npath);
$data = mb_convert_encoding($data, 'UTF-8', 'OLD-ENCODING');
file_put_contents('tempfolder/'.$a, $data);

Or alternatively, with PHP's stream filters:

$fd = fopen($file, 'r');
stream_filter_append($fd, 'convert.iconv.UTF-8/OLD-ENCODING');
stream_copy_to_stream($fd, fopen($output, 'w'));

OTHER TIPS

Add BOM: UTF-8

file_put_contents($myFile, "\xEF\xBB\xBF".  $content); 
<?php
function writeUTF8File($filename,$content) { 
        $f=fopen($filename,"w"); 
        # Now UTF-8 - Add byte order mark 
        fwrite($f, pack("CCC",0xef,0xbb,0xbf)); 
        fwrite($f,$content); 
        fclose($f); 
} 
?>

Iconv to the rescue.

On Unix/Linux a simple shell command could be used alternatively to convert all files from a given directory:

 recode L1..UTF8 dir/*

Could be started via PHPs exec() as well.

//add BOM to fix UTF-8 in Excel
fputs($fp, $bom =( chr(0xEF) . chr(0xBB) . chr(0xBF) ));

I got this line from Cool

If you want to use recode recursively, and filter for type, try this:

find . -name "*.html" -exec recode L1..UTF8 {} \;

This works for me. :)

$f=fopen($filename,"w"); 
# Now UTF-8 - Add byte order mark 
fwrite($f, pack("CCC",0xef,0xbb,0xbf)); 
fwrite($f,$content); 
fclose($f); 

This is quite useful question. I think that my solution on Windows 10 PHP7 is rather useful for people who have yet some UTF-8 conversion trouble.

Here are my steps. The PHP script calling the following function, here named utfsave.php must have UTF-8 encoding itself, this can be easily done by conversion on UltraEdit.

In utfsave.php, we define a function calling PHP fopen($filename, "wb"), ie, it's opened in both w write mode, and especially with b in binary mode.

<?php
//
//  UTF-8 编码:
//
// fnc001: save string as a file in UTF-8:
// The resulting file is UTF-8 only if $strContent is,
// with French accents, chinese ideograms, etc..
//
function entSaveAsUtf8($strContent, $filename) {
  $fp = fopen($filename, "wb"); 
  fwrite($fp, $strContent);
  fclose($fp);
  return True;
}

//
// 0. write UTF-8 string in fly into UTF-8 file:
//
$strContent = "My string contains UTF-8 chars ie 鱼肉酒菜 for un été en France";

$filename = "utf8text.txt";

entSaveAsUtf8($strContent, $filename);


//
// 2. convert CP936 ANSI/OEM - chinese simplified GBK file into UTF-8 file:
//
$strContent = file_get_contents("cp936gbktext.txt");
$strContent = mb_convert_encoding($strContent, "UTF-8", "CP936");


$filename = "utf8text2.txt";

entSaveAsUtf8($strContent, $filename);

?>

The source file cp936gbktext.txt file content:

>>Get-Content cp936gbktext.txt
My string contains UTF-8 chars ie 鱼肉酒菜 for un été en France 936 (ANSI/OEM - chinois simplifié GBK)

Running utf8save.php on Windows 10 PHP, thus created utf8text.txt, utf8text2.txt files will be automatically saved in UTF-8 format.

With this method, BOM char is not required. BOM solution is bad because it causes troubles when we do sourcing an sql file for MySQL for example.

It's worth noting that I failed making work file_put_contents($filename, utf8_encode($mystring)); for this purpose.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

If you don't know the encoding of the source file, you can list encodings with PHP:

print_r(mb_list_encodings());

This gives a list like this:

Array
(
  [0] => pass
  [1] => wchar
  [2] => byte2be
  [3] => byte2le
  [4] => byte4be
  [5] => byte4le
  [6] => BASE64
  [7] => UUENCODE
  [8] => HTML-ENTITIES
  [9] => Quoted-Printable
  [10] => 7bit
  [11] => 8bit
  [12] => UCS-4
  [13] => UCS-4BE
  [14] => UCS-4LE
  [15] => UCS-2
  [16] => UCS-2BE
  [17] => UCS-2LE
  [18] => UTF-32
  [19] => UTF-32BE
  [20] => UTF-32LE
  [21] => UTF-16
  [22] => UTF-16BE
  [23] => UTF-16LE
  [24] => UTF-8
  [25] => UTF-7
  [26] => UTF7-IMAP
  [27] => ASCII
  [28] => EUC-JP
  [29] => SJIS
  [30] => eucJP-win
  [31] => EUC-JP-2004
  [32] => SJIS-win
  [33] => SJIS-Mobile#DOCOMO
  [34] => SJIS-Mobile#KDDI
  [35] => SJIS-Mobile#SOFTBANK
  [36] => SJIS-mac
  [37] => SJIS-2004
  [38] => UTF-8-Mobile#DOCOMO
  [39] => UTF-8-Mobile#KDDI-A
  [40] => UTF-8-Mobile#KDDI-B
  [41] => UTF-8-Mobile#SOFTBANK
  [42] => CP932
  [43] => CP51932
  [44] => JIS
  [45] => ISO-2022-JP
  [46] => ISO-2022-JP-MS
  [47] => GB18030
  [48] => Windows-1252
  [49] => Windows-1254
  [50] => ISO-8859-1
  [51] => ISO-8859-2
  [52] => ISO-8859-3
  [53] => ISO-8859-4
  [54] => ISO-8859-5
  [55] => ISO-8859-6
  [56] => ISO-8859-7
  [57] => ISO-8859-8
  [58] => ISO-8859-9
  [59] => ISO-8859-10
  [60] => ISO-8859-13
  [61] => ISO-8859-14
  [62] => ISO-8859-15
  [63] => ISO-8859-16
  [64] => EUC-CN
  [65] => CP936
  [66] => HZ
  [67] => EUC-TW
  [68] => BIG-5
  [69] => CP950
  [70] => EUC-KR
  [71] => UHC
  [72] => ISO-2022-KR
  [73] => Windows-1251
  [74] => CP866
  [75] => KOI8-R
  [76] => KOI8-U
  [77] => ArmSCII-8
  [78] => CP850
  [79] => JIS-ms
  [80] => ISO-2022-JP-2004
  [81] => ISO-2022-JP-MOBILE#KDDI
  [82] => CP50220
  [83] => CP50220raw
  [84] => CP50221
  [85] => CP50222
)

If you cannot guess, you try one by one, as mb_detect_encoding() cannot do the job easily.

I put all together and got easy way to convert ANSI text files to "UTF-8 No Mark":

function filesToUTF8($searchdir,$convdir,$filetypes) {
  $get_files = glob($searchdir.'*{'.$filetypes.'}', GLOB_BRACE);
  foreach($get_files as $file) {
    $expl_path = explode('/',$file);
    $filename = end($expl_path);
    $get_file_content = file_get_contents($file);
    $new_file_content = iconv(mb_detect_encoding($get_file_content, mb_detect_order(), true), "UTF-8", $get_file_content);
    $put_new_file = file_put_contents($convdir.$filename,$new_file_content);
  }
}

Usage: filesToUTF8('C:/Temp/','C:/Temp/conv_files/','php,txt');

  1. Open your files in windows notebook
  2. Change the encoding to be an UTF-8 encoding
  3. Save your file
  4. Try again! :O)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top