Importing multibyte characters from CSV files created using Mozilla Thunderbird into PHP

https://stackoverflow.com/questions/11536796

21-06-2021

Question

I am trying to import a CSV file into my PHP application built with Drupal. I have encountered a strange situation when importing CSV files exported from Mozilla Thunderbird (I am exporting the address book of contacts). If I export using the Windows version of Thunderbird, multibyte characters are not rendered to the screen, and appear as missing characters when I dump the extracted contents to screen. However, this problem does not exist when using an identical file created using the Linux version of Thunderbird. In that case everything works perfectly.

To test this I installed the same version of Thunderbird on Linux and Windows 7. I then created the same single user (surname: 张, given name: 利) in the address book and exported the address book as a CSV file. As mentioned above, the Linux CSV file imports successfully but the Windows one doesn't.

If I examine both files in Linux using file --mime myfilename.csv I get the following output:

LinuxTB14.csv: text/plain; charset=utf-8

WinTB14.csv: text/plain; charset=iso-8859-1

So the Windows file, even though it contains Chinese characters, is being reported as iso-8859-1. After discovering this, I assumed that it is an encoding issue and that I just need to tell PHP to encode the offending content as UTF-8.
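One way to see what a file really contains, rather than relying on the guess made by file --mime, is to dump the raw bytes of a field. The sketch below is only an illustration; hex_bytes() is a hypothetical helper and not part of the import code.

```php
<?php
// Hypothetical helper (not part of the original code): show the raw bytes
// of a string as space-separated hex pairs.
function hex_bytes(string $s): string {
  return implode(' ', str_split(bin2hex($s), 2));
}

// UTF-8 encodes 张 (U+5F20) as the three bytes E5 BC A0.
echo hex_bytes("\xE5\xBC\xA0"), "\n"; // e5 bc a0
```

Running the same helper on the first field of each data row from both exports would show whether the two files store the name with different bytes.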

Problem is that PHP appears to be detecting the encoding in another way that I can't understand.

// Set correct locale to avoid any issues with multibyte characters.
$original_local_value = setlocale(LC_CTYPE, 0);
if ($original_local_value !== 'en_US.UTF-8') {
  setlocale(LC_CTYPE, 'en_US.UTF-8');
} 
$handle = fopen($file->uri, "r");
$cardinfo = array();
while (($data = fgetcsv($handle, 5000, ",")) !== FALSE) {
  $cardinfo[] = $data;
  // dsm() is a drupal function which prints the content of the argument to screen.
  dsm(mb_detect_encoding($data[0])); 
  dsm($data[0]);
}
fclose($handle);

If I include the above code, which shows the encoding and content of the first value in each line of the CSV file, I get the following rendered to the screen:

For the CSV created by Thunderbird in Windows

ASCII

First Name

UTF-8

For the CSV created by Thunderbird in Linux

ASCII

First Name

UTF-8

利

As you can see PHP is reporting the same encoding for both files, even though the Chinese character in the Windows file is not being printed to screen.

Anyone have any ideas what might be going on here?

EDIT

If I open the Windows CSV file in Notepad and use Save As with UTF-8 encoding, the file imports correctly. So it is obviously an encoding issue. I have added the following code to convert the file encoding if it is not already UTF-8.

  $file_contents = file_get_contents($file->uri);
  $file_encoding = mb_detect_encoding($file_contents, 'UTF-8, ISO-8859-1, WINDOWS-1252');
  if ($file_encoding  !== 'UTF-8') {
    $file_contents = iconv($file_encoding, 'UTF-8', $file_contents);
    $handle = fopen($file->uri, 'w');
    fwrite($handle, $file_contents);
    fclose($handle);
  }

This partially fixes the problem. The characters are appearing, but they are garbled (e.g. 张 appears as ÕÅ). I checked the page encoding of my browser and the page headers and both are set to UTF-8, so it is not a browser issue.
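The garbling is consistent with the Windows file being written in a legacy Chinese code page rather than ISO-8859-1. This is an assumption, since nothing above confirms which code page Thunderbird used, but in GBK (Windows code page 936) 张 is the two bytes D5 C5, and those same bytes read as ISO-8859-1 are Õ and Å. A minimal sketch, assuming the mbstring extension is available:

```php
<?php
// Assumption: the Windows export stored 张 as the GBK/CP936 bytes D5 C5.
$raw = "\xD5\xC5";

// Treating the bytes as ISO-8859-1 yields the garbled text seen in the browser.
$wrong = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');
// Treating them as CP936 recovers the original character.
$right = mb_convert_encoding($raw, 'UTF-8', 'CP936');

echo $wrong, "\n"; // ÕÅ
echo $right, "\n"; // 张
```

Every byte sequence is valid ISO-8859-1, so a detector that is offered that encoding will never reject it. That would explain why converting from the detected encoding produces readable but wrong characters.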

Any ideas?

Solution

The only solution I have come up with for this issue is to not try to detect and convert the encoding of the uploaded file in the first place. After much research, it appears that reliable detection of arbitrary encodings does not really exist. There is just too much room for error in doing this.

The safest option is to ensure that the uploaded file is encoded in UTF-8, as UTF-8 encoding can be reliably detected. The following code is how I am doing the UTF-8 encoding detection.

$file_content = file_get_contents($file->uri);
// Create regex pattern which detects UTF-8 encoding.
$regex = '%^(?:
  [\x09\x0A\x0D\x20-\x7E]              # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs';
if (!preg_match($regex, $file_content)) {
  // Not valid UTF-8 encoding so flag an error.
}
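As an alternative sketch to the regex, and assuming the mbstring extension is available, mb_check_encoding() performs the same kind of validity check in a single call. The helper name is_valid_utf8() below is hypothetical.

```php
<?php
// Hypothetical helper: TRUE only if $content is well-formed UTF-8.
function is_valid_utf8(string $content): bool {
  return mb_check_encoding($content, 'UTF-8');
}

var_dump(is_valid_utf8("\xE5\xBC\xA0")); // bool(true): 张 encoded as UTF-8
var_dump(is_valid_utf8("\xD5\xC5"));     // bool(false): not a valid UTF-8 sequence
```

Like the regex, this only confirms that the bytes form valid UTF-8. It does not identify what encoding a rejected file actually uses.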

Licensed under: CC-BY-SA with attribution
