You can discard characters that are not supported by an encoding with iconv():
$converted = iconv($input_encoding, $output_encoding . '//IGNORE', $original);
There are two drawbacks:
- You need to know the input encoding, and
- as you can read in a user comment in the manual, iconv() has a bug so that '//IGNORE' does not work with recent versions of the iconv library. The suggested workaround is (here for UTF-8):
  ini_set('mbstring.substitute_character', 'none');
  $text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
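As a minimal sketch of that workaround, the helper below wraps it in a function. The name strip_invalid_utf8 is hypothetical, and using mb_substitute_character() at runtime instead of ini_set() is my assumption about a tidier way to do the same thing, since it lets the previous setting be restored afterwards.

```php
<?php
// Hypothetical helper (not from the manual): drops byte sequences that are
// not valid UTF-8 by converting UTF-8 to UTF-8 with substitution disabled.
function strip_invalid_utf8(string $text)
{
    // Remember the current substitute character so it can be restored.
    $old_substitute = mb_substitute_character();

    // 'none' tells mbstring to drop invalid sequences instead of replacing them.
    mb_substitute_character('none');
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');

    // Restore the previous setting so other code is not affected.
    mb_substitute_character($old_substitute);

    return $clean;
}

// "\xFF" can never appear in valid UTF-8, so it is removed.
$clean = strip_invalid_utf8("abc\xFFdef");
```

Restoring the old value keeps the change local to this call, which matters if other parts of the application rely on the default substitute character.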
However, it is much better to attempt to detect the input encoding and convert the input to the output encoding. This leads to:
function recode ($input, $output_encoding)
{
    // Try to guess the input encoding; returns false if detection fails.
    $input_encoding = mb_detect_encoding($input);
    if ($input_encoding === false)
    {
        // Unknown input: drop everything that is invalid in the output encoding.
        $old_substitute = mb_substitute_character();
        mb_substitute_character('none');
        $converted = mb_convert_encoding($input, $output_encoding, $output_encoding);
        mb_substitute_character($old_substitute);
    }
    else
    {
        // Known input: convert only if the encodings actually differ.
        $converted = ($output_encoding !== $input_encoding)
            ? iconv($input_encoding, $output_encoding, $input)
            : $input;
    }
    return $converted;
}
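One caveat with detection: called without a candidate list, mb_detect_encoding() only considers the configured detect order, which is often just ASCII and UTF-8. The following is a sketch of a variant that passes an explicit candidate list and enables strict mode. The function name recode_with_candidates, the default candidates and their order are assumptions for illustration, and it uses mb_convert_encoding() for the conversion instead of iconv().

```php
<?php
// Hypothetical variant: the candidate list and its order are assumptions
// and should be adapted to the encodings you actually expect.
function recode_with_candidates(
    string $input,
    string $output_encoding,
    array $candidates = ['UTF-8', 'ISO-8859-1']
) {
    // Strict mode (third argument) rejects candidates in which the
    // input contains invalid byte sequences.
    $input_encoding = mb_detect_encoding($input, $candidates, true);

    if ($input_encoding === false) {
        // No candidate matched: drop whatever is invalid in the output encoding.
        $old_substitute = mb_substitute_character();
        mb_substitute_character('none');
        $converted = mb_convert_encoding($input, $output_encoding, $output_encoding);
        mb_substitute_character($old_substitute);
        return $converted;
    }

    return mb_convert_encoding($input, $output_encoding, $input_encoding);
}

// "caf\xE9" is "café" in ISO-8859-1 but not valid UTF-8.
$utf8 = recode_with_candidates("caf\xE9", 'UTF-8');
```

Because ISO-8859-1 assigns a meaning to every byte, it matches any input, so it works best as the last entry of the candidate list.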