Properly validate UTF-8 characters for insertion in a table with utf8_general_ci collation

https://stackoverflow.com/questions/22115604

18-10-2022

Question

While the real problem is the collation of the field in the database, I can't change it. I need to drop the invalid characters instead.

Using @iconv('utf-8', 'utf-8//IGNORE'); won't work, because the characters are valid UTF-8 but are rejected when inserted into a field with that collation.

$broken_example = '&#8634;&#65158;&#2496;&#9628;&#5067;&#65079;&#4634;&#9718;&#65382;&#632;&#681;&#119928;&#5785;&#67844;&#127199;&#691;&#11800;&#6502;&#3590;&#11614;&#19967;&#42522;&#42331;&#119190;&#119249;&#119230;&#8634;&#65158;&#2496;&#9628;&#5067;&#65079;&#4634;&#9718;&#65382;&#632;&#681;&#119928;&#5785;&#67844;&#127199;&#691;&#11800;&#6502;&#3590;&#11614;&#19967;&#42522;&#42331;&#119190;&#119249;&#119230;';
$utf8 = html_entity_decode($broken_example, ENT_QUOTES, 'UTF-8');

I've tried workarounds such as preg_replace('/&#([0-9]{6,});/', '');, but with no success.

The error MySQL reports is: Incorrect string value: '\xF0\x90\xA4\x84\xCA\xB3...'

Solution

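A regex that matches every byte sequence shaped like a UTF-8 character (1 to 4 bytes) is used below. It checks only the lead and continuation byte ranges, so it is a loose validation. Matched sequences are captured and put back via $1; any other byte is matched by the trailing . and replaced with the empty group, which removes it: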

function removeInvalidChars ($text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    return preg_replace($regex, '$1', $text);
}

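MySQL's utf8 character set (the one behind utf8_general_ci) stores at most 3 bytes per character. Removing the 4-byte alternative from the regex therefore keeps only the characters that such a column can store: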

function removeInvalidChars ($text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
    return preg_replace($regex, '$1', $text);
}
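
If the input is already known to be valid UTF-8, a shorter variant of the same idea is to let PCRE work on code points with the u modifier and delete everything from U+10000 upward, which is the range UTF-8 encodes in 4 bytes. This is only a minimal sketch, not the code from the answer above: the function name stripSupplementaryChars is made up for illustration, and the sample bytes are the ones from the error message in the question.

```php
<?php
// Hypothetical helper (name not from the original answer): removes code points
// U+10000..U+10FFFF, i.e. every character that UTF-8 encodes in 4 bytes.
function stripSupplementaryChars(string $text): ?string
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8, so the
    // character class matches whole code points instead of single bytes.
    // preg_replace() returns null when $text is not valid UTF-8.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
}

// Bytes from the MySQL error: U+10904 (4 bytes) followed by U+02B3 (2 bytes).
$sample = "\xF0\x90\xA4\x84\xCA\xB3";
echo bin2hex(stripSupplementaryChars($sample)), "\n"; // cab3
```

Unlike the byte-level regex, this variant does not clean up malformed input; it returns null instead, so it fits best where the string has already been validated.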

By the way, it is the character set that matters here, not the collation. You would also be much better off switching to utf8mb4 with utf8mb4_unicode_ci than putting in a hack like this.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow