Properly validate UTF-8 characters for insertion in a table with utf8_general_ci collation

StackOverflow https://stackoverflow.com/questions/22115604

  •  18-10-2022

Question

While the real problem is the collation of the field in the database, I can't change it. I need to drop the invalid characters instead.

Using @iconv('utf-8', 'utf-8//IGNORE'); won't work, because the characters are valid UTF-8 but are rejected when inserted into a field with that collation.

$broken_example = '↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮↺ﺆী▜Ꮛ︷ሚ◶ヲɸʩ𝑸ᚙ𐤄🃟ʳ⸘ᥦฆⵞ䷿ꘚꕛ𝆖𝇑𝆺𝅥𝅮';
$utf8 = html_entity_decode($broken_example, ENT_QUOTES, 'UTF-8');

I've tried workarounds such as preg_replace('/&#([0-9]{6,});/', '');, but with no success.

The error MySQL reports is Incorrect string value: '\xF0\x90\xA4\x84\xCA\xB3...'


Solution

A regex that keeps every structurally valid UTF-8 sequence (1 to 4 bytes) and drops anything else is:

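function removeInvalidChars ($text) {
    // Group 1 captures a 1-, 2-, 3- or 4-byte UTF-8 sequence; the trailing "." catches any other byte.
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    // Replacing with $1 keeps captured sequences and drops bytes matched by "." (group 1 is empty then).
    return preg_replace($regex, '$1', $text);
}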

Removing the alternative for 4-byte sequences leaves only the characters that can be stored in MySQL's 3-byte utf8 character set, which is the one utf8_general_ci belongs to.

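function removeInvalidChars ($text) {
    // Same pattern without the 4-byte alternative: the bytes of a 4-byte sequence now fall through to "." and are dropped.
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
    return preg_replace($regex, '$1', $text);
}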
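
For input that is already known to be valid UTF-8, the same effect can probably be had with PCRE's UTF-8 mode instead of byte ranges. The sketch below is not part of the original answer: the function name stripFourByteChars is hypothetical, it assumes PHP 7 or later (for the \u{...} string escapes and the ?? operator), and returning an empty string on invalid input is an arbitrary choice made for this example.

```php
<?php
// Hypothetical helper, not from the original answer: drops every code point
// above U+FFFF, i.e. every character that needs 4 bytes in UTF-8.
function stripFourByteChars(string $text): string
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);

    // preg_replace() returns null when the subject is not valid UTF-8;
    // falling back to an empty string is an assumption of this sketch.
    return $clean ?? '';
}

// U+10904 is the character behind \xF0\x90\xA4\x84 in the error message;
// it is removed, while the 2-byte U+02B3 (\xCA\xB3) after it is kept.
$sample = "a\u{10904}\u{02B3}b";
echo bin2hex(stripFourByteChars($sample)), "\n"; // prints 61cab362
```

Unlike the byte-range regex, this version does not repair malformed input, so it only fits when the string has already been validated or decoded as UTF-8.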

By the way, it's the character set that matters, not the collation. You would also be much better off switching to utf8mb4 with utf8mb4_unicode_ci rather than putting in a hack like this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow