A regex for validating all utf-8 chars is:
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
return preg_replace($regex, '$1', $text);
}
Removing the match for 4-byte chars will allow only the characters that can be stored in utf8_general.
function removeInvalidChars ($text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2}) | ./x';
return preg_replace($regex, '$1', $text);
}
btw it's the character set that matters not the collation. Also you would be much better off just switching to utf8mb4 with utf8mb4_unicode_ci rather than putting a hack like this in.