Use the multibyte-safe functions mb_regex_encoding() and mb_ereg_replace(). (I'm not convinced the first one is mandatory. Try without it as well and see if that is sufficient.)
preg_match multi-byte characters by length
19-07-2023
Question
I have some data in an Indian-language encoding. I want to remove the parts that are only one or two characters long, e.g. this is two characters:
ಎನ್
but they are multi-byte.
I've tried to match these using the regex:
'~\b[^ ]{1,2}\b~u'
but it is not working. Any idea?
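One likely factor here is that a quantifier such as {1,2} under the u modifier counts Unicode code points, not visible letters. The sample word looks like two characters but is stored as three code points (two letters plus a virama sign). A quick check, assuming the mbstring extension is loaded and the file is saved as UTF-8:

```php
<?php
// Assumes this source file and the string literal are UTF-8 encoded.
$word = 'ಎನ್';

// Byte length: each Kannada code point takes 3 bytes in UTF-8.
$byteLength = strlen($word);

// Code point length: two letters plus the virama sign.
$codePointLength = mb_strlen($word, 'UTF-8');

// Split into single code points to see what a regex quantifier counts.
$codePoints = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

echo $byteLength, "\n";
echo $codePointLength, "\n";
echo count($codePoints), "\n";
```

This would explain why a range like {2,4} is needed to catch words that read as one or two letters.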
As per the selected answer, the solution is to use the mb_ereg functions. This worked for me:
mb_regex_encoding( 'UTF-8' );
setlocale( LC_CTYPE, 'en_US.UTF-8' );
$str = 'ಆರ್ ವೆಂಕಟಲಕ್ಷ್ಮಿ ಎಸ್ ಎನ್ ಎನ್ ಪದ್ಮಾವತಿ ಎನ್';
echo $str . "\n";
echo mb_ereg_replace( '\b[^\s]{2,4}\b', ' @ ', $str );
echo "\n";
Result:
@ ವೆಂಕಟಲಕ್ಷ್ಮಿ @ @ @ ಪದ್ಮಾವತಿ @
This pattern will not work with the preg functions.
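If the preg functions are preferred, one possible workaround is to drop \b and delimit words by whitespace instead. This is only a sketch and not part of the selected answer: the helper name mask_short_tokens is hypothetical, and it assumes words are separated by whitespace and that the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not from the original post.
// Replaces whitespace-delimited tokens of $min..$max code points with $mark.
function mask_short_tokens(string $text, int $min, int $max, string $mark = '@'): string
{
    // (?<!\S) and (?!\S) require whitespace or a string edge on each side,
    // so \b is not needed; the u modifier makes \S match whole code points.
    $pattern = sprintf('~(?<!\S)\S{%d,%d}(?!\S)~u', $min, $max);

    // preg_replace() returns null on error (e.g. invalid UTF-8);
    // fall back to the unchanged text in that case.
    return preg_replace($pattern, $mark, $text) ?? $text;
}

$str = 'ಆರ್ ವೆಂಕಟಲಕ್ಷ್ಮಿ ಎಸ್ ಎನ್ ಎನ್ ಪದ್ಮಾವತಿ ಎನ್';
echo mask_short_tokens($str, 2, 4), "\n";
```

With the sample string this should leave only the two long names unmasked. Punctuation attached to a word counts toward its length, so adjust the pattern if the data contains any.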
Solution
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow