Question

I have some data in an Indian-language encoding. I want to remove the parts that consist of only one or two characters. For example, this is two characters:

ಎನ್

but the characters are multi-byte.

I've tried to match these using the regex:

'~\b[^ ]{1,2}\b~u'

but it is not working. Any ideas?

As per the selected answer, the solution is to use the mb_ereg functions. This worked for me:

mb_regex_encoding( 'UTF-8' );
setlocale( LC_CTYPE, 'en_US.UTF-8' );
$str = 'ಆರ್‌ ವೆಂಕಟಲಕ್ಷ್ಮಿ ಎಸ್‌ ಎನ್‌ ಎನ್‌ ಪದ್ಮಾವತಿ ಎನ್';
echo $str . "\n";
// {2,4} counts code points: a syllable such as ಎನ್ is three (two letters plus a virama sign).
echo mb_ereg_replace( '\b[^\s]{2,4}\b', ' @ ', $str );
echo "\n";

Result:

 @ ‌ ವೆಂಕಟಲಕ್ಷ್ಮಿ  @ ‌  @ ‌  @ ‌ ಪದ್ಮಾವತಿ  @

This will not work with preg functions.

Solution

Use the multibyte-safe functions mb_regex_encoding() and mb_ereg_replace(). (I'm not convinced the first one is mandatory; try without it as well and see whether that is sufficient.)
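
A minimal sketch of that suggestion, assuming PHP 7+ with the mbstring extension loaded. The helper name replace_short_words() and its parameters are made up for illustration and are not part of PHP. It also shows why a quantifier of {1,2} may be too small: what reads as "two characters" here is three code points (two letters plus a virama sign), so the upper bound has to be given in code points.

```php
<?php
// Assumption: PHP 7+ (for "\u{...}" escapes) with the mbstring extension.
mb_regex_encoding('UTF-8');

// The sample word from the question, spelled out as code points:
// KANNADA LETTER E, KANNADA LETTER NA, KANNADA SIGN VIRAMA.
$word = "\u{0C8E}\u{0CA8}\u{0CCD}";

echo strlen($word), "\n";             // length in bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // length in code points

// Hypothetical helper: replace every whitespace-delimited run of at most
// $maxCodePoints non-space code points with $marker.
function replace_short_words(string $text, int $maxCodePoints, string $marker): string
{
    $pattern = '\b\S{1,' . $maxCodePoints . '}\b';
    $result = mb_ereg_replace($pattern, $marker, $text);

    // mb_ereg_replace() can return false or null on failure;
    // fall back to the unchanged input in that case.
    return is_string($result) ? $result : $text;
}

echo replace_short_words($word . ' abcdefgh', 4, '@'), "\n";
```

The last call replaces the three-code-point word and leaves the eight-letter one alone. If the limit should follow what a reader perceives as one character rather than code points, the bound would need adjusting per script; that is outside what this sketch covers.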

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow