Question

I have some data in an Indian-language encoding. I want to remove the parts that consist of only one or two characters, e.g. this is two characters:

ಎನ್

but they are multi-byte

I've tried to match these using the regex:

'~\b[^ ]{1,2}\b~u'

but it is not working. Any idea?

As per the selected answer, the solution is to use the mb_ereg functions. This worked for me:

mb_regex_encoding( 'UTF-8' );
setlocale( LC_CTYPE, 'en_US.UTF-8' );
$str = 'ಆರ್‌ ವೆಂಕಟಲಕ್ಷ್ಮಿ ಎಸ್‌ ಎನ್‌ ಎನ್‌ ಪದ್ಮಾವತಿ ಎನ್';
echo $str . "\n";
echo mb_ereg_replace( '\b[^\s]{2,4}\b', ' @ ', $str );
echo "\n";

Result:

 @ ‌ ವೆಂಕಟಲಕ್ಷ್ಮಿ  @ ‌  @ ‌  @ ‌ ಪದ್ಮಾವತಿ  @

This will not work with preg functions.
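The sample ಎನ್ reads as two characters but is stored as three code points, which is presumably why the quantifier that worked is {2,4} rather than {1,2}. A minimal sketch to check this, assuming PHP 7+ (for the `\u{...}` escapes) with the mbstring extension; the variable names are only illustrative:

```php
<?php
// The sample from the question, written as explicit code points:
// U+0C8E (letter E), U+0CA8 (letter NA), U+0CCD (virama sign).
$sample = "\u{0C8E}\u{0CA8}\u{0CCD}";

$bytes = strlen($sample);                   // strlen() counts bytes
$codePoints = mb_strlen($sample, 'UTF-8');  // mb_strlen() counts code points

echo "bytes: $bytes, code points: $codePoints\n"; // bytes: 9, code points: 3
```

A regex quantifier in UTF-8 mode counts code points, not the visual characters a reader sees, so the range has to be widened to cover the combining signs.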


Solution

Use the multibyte-safe functions mb_regex_encoding() and mb_ereg_replace(). (I'm not convinced the first one is mandatory; try without it as well and see whether that is sufficient.)
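One way to find out whether the explicit call is needed on a given setup is to inspect the current regex encoding first. A rough sketch, assuming PHP 7+ with mbstring; the token is the sample from the question and the pattern is the one reported to work there:

```php
<?php
// Called without arguments, mb_regex_encoding() returns the encoding
// that the mb_ereg* functions currently use.
$before = mb_regex_encoding();
echo "regex encoding before: $before\n";

// Setting it explicitly removes any dependency on the configured default.
mb_regex_encoding('UTF-8');

// Sample token from the question as code points (U+0C8E, U+0CA8, U+0CCD).
$token = "\u{0C8E}\u{0CA8}\u{0CCD}";

// Same pattern and replacement as in the question's working snippet.
$result = mb_ereg_replace('\b[^\s]{2,4}\b', ' @ ', $token);
echo $result, "\n";
```

If the first line already prints UTF-8, the explicit mb_regex_encoding() call is probably redundant on that installation, though keeping it makes the script independent of php.ini settings.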

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow