Question

I have a set of characters like

., !, ?, ;, (space)

and a string, which may or may not be UTF-8 (any language).

Is there an easy way to find out whether the string contains any of the characters above?

For example:

这是一个在中国的字符串。

which translates to

This is a string in Chinese.

The dot character looks different in the first string. Is that a totally different character, or the UTF-8 counterpart of the dot?

Or maybe there's a list somewhere with Unicode punctuation character codes?


Solution

In Unicode there are character properties (see the PHP docs), for example Symbols, Letters and the like. You can test whether a string contains a character of a specific class with preg_match and the u modifier.

echo preg_match('/\pP$/u', $str);

However, your string needs to be UTF-8 to do that.
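Because the input "may or may not be UTF-8", it can help to treat malformed input explicitly. The following is a minimal sketch, not the answerer's code: the helper name endsWithPunctuation is hypothetical, and it assumes the source file itself is saved as UTF-8. It relies on preg_match returning false when the u modifier is used on a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper: true only if $str is valid UTF-8 and its last
// character has the Unicode "Punctuation" (P) property.
function endsWithPunctuation(string $str): bool
{
    // With the u modifier, preg_match() returns false for malformed UTF-8
    // (preg_last_error() then reports PREG_BAD_UTF8_ERROR).
    $result = preg_match('/\pP$/u', $str);
    if ($result === false) {
        return false; // treat undecodable input as "no match"
    }
    return $result === 1;
}

var_dump(endsWithPunctuation('Test.'));
var_dump(endsWithPunctuation('这是一个在中国的字符串。'));
var_dump(endsWithPunctuation('Test'));
var_dump(endsWithPunctuation("Test.\xFF")); // \xFF is never valid in UTF-8
```

Returning false for malformed input is a design choice made for this sketch; depending on the application it may be better to convert the string to UTF-8 first or to report the error.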

You can test this yourself: I created a little script that tests for all properties via preg_match. Its output:

Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).

Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
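The original script is not reproduced here. As a rough sketch of how such a probe could look, the code below loops over a small, hand-picked subset of Unicode general categories and tests the last character against each one. The $properties list and the helper name lastCharProperties are assumptions made for illustration, not the answerer's actual code.

```php
<?php
// Illustrative subset of the Unicode general categories that PCRE supports.
$properties = [
    'L'  => 'Letter',
    'N'  => 'Number',
    'P'  => 'Punctuation',
    'Po' => 'Other punctuation',
    'Ps' => 'Open punctuation',
    'S'  => 'Symbol',
    'Z'  => 'Separator',
];

// Hypothetical helper: returns a label for every listed property that the
// last character of $str has.
function lastCharProperties(string $str, array $properties): array
{
    $found = [];
    foreach ($properties as $code => $name) {
        // \p{XX}$ matches if the final character belongs to category XX.
        if (preg_match('/\p{' . $code . '}$/u', $str) === 1) {
            $found[] = "$name ($code)";
        }
    }
    return $found;
}

foreach (['Test.', '这是一个在中国的字符串。'] as $str) {
    echo "Looking for properties of last character in \"$str\":\n";
    foreach (lastCharProperties($str, $properties) as $label) {
        echo "Found $label.\n";
    }
    echo "\n";
}
```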

Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.

Other answers

Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character from . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:

preg_match('/[.!?;。]/u', $str, $match)

This will return either 0 or 1 (or false on error), and $match[0] will contain the matched character. For this to work, the string in $str must be properly encoded in UTF-8.
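As a small usage sketch of the character-class approach, the function below also adds the plain space from the question's list to the class. The name firstListedChar is hypothetical, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the first listed character found in $str,
// or null if none is present (or if $str is not valid UTF-8).
function firstListedChar(string $str): ?string
{
    // The class holds . ! ? ; the ideographic full stop and a plain space.
    if (preg_match('/[.!?;。 ]/u', $str, $match) === 1) {
        return $match[0];
    }
    return null;
}

var_dump(firstListedChar('这是一个在中国的字符串。'));
var_dump(firstListedChar('Hello world'));
var_dump(firstListedChar('abc'));
```

Without the u modifier the pattern would be read byte by byte, so the three bytes of 。 would be treated as three separate class members rather than one character.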

If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:

/\p{P}/u
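To see which characters the \p{P} property actually covers, a sketch like the following collects every punctuation character in a string with preg_match_all. The helper name punctuationIn is an assumption for illustration. Note that \p{P} covers neither spaces (category Z) nor symbols such as $ (category S), so the space from the question's list would need to be matched separately.

```php
<?php
// Hypothetical helper: lists all Unicode punctuation characters in $str,
// in order of appearance.
function punctuationIn(string $str): array
{
    // preg_match_all() returns false on malformed UTF-8 when /u is used.
    if (preg_match_all('/\p{P}/u', $str, $matches) === false) {
        return [];
    }
    return $matches[0];
}

print_r(punctuationIn('Hi, there! 你好。'));
print_r(punctuationIn('a b $5'));
```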

You are not trying to transliterate, you are trying to translate!

UTF-8 is not a language; it is an encoding of the Unicode character set, which supports (virtually) all the languages known in the world.

What you are trying to do is something like this:

echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "à è ò ù");

However, that does not work with your Chinese example.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow