Question

I am trying to write code that takes UTF-8 text and creates a slug that keeps some of the non-ASCII characters. So this is not about transliterating UTF-8 into ASCII.

Basically, I want to replace any character that is whitespace, a control character, punctuation, or a symbol with a dash. Unicode categories exist that I should be able to use for this: \p{Z}, \p{C}, \p{P}, and \p{S}, respectively.

So I could do something as simple as this:

preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#", "-", "This. test? has an ö in it");

but it results in this:

This-test-has-an-√-in-it

(and I'd want This-test-has-an-ö-in-it)

It butchers the ö, possibly because in UTF-8 it is encoded as the two bytes c3 b6, of which the b6 seems to be recognized as a punctuation character (so the \p{P} matches here), while the c3 remains in the text. This is strange because, AFAIK, a lone b6 byte is not valid UTF-8.
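The byte-level hypothesis above can be checked with a small Python sketch. It assumes that a regex engine working byte by byte treats each byte as the code point of the same value. The helper name `bytes_as_code_points` is made up for this illustration, and the Mac Roman decoding at the end is only a guess at why the leftover c3 byte was displayed as √.

```python
import unicodedata


def bytes_as_code_points(text: str) -> list:
    """Return (hex byte, Unicode name, general category) for each UTF-8 byte,
    reading every byte as if it were a code point of the same value."""
    rows = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        rows.append((f"{byte:02x}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)))
    return rows


# "\u00f6" is the precomposed ö; it encodes to the two bytes c3 b6.
for row in bytes_as_code_points("\u00f6"):
    print(row)

# Assumption: the terminal decoded the leftover c3 byte as Mac Roman.
print(b"\xc3".decode("mac_roman"))
```

Read this way, c3 is a letter (so it survives the replacement), while b6 is the pilcrow sign, which falls into one of the categories being replaced.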

I also tried the same thing in Perl in order to make sure it is not a PHP problem, but the code

$s = 'This. test? has an ö in it';
$s =~ s/(\p{P}|\p{C}|\p{S}|\p{Z})+/-/g;

also produces:

This-test-has-an-√-in-it

(which probably makes sense, since PCRE in PHP stands for Perl Compatible Regular Expressions)

While when I do this in Python

import regex as re  # third-party "regex" module; the stdlib re has no \p{...} support
text = "This. test? has an ö in it"
print(re.sub(r"(\p{P}|\p{C}|\p{S}|\p{Z})+", "-", text))

it produces my desired

This-test-has-an-ö-in-it

What to do?

Solution

The solution was to use the "Unicode modifier" u:

preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#u", "-", "This. test? has an ö in it");

correctly produces

This-test-has-an-ö-in-it

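So: using Unicode categories without the u modifier produces strange results without any warning. Without it, PCRE treats the subject string as a sequence of single bytes rather than as UTF-8 characters, so \p{P} can match one byte in the middle of a multi-byte character.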
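For comparison, the same slug rule can be sketched in Python using only the standard library, working on decoded characters instead of bytes. This is a minimal illustration and not part of the original answer; the function name `slugify_keep_unicode` and the `SEPARATOR_CLASSES` constant are invented here. Each run of characters whose general category starts with Z, C, P, or S collapses into a single dash.

```python
import unicodedata

# Major Unicode categories to replace: separators, control/other, punctuation, symbols.
SEPARATOR_CLASSES = {"Z", "C", "P", "S"}


def slugify_keep_unicode(text: str, dash: str = "-") -> str:
    """Replace each run of Z/C/P/S characters with one dash, keeping all other characters."""
    out = []
    in_run = False
    for ch in text:
        if unicodedata.category(ch)[0] in SEPARATOR_CLASSES:
            if not in_run:  # emit only one dash per run
                out.append(dash)
                in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


print(slugify_keep_unicode("This. test? has an \u00f6 in it"))
```

Like the regular expression, this sketch leaves a leading or trailing dash when the input starts or ends with punctuation; trimming those would be a separate step.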

Licensed under: CC-BY-SA with attribution