Regular Expression Doesn't Work Properly With Turkish Characters

https://stackoverflow.com/questions/16579113

29-05-2022

Question

I wrote a regex that should extract the following patterns:

"çççoookkk gggüüüzzzeeelll" (it means "vvveeerrryyy gggoooddd", written with the Turkish characters "ç" and "ü")
"ccccoookkk ggguuuzzzeeelll" (it means the same, but written with the English characters "c" and "u")

Here are the regular expressions I'm trying:

"\b[çc]+o+k+\sg+[üu]+z+e+l+\b": this works with the English characters but not with the Turkish ones.
"çok": finds "çok", but "ç+o+k+" doesn't work for "çççoookkk"; it finds only "çoookkk".
"güzel": finds "güzel", but "g+ü+z+e+l+" doesn't work for "gggüüüzzzeeelll".
"\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly.
"[çc]ok\sg[uü]zel": I also tried this to match the plain "çok güzel" pattern, but it doesn't work either.

I think the problem might be using regex operators with Turkish characters. I don't know how to solve this.

I am using http://www.myregextester.com to check if my regular expressions are correct.

I am using PHP to extract a specific pattern from tweets retrieved via the Twitter REST API.

Thanks,

Answer

You have not specified what programming language you are using, but in many of them, the \b word-boundary assertion only recognizes plain ASCII word characters.

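Internally, \b is processed as a boundary between the \w and \W sets.
In turn, in such engines \w is equal to [a-zA-Z0-9_], so "ç" and "ü" fall into \W and no word boundary is found in front of a leading "ç".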
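The effect of an ASCII-only \b can be reproduced with a small sketch. It uses Python rather than the PHP mentioned in the question, purely because Python lets you switch between ASCII and Unicode word semantics with a single flag; the helper name find and the sample strings are illustrative assumptions, not part of the original post.

```python
import re

# The first pattern from the question, unchanged.
PATTERN = r"\b[çc]+o+k+\sg+[üu]+z+e+l+\b"

TURKISH = "çççoookkk gggüüüzzzeeelll"
ENGLISH = "cccoookkk ggguuuzzzeeelll"


def find(text, flags=0):
    """Return the matched substring, or None when nothing matches."""
    match = re.search(PATTERN, text, flags)
    return match.group(0) if match else None


# ASCII mode: \w is [a-zA-Z0-9_], so "ç" is a non-word character and
# there is no \b between the start of the string and the first "ç".
ascii_turkish = find(TURKISH, re.ASCII)
ascii_english = find(ENGLISH, re.ASCII)

# Default mode for str patterns in Python 3: \w and \b are Unicode-aware.
unicode_turkish = find(TURKISH)

print(ascii_turkish, ascii_english, unicode_turkish, sep="\n")
```

In ASCII mode only the English spelling matches, while the Unicode-aware mode also matches the Turkish one. In PHP, the comparable switch is the u pattern modifier, which makes PCRE treat pattern and subject as UTF-8; whether \b and \w also become Unicode-aware depends on the PHP and PCRE versions, so that is worth verifying in your own environment.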

If your text does not contain any unusual space characters (it shouldn't), then consider delimiting the words with the regular whitespace character class (\s) instead of \b.

See this table (scroll down to the Word Boundaries section) to check whether your language supports Unicode for \b. If it says "ascii", then it does not.

As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
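As a rough illustration of the code point approach, the sketch below writes "ç" as \u00e7 and "ü" as \u00fc inside the character classes. It is again in Python, as an assumption for demonstration only; the escape syntax differs between regex engines, and the names PATTERN and find_cok_guzel are hypothetical.

```python
import re

# \u00e7 is "ç" and \u00fc is "ü". The pattern is a raw string, so the
# escapes are resolved by the regex engine, not by the Python parser.
PATTERN = re.compile(r"\b[\u00e7c]+o+k+\s+g+[\u00fcu]+z+e+l+\b")


def find_cok_guzel(text):
    """Return the first stretched "çok güzel" / "cok guzel" match, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


print(find_cok_guzel("bu film çççoookkk gggüüüzzzeeelll olmuş"))
print(find_cok_guzel("bu film cccoookkk ggguuuzzzeeelll olmus"))
```

Writing the characters as escapes keeps the pattern independent of the encoding of the source file, which is the main reason to prefer them over literal national characters.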

See also: utf-8 word boundary regex in javascript