Characters not getting matched by [A-Za-z]

https://stackoverflow.com/questions/21689981

regex
utf-16

09-10-2022

Question

I am trying to match all Latin characters in UTF-16 encoded text. I have been using [A-Za-z], which has been working great. But while parsing Chinese and Japanese text I keep coming across bizarre versions of A-Z that the regex doesn't pick up.

https://gist.github.com/kyleect/1c66fd388d362653969d

On the left are the characters I can't identify; on the right are the ones from my keyboard. I copied and pasted them into Chrome's find-in-page input, Google search, and the find input in my text editor. All agree: Left == Right but Right != Left.

What are these characters, and how do I target them in a regex?

Answer

You can take a look at their character codes in your browser’s console:

> 'Ｂ'.charCodeAt(0).toString(16)
ff22

It’s a fullwidth letter! You can probably match the whole uppercase set with [\uff21-\uff3a] in a decent regex engine (the fullwidth lowercase letters are [\uff41-\uff5a]). Or with a literal [Ａ-Ｚ] in an even more decent one.
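As a rough sketch of how that could look in JavaScript: the pattern below extends [A-Za-z] with the fullwidth uppercase range U+FF21 to U+FF3A and the fullwidth lowercase range U+FF41 to U+FF5A. The helper names `findLatinLetters` and `toAsciiWidth` are made up for illustration. The second helper assumes it is acceptable to normalize the text with NFKC first, which folds fullwidth letters into their ASCII forms but also rewrites other compatibility characters, so it may not suit every input.

```javascript
// ASCII letters plus fullwidth Ａ-Ｚ (U+FF21-U+FF3A) and ａ-ｚ (U+FF41-U+FF5A).
const latinLetters = /[A-Za-z\uFF21-\uFF3A\uFF41-\uFF5A]/g;

// Hypothetical helper: returns every ASCII or fullwidth Latin letter in `text`.
function findLatinLetters(text) {
  // String.prototype.match returns null when nothing matches, so fall back to [].
  return text.match(latinLetters) || [];
}

// Hypothetical helper: NFKC normalization maps fullwidth letters to ASCII,
// after which the original [A-Za-z] pattern matches them.
// Caveat: NFKC also changes other compatibility characters in the text.
function toAsciiWidth(text) {
  return text.normalize('NFKC');
}

console.log(findLatinLetters('日本語ＡＢc')); // the letters Ａ, Ｂ and c
console.log(toAsciiWidth('ＡＢＣ'));          // ABC
```

Matching the ranges directly leaves the text untouched, whereas normalizing first keeps the regex simple at the cost of altering the input.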

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow