I need to specify a regex for validating user input that allows the user to enter a hyphen or an apostrophe on Windows or Mac OS X desktop operating systems.

The user may have configured for the following languages:

  1. English
  2. French
  3. Spanish
  4. Portuguese
  5. Hawaiian

I want to understand whether a standard ASCII regex for hyphen and apostrophe (e.g. ['-]) will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate that my definition is quite loose, as there are many different keyboard layouts, OS versions, and language definitions (e.g. fr_FR, ca_FR).

I have checked the following resources and searched Google generally, but could not find anything stating that the code generated by a hyphen key or an apostrophe key will always be ASCII code 45 and ASCII code 39 respectively.


NOTE: If you feel this question is badly worded, please add a comment to help me improve it.


Solution

You're mixing up a couple of things:

  • the keyboard layout determines which character gets assigned to a scancode.
  • the localization settings determine in which language you should address the user, and whether the user expects a decimal point or a decimal comma.
  • the character encoding is how a glyph is encoded into bits in memory and, in reverse, how bits are decoded into glyphs.

If you're validating user input, you shouldn't be interested in scancodes. A Dvorak layout user on a QWERTY keyboard will press the key labelled Q to input an apostrophe ('), and you shouldn't mess with that. So you have no business dealing with keyboard layouts.

[Image: a Happy Hacking Keyboard with blank keycaps]

The existence of this keyboard should remind you that what the keys do is not your headache; it is up to the user.

The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message, in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.

What matters most when you are validating input is just two things: what is valid, and what the input is.

You (or your domain expert) decide what is valid, for example whether a hyphen-minus is just as acceptable as a hyphen or an en dash.
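One way to record such a decision is to normalize accepted look-alikes to a single character before validating. The sketch below is only an illustration: the policy (which characters count as equivalent) and the names `LOOKALIKES` and `normalize_punctuation` are assumptions, not something the question prescribes.

```python
# Hypothetical policy: these look-alikes are treated as equivalent
# to the ASCII hyphen-minus and ASCII apostrophe.
LOOKALIKES = str.maketrans({
    "\u2010": "-",  # HYPHEN
    "\u2013": "-",  # EN DASH
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (typographic apostrophe)
})


def normalize_punctuation(text: str) -> str:
    """Map the accepted look-alike characters to their ASCII counterparts."""
    return text.translate(LOOKALIKES)


print(normalize_punctuation("O\u2019Brien\u2013Smith"))  # O'Brien-Smith
```

After this step a plain `['-]` character class is enough, because every accepted variant has already been folded into one of those two characters.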

The input will be encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards Unicode if I could help it.
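As a small illustration of that point, the following Python sketch shows the code points and bytes involved. It assumes the application receives text rather than raw key events; the variable names are made up for the example.

```python
# The characters themselves, however the user managed to type them.
apostrophe = "'"
hyphen_minus = "-"

print(ord(apostrophe))    # 39
print(ord(hyphen_minus))  # 45

# Both are in the ASCII range, so ASCII and UTF-8 encode them as the same single byte.
print(apostrophe.encode("ascii") == apostrophe.encode("utf-8"))      # True
print(hyphen_minus.encode("ascii") == hyphen_minus.encode("utf-8"))  # True

# A typographic apostrophe (U+2019) is a different character with a
# different encoding, so an ASCII-only ['-] class would not match it.
curly = "\u2019"
print(curly.encode("utf-8"))  # b'\xe2\x80\x99'
```

So the code points 39 and 45 belong to the characters, not to the keys: any layout that produces an ASCII apostrophe or hyphen-minus yields these values.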

As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?" I guess they probably can. Many important programming languages use those glyphs to delimit strings and as the subtraction operator, respectively. And if your application needs to allow or disallow certain glyphs, you can put Unicode code points or categories in your regex.
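A minimal sketch of both approaches in Python follows. The set of accepted characters and the names `NAME_PATTERN`, `is_valid_name` and `is_dash` are assumptions for illustration. Python's standard `re` module accepts `\uXXXX` escapes but not `\p{...}` category classes (the third-party `regex` module does), so the category check here goes through `unicodedata` instead.

```python
import re
import unicodedata

# Letters, optionally separated by one accepted punctuation character:
# ASCII apostrophe, U+2019, ASCII hyphen-minus, or U+2010 HYPHEN.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
NAME_PATTERN = re.compile(r"[^\W\d_]+(?:['\u2019\-\u2010][^\W\d_]+)*")


def is_valid_name(text: str) -> bool:
    """Return True if the whole string matches the (hypothetical) name rule."""
    return NAME_PATTERN.fullmatch(text) is not None


def is_dash(ch: str) -> bool:
    """Return True if ch is in the Unicode 'Pd' (dash punctuation) category."""
    return unicodedata.category(ch) == "Pd"


print(is_valid_name("O'Brien"))    # True
print(is_valid_name("Jean-Luc"))   # True
print(is_valid_name("Jean--Luc"))  # False
print(is_dash("\u2013"))           # True (EN DASH)
```

Listing explicit code points keeps the accepted set small and auditable, while a category test such as `Pd` accepts every dash-like character, which may be broader than a domain expert wants.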

Licensed under: CC-BY-SA with attribution