Question

I've run into a bit of a problem with a regex I'm using for human names.

$rexName = '/^[a-z\' -]+$/i';

Suppose a user with the name Jürgen wishes to register? Or Böb? That's pretty commonplace in Europe. Is there a special notation for this?

EDIT: I just ran the name Jürgen through a regex generator, and it splits the word at the letter ü...

http://www.txt2re.com/index.php3?s=J%FCrgen+Blalock&submit=Show+Matches

EDIT2: All right, since checking for such specific things is hard, why not use a regex that simply checks for illegal characters?

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";

(Now, which of these can actually be used in a hacking attempt?)

For instance, this allows ' and - signs, yet you need a ; to make it work in SQL, and those will be stopped. Are there any other characters commonly used for HTML injection or SQL attacks that I'm missing?


Solution

I would really say: don't try to validate names. One day or another, your code will meet a name that it thinks is "wrong"... And how do you think someone would react when an application tells them "your name is not valid"?

Depending on what you really want to achieve, you might consider using some kind of blacklist or filter to exclude the "not-names" you have thought of. It may let some "bad names" pass but, at least, it shouldn't prevent any real name from getting into your application.

Here are a few examples of rules that come to mind :

  • no number
  • no special character, like "~{()}@^$%?;:/*§£ø and probably some others
  • no more than 3 spaces ?
  • none of "admin", "support", "moderator", "test", and a few other obvious non-names that people tend to use when they don't want to type in their real name...
    • (but if they don't want to give you their name, they still won't, even if you forbid them from typing random letters: they could just use a real name... which is not theirs)

Yes, this is not perfect; and yes, it will let some non-names pass... But it's probably way better for your application than telling someone "your name is wrong" (yes, I insist ^^ )
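The rules listed above could be sketched roughly like this. It is only an illustration: findNameProblems is a made-up helper name, the placeholder list is just the four words mentioned above, and the special-character class is a reduced, ASCII-only version of the list (I left out §, £ and ø here, partly because ø is a perfectly normal letter in some names).

```php
<?php
// Hypothetical helper: returns the reasons a value looks like a "not-name".
// An empty array means none of the rules fired.
function findNameProblems(string $name): array
{
    $problems = [];

    // Rule: no number.
    if (preg_match('/\d/', $name)) {
        $problems[] = 'contains a digit';
    }

    // Rule: no special character (illustrative, ASCII-only subset).
    if (preg_match('/[~{}()@^$%?;:\/*]/', $name)) {
        $problems[] = 'contains a special character';
    }

    // Rule: no more than 3 spaces.
    if (substr_count($name, ' ') > 3) {
        $problems[] = 'contains more than 3 spaces';
    }

    // Rule: none of the obvious placeholder words (assumed list).
    $placeholders = ['admin', 'support', 'moderator', 'test'];
    if (in_array(strtolower(trim($name)), $placeholders, true)) {
        $problems[] = 'is a placeholder word';
    }

    return $problems;
}

var_dump(findNameProblems('Jürgen'));
var_dump(findNameProblems('ma{rtin'));
```

Returning the list of reasons instead of a plain true/false makes it easier to decide later which rules should really block a registration and which should only be logged.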


And, to answer a comment you left under one other answer :

"I could just forbid the most common characters for SQL injection and XSS attacks,"

About SQL injection: you must escape your data before sending it to the database; and if you always escape that data (you should!), you don't have to care about what users may or may not input: since it is always escaped, there is no risk for you.
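As an illustration of "the input no longer matters", here is a small sketch. Note that it uses a PDO prepared statement rather than a manual escaping call, which achieves the same goal by keeping the value out of the SQL text altogether. The in-memory SQLite database and the users table are assumptions made only so the example is self-contained (it needs the pdo_sqlite extension).

```php
<?php
// Assumption: pdo_sqlite is available; the "users" table is a hypothetical
// stand-in for whatever schema the real application uses.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (name TEXT)');

// A "name" full of characters that would break a naively concatenated query.
$name = "O'Brien; DROP TABLE users";

// The :name placeholder is bound separately from the SQL text.
$insert = $pdo->prepare('INSERT INTO users (name) VALUES (:name)');
$insert->execute([':name' => $name]);

// The value comes back exactly as it was submitted.
$stored = $pdo->query('SELECT name FROM users')->fetchColumn();
var_dump($stored === $name);
```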

The same goes for XSS: as long as you always escape your data when outputting it (you should!), there is no risk of injection ;-)
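For the output side, a minimal sketch could look like this. escapeForHtml is a hypothetical wrapper name; htmlspecialchars is the standard PHP function, and the UTF-8 charset is an assumption about your pages.

```php
<?php
// Hypothetical wrapper around the standard htmlspecialchars function.
function escapeForHtml(string $value): string
{
    // ENT_QUOTES also converts single quotes, so the result can be used
    // inside attribute values delimited by either kind of quote.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$name = '<script>alert("x")</script> O\'Brien';
echo '<p>Hello, ' . escapeForHtml($name) . '</p>', PHP_EOL;
```

The name is stored untouched and only escaped at the moment it is written into HTML, so a legitimate apostrophe or accented letter never has to be rejected.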


EDIT: if you use that regex as it is, it will not work well.

The following code:

$rexSafety = "/^[^<,\"@/{}()*$%?=>:|;#]*$/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will get you at least a warning :

Warning: preg_match() [function.preg-match]: Unknown modifier '{'

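The unescaped / inside the character class is read as the closing delimiter of the pattern, so everything after it, starting with {, is treated as a modifier. You must escape at least some of those special characters; I'll let you dig into the PCRE Patterns documentation for more information (there is really a lot to know about PCRE / regex, and I won't be able to explain it all).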

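If you actually want to check that none of those characters is inside a given piece of data, you might end up with something like this (the class is no longer negated and the anchors are gone, so a match now means that a forbidden character was found):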

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'martin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

(This is a quick and dirty proposition, which has to be refined!)

This one says "ok" (well, I definitely hope my own name is ok!)
And the same example with some special characters, like this:

$rexSafety = "/[\^<,\"@\/\{\}\(\)\*\$%\?=>:\|;#]+/i";
if (preg_match($rexSafety, 'ma{rtin')) {
    var_dump('bad name');
} else {
    var_dump('ok');
}

Will say "bad name"

But please note I have not fully tested this, and it probably needs more work! Do not use this on your site unless you have tested it very carefully!


Also note that a single quote can be helpful when attempting an SQL injection... But it is probably a character that is legal in some names... So just excluding some characters might not be enough ;-)

OTHER TIPS

PHP’s PCRE implementation supports Unicode character properties that span a larger set of characters. So you could use a combination of \p{L} (letter characters), \p{P} (punctuation characters) and \p{Zs} (space separator characters):

/^[\p{L}\p{P}\p{Zs}]+$/u

But there might be characters that are not covered by these character categories while there might be some included that you don’t want to be allowed.
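A quick sketch of how such a pattern behaves, assuming the input is UTF-8 (hence the u modifier, which makes PCRE treat both pattern and subject as UTF-8). The sample names are made up for the demonstration.

```php
<?php
// The Unicode-property pattern, with the "u" modifier for UTF-8 input.
$rexName = '/^[\p{L}\p{P}\p{Zs}]+$/u';

// Made-up samples: some real-looking names, some clearly not names.
$samples = ['Jürgen', 'Böb', "O'Brien", 'Anne-Marie Müller', 'R2D2', 'a<b', 'ma{rtin'];

foreach ($samples as $candidate) {
    $ok = preg_match($rexName, $candidate) === 1;
    echo $candidate, ' => ', $ok ? 'ok' : 'rejected', PHP_EOL;
}
```

Letters with diacritics, apostrophes and hyphens pass, while digits and < do not. But { is classified as punctuation as well, so 'ma{rtin' also passes, which is exactly the kind of unwanted inclusion mentioned above.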

So I advise you against using regular expressions on a datum with such a vague range of values as a real person’s name.


Edit: As you have edited your question and I now see that you just want to prevent certain code injection attacks: you are better off escaping those characters than rejecting them as a potential attack attempt.

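Use prepared statements (or an escaping function such as mysql_real_escape_string) for SQL queries, htmlspecialchars for HTML output, and the appropriate functions for other languages. Note that the old mysql_* extension was removed in PHP 7, so on current PHP versions that means mysqli or PDO.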

That's a problem with no easy general solution. The thing is that you really can't predict what characters a name could possibly contain. Probably the best solution is to define a negative character mask to exclude some special characters you really don't want to end up in a name.

You can do this using:

$regexp = "/^[^<put unwanted characters here>]+$/";
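One way to fill in that mask without hand-escaping every character is to let preg_quote do it. This is only a sketch: the list in $unwanted is a hypothetical example, not a recommendation of which characters to forbid.

```php
<?php
// Hypothetical list of unwanted characters; adjust it to your own needs.
$unwanted = '<>{}()@$%?=:|;#*/"';

// preg_quote escapes the regex metacharacters and, through its second
// argument, the "/" delimiter too, so the list can be written unescaped.
$regexp = '/^[^' . preg_quote($unwanted, '/') . ']+$/';

var_dump(preg_match($regexp, 'martin'));
var_dump(preg_match($regexp, 'ma{rtin'));
```

Here a match means the value contains none of the unwanted characters, so preg_match returning 0 is the "reject" case.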

If you're trying to parse apart a human name in PHP, I recommend Keith Beckman's nameparse.php script.

Licensed under: CC-BY-SA with attribution