Question

I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:

preg_match('!^[\w.-]*$!',$filename) 

This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent unpredictable errors, or will this regex reliably reject multibyte filenames?

Solution

PHP does not have "native" support for multibyte characters; you need to use the "mbstring" extension (which may or may not be available). Furthermore, it would appear that there is no way to create a "multibyte-character string", as such -- rather, one chooses to treat a native string as multibyte-character string by using special "mbstring" functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
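As a small illustration of "a PHP string does not know its own encoding", the sketch below measures the same two bytes three ways. The bytes are written as escapes (0xC3 0x9F, which is ß in UTF-8) so the example does not depend on how the source file is saved; the variable name is only illustrative, and the mbstring calls are guarded because the extension may not be loaded.

```php
<?php
$bytes = "\xC3\x9F"; // the two bytes that encode "ß" (U+00DF) in UTF-8

// strlen() counts bytes, not characters.
echo strlen($bytes), "\n"; // 2

if (function_exists('mb_strlen')) {
    // The caller, not the string, says which encoding to assume.
    echo mb_strlen($bytes, 'UTF-8'), "\n";      // 1
    echo mb_strlen($bytes, 'ISO-8859-1'), "\n"; // 2
}
```

The string itself is identical in all three calls; only the encoding passed by the caller changes the answer.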

You may be able to get away with it so long as you use UTF-8 (or similar) encoding. UTF-8 always encodes multibyte characters to "high" bytes in the range 0x80-0xFF (for instance, ß is encoded as 0xc3 0x9f), so PHP's byte-oriented functions pass them through like any other character and never confuse them with ASCII. You would not be able to use an encoding that might potentially encode a multibyte character into "special" PHP bytes, such as 0x22, the "double-quote" symbol.
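The "high bytes" property can be checked directly. This is a minimal sketch: all_high_bytes() is a hypothetical helper, not a built-in. The Shift_JIS example is not from the answer above; it is added as an illustration of an encoding whose second byte can fall in the ASCII range (the character 表 is 0x95 0x5C there, and 0x5C is the backslash).

```php
<?php
// Hypothetical helper: true when every byte of $s is in 0x80-0xFF.
function all_high_bytes(string $s): bool
{
    if ($s === '') {
        return false;
    }
    foreach (str_split($s) as $byte) {
        if (ord($byte) < 0x80) {
            return false; // found an ASCII-range byte
        }
    }
    return true;
}

$utf8Eszett = "\xC3\x9F"; // "ß" in UTF-8
$sjisHyou   = "\x95\x5C"; // "表" in Shift_JIS; second byte is "\"

echo bin2hex($utf8Eszett), "\n";       // c39f
var_dump(all_high_bytes($utf8Eszett)); // bool(true)
var_dump(all_high_bytes($sjisHyou));   // bool(false)
```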

The only regular expression functions in PHP that know how to deal with specific multibyte characters out of a range of multiple character-sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.

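PCRE-based regular expression functions like preg_match support UTF-8 through the u modifier (PCRE_UTF8), which makes both the pattern and the subject be treated as UTF-8. If the subject is not valid UTF-8, preg_match fails and returns false rather than 0.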
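Applied to the filename check from the question, this might look like the sketch below. Both function names are hypothetical. The first assumes the goal is to reject anything outside ASCII, so it spells out the character class instead of relying on \w; it also uses + and \z where the question used * and $, so that empty names and a trailing newline are rejected, which is an assumption about what is wanted. The second shows the u modifier for the case where Unicode letters should be allowed.

```php
<?php
// Hypothetical helper: ASCII letters, digits, "_", "." and "-" only.
function is_ascii_filename(string $filename): bool
{
    return preg_match('/^[A-Za-z0-9_.-]+\z/', $filename) === 1;
}

// Hypothetical helper: also accept Unicode letters and digits.
// Returns null when the subject is not valid UTF-8.
function is_unicode_filename(string $filename): ?bool
{
    $result = preg_match('/^[\p{L}\p{N}_.-]+\z/u', $filename);
    if ($result === false) {
        return null; // preg_last_error() reports the cause
    }
    return $result === 1;
}

$utf8Name = "stra\xC3\x9Fe.jpg"; // "straße.jpg" as UTF-8 bytes

var_dump(is_ascii_filename('photo_1.jpg'));   // bool(true)
var_dump(is_ascii_filename($utf8Name));       // bool(false)
var_dump(is_unicode_filename($utf8Name));     // bool(true)
var_dump(is_unicode_filename("bad\xFF.jpg")); // NULL
```

Keeping the "not valid UTF-8" case separate from "did not match" lets the caller decide whether to reject the upload or report an encoding problem.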

But of course, as described above, PHP strings don't know their own encoding, so before calling the mb_ereg functions you first need to instruct the "mbstring" library using the mb_regex_encoding function. Note that this function specifies the encoding of the string you're matching, not the string containing the regular expression itself.
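A rough sketch of that sequence with the mbstring functions follows. The helper name and its null return for "mbstring not available" are assumptions made for illustration. The pattern uses \A and \z because, as far as I know, ^ and $ match at line boundaries in the default mbstring regex syntax.

```php
<?php
// Hypothetical helper built on the mbstring regex functions.
// Returns null when mbstring regex support is not loaded.
function mb_is_ascii_filename(string $filename, string $encoding = 'UTF-8'): ?bool
{
    if (!function_exists('mb_ereg') || !function_exists('mb_regex_encoding')) {
        return null;
    }
    $previous = mb_regex_encoding(); // remember the current setting
    mb_regex_encoding($encoding);    // declare the encoding to assume
    $matched = mb_ereg('\A[A-Za-z0-9_.-]+\z', $filename);
    mb_regex_encoding($previous);    // restore the earlier setting
    return $matched !== false;
}

var_dump(mb_is_ascii_filename('photo_1.jpg'));
var_dump(mb_is_ascii_filename("stra\xC3\x9Fe.jpg"));
```

With mbstring loaded, the first call should report a match and the second should not; without it, both return null.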

Licensed under: CC-BY-SA with attribution