Question

I'm currently using the following regular expression to validate URLs:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?  (?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|edu|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$

I borrowed this from somewhere on the web (don't remember where) to improve upon this:

^((https?|file|ftp|gopher|news|nntp):\/\/)([a-z]([a-z0-9\-]*\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel)|(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-z][a-z0-9_]*)?$

However, neither of these is capable of validating this URL (which should be valid):

http://somedomain.com/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg

The problem is the %20 and the round brackets (). Try as I might, I couldn't get either of the regexes above to validate this URL without breaking something else. I'm not experienced with writing fancy regular expressions, so that doesn't help either. All other web results I've found fail on silly things such as this:

http://www.test..com

Help would be appreciated.


Solution

You're validating two things with the same regular expression:

  • Well formed -- Is it syntactically correct?
  • Plausible -- Are the protocol and top-level domain plausible?

Separating these validations may be fruitful. You can use this regular expression to check that the URI is well-formed. It's from RFC 3986, Uniform Resource Identifiers (URI): Generic Syntax, appendix B (p. 50):

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

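If the URI matches this regular expression, it's well formed in the loose sense that it can be split into components. Strictly speaking, the RFC offers this expression for breaking a well-formed URI reference into its parts, and it accepts almost any string, so treat a match as a first check only. The match groups give you the various pieces, which are: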

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9

Let's see what comes out of the sample URI you gave:

2 (scheme)   : "http"
4 (authority): "somedomain.com"
5 (path)     : "/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
7 (query)    : nil
9 (fragment) : nil
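
The breakdown above can be reproduced with a short script. This is a minimal sketch in Python using the standard re module; the choice of language and the helper name split_uri are assumptions made for illustration, not part of the RFC or of the original answer. Groups that do not participate in the match come back as None, which corresponds to nil in the listing above.

```python
import re

# Regular expression from RFC 3986, appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI into its five components (hypothetical helper)."""
    # Every group in the pattern is optional, so match() always succeeds.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri(
    "http://somedomain.com/users/1234/images/"
    "Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
)
for name, value in parts.items():
    print(f"{name:9} : {value!r}")
```

Because the pattern makes no assumptions about which characters appear in the path, the %20 escapes and the round brackets pass through untouched.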

Now that you've got the individual pieces, you can check each one for plausibility. For example, to get the TLD from the authority, apply this regular expression to the authority:

\.([^.]+)$

Group 1 gives you the TLD (com, org, etc.), which you can then check against your list.
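
As a rough sketch of this plausibility step in Python: the code below applies the pattern \.([^.]+)$ (with a + quantifier, so that TLDs longer than one character are captured) and looks the result up in a list. The names KNOWN_TLDS, tld_of and is_plausible_host are hypothetical, the TLD set is an illustrative placeholder rather than a complete list, and the code assumes the authority is a bare host name with no user info or port. The empty-label check is an extra assumption added to catch inputs like www.test..com from the question.

```python
import re

# Illustrative allow-list only; substitute the TLDs you actually accept.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov"}

# Captures everything after the last dot in the host name.
TLD_RE = re.compile(r"\.([^.]+)$")

def tld_of(host):
    """Return the lower-cased TLD of a host name, or None if there is no dot."""
    m = TLD_RE.search(host)
    return m.group(1).lower() if m else None

def is_plausible_host(host):
    """Check that the host has no empty labels and ends in a known TLD."""
    # An empty label means two consecutive dots, or a leading/trailing dot.
    if any(label == "" for label in host.split(".")):
        return False
    return tld_of(host) in KNOWN_TLDS

print(is_plausible_host("somedomain.com"))
print(is_plausible_host("www.test..com"))
```

Keeping this check separate from the syntactic split means the TLD list can change without touching the regular expression.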

Licensed under: CC-BY-SA with attribution