Question

I want a regex to match web addresses such as http://www.example.com, example.co.uk, en.example.com, etc. I've been using ^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$ and testing it on http://regexpal.com/, where it seems to work exactly as it should.

However, when I put it in AutoHotkey, it matches extra things like example and example.something, which it shouldn't, and it fails to match things like example.com/something and example.com/something.html, which it should.

If RegExMatch(Clipboard, "^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$")
    Msgbox, it matches
else
    Msgbox, it doesn't

Solution

Matching URLs, host names, etc. is a problem that has been solved many times; I suggest you adapt a standard regex. The SO question "Fully qualified domain name validation" may be helpful.


If you're composing the regex as an exercise:

Does it really match the string example? The regex requires a literal . somewhere in the string, so it never should. Maybe AHK doesn't handle the \. escape the standard way?

If [a-zA-Z]{2,3} is meant to match the top-level domain, it excludes longer ones such as .info.
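To see the effect of the top-level domain length limit, here is a minimal sketch in Python (used as a neutral reference engine, not AutoHotkey). The helper names url_pattern and matches are hypothetical, and the widened quantifier {2,} is an assumption about what you might want, not a recommendation.

```python
import re


def url_pattern(tld_quantifier):
    """Build the question's pattern with a configurable TLD length quantifier."""
    return re.compile(
        r"^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]"
        + tld_quantifier
        + r"(/\S*)?$"
    )


def matches(pattern, text):
    """Return True if the whole string is accepted by the pattern."""
    return pattern.match(text) is not None


original = url_pattern("{2,3}")  # as written in the question
widened = url_pattern("{2,}")    # assumption: any TLD of two or more letters

for host in ["example.com", "example.co.uk", "example.info"]:
    print(host, matches(original, host), matches(widened, host))
```

Note the trade-off: the widened pattern accepts example.info, but it also accepts example.something, since it can no longer tell a long TLD from an arbitrary word.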

You may want to allow whitespace of arbitrary length at the beginning and end, in case you accidentally copied some into the clipboard, i.e. ^\s*your-regex-thingy\s*$
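The whitespace wrapper can be sketched as follows, again in Python rather than AutoHotkey. The names URL_CORE, STRICT and LENIENT are hypothetical, and the sample strings only imitate what a clipboard might hold (for example a trailing carriage return and line feed).

```python
import re

# The question's pattern without its anchors.
URL_CORE = r"(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?"

STRICT = re.compile(r"^" + URL_CORE + r"$")            # no surrounding whitespace allowed
LENIENT = re.compile(r"^\s*" + URL_CORE + r"\s*$")     # tolerates whitespace at both ends


def accepts(pattern, text):
    """Return True if the pattern matches the text from its start."""
    return pattern.match(text) is not None


clipboard_like = "  example.com/something.html \r\n"
print(accepts(STRICT, clipboard_like))   # leading spaces already break the strict form
print(accepts(LENIENT, clipboard_like))
```

Whitespace inside the address is still rejected by both forms; only leading and trailing whitespace is tolerated.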

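example.something should not match either. The empty alternative matches at the start, [a-zA-Z0-9\-\.]+ takes example, and \. takes the dot, but after that only 2 or 3 letters are allowed before the end of the string or a /, and something is nine letters long. If AHK reports a match, then either the pattern or the input it sees differs from what you tested.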

example.com/something.html might fail to match if the entire substring example.com is consumed by [a-zA-Z0-9\-\.]+. It shouldn't with a correctly implemented engine, though, because backtracking gives the trailing .com back to the rest of the pattern. Perhaps you need to escape +, | or similar; engines have varying conventions on this (e.g. sed and PCRE treat + and ( differently, if I'm not mistaken).
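One way to separate a pattern problem from an AHK problem is to run the identical pattern in another backtracking engine and compare. The sketch below uses Python's re module as that reference; it says nothing about how AutoHotkey itself behaves, and the names PATTERN, SAMPLES and check are hypothetical.

```python
import re

# The pattern exactly as given in the question.
PATTERN = re.compile(r"^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$")

# Every string mentioned in the question.
SAMPLES = [
    "http://www.example.com",
    "example.co.uk",
    "en.example.com",
    "example",
    "example.something",
    "example.com/something",
    "example.com/something.html",
]


def check(samples):
    """Map each sample to True or False depending on whether PATTERN accepts it."""
    return {text: PATTERN.match(text) is not None for text in samples}


for text, ok in check(SAMPLES).items():
    print(("match    " if ok else "no match ") + text)
```

In Python the pattern rejects only example and example.something and accepts the other five strings, which is the behaviour the question expects. A different result in AHK would therefore point at how the pattern or the clipboard text reaches RegExMatch rather than at the regex itself.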

Licensed under: CC-BY-SA with attribution