That fails whenever the match is not at the start of the string, because of the ^ anchor at the beginning of the pattern. Without that anchor:
/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g
js> "www.foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["www.foobar.com"]
js> "aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu toto@foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["foobar.com"]
It still matches a space before the domain, though, because ([^@]) has to consume one character and a space qualifies. It also makes wrong assumptions about what a domain looks like:
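To see where the stray space comes from, it helps to look at the capture groups of a single match. A small sketch using `exec` on the same pattern without the `g` flag (the variable names are only illustrative):

```javascript
const re = /(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/;
const m = re.exec("aoeuaoeu foobar.com");

// m[0] is the whole match; m[2] is the single character taken by ([^@]).
console.log(JSON.stringify(m[0])); // " foobar.com"
console.log(JSON.stringify(m[2])); // " "
console.log(JSON.stringify(m[3])); // "foobar."
```

The group that was meant to reject an @ ends up swallowing whatever character precedes the domain.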
xyz.example.org is a valid domain not matched by your regexp;
www.3x4mpl3.org is a valid domain not matched by your regexp;
example.co.uk is a valid domain not matched by your regexp;
ουτοπία.δπθ.gr is a valid domain not matched by your regexp.
What defines a legal domain name? It's just a sequence of Unicode labels separated by dots. It can't have two consecutive dots, and the shortest possible form is \w\.\w\w (as I don't think a one-letter TLD exists).
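Those rules can be sketched as a small check. This only mirrors the loose description given here, not the DNS specifications, and `looksLikeDomain` is a hypothetical helper name:

```javascript
// Rough check: dot-separated labels, no empty label, last label of 2+ characters.
function looksLikeDomain(name) {
  const labels = name.split(".");
  if (labels.length < 2) return false; // needs at least one dot
  // An empty label means a leading, trailing, or doubled dot.
  if (labels.some((label) => label.length === 0)) return false;
  const tld = labels[labels.length - 1];
  return tld.length >= 2; // assumption: no one-letter TLD exists
}

console.log(looksLikeDomain("example.co.uk")); // true
console.log(looksLikeDomain("foo..bar"));      // false
console.log(looksLikeDomain("x.y"));           // false
```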
The way I'd do it, though, is to simply match everything that looks like a domain: any run of text with dot separators, delimited by word boundaries (\b):
/\b(\w+\.)+\w+\b/g
js> "aoe toto.example.org uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar", "f00bar.com"]
and then make a second pass over the list of domains found to check whether each one really exists. The downside is that JavaScript regexps are not Unicode-aware by default: neither \b nor \w will accept ουτοπία.δπθ.gr as a valid domain name.
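A minimal sketch of such a second pass, assuming that "exists" is approximated by looking the last label up in a list of known TLDs. The `knownTlds` set below is a deliberately tiny, hypothetical sample; a real check would use a complete list or a DNS lookup:

```javascript
// Hypothetical sample of accepted TLDs, for illustration only.
const knownTlds = new Set(["com", "net", "org", "edu", "uk", "gr"]);

const text = "aoe toto.example.org uaoeu foo.bar aoeuaoeu";
// match() returns null when nothing is found, hence the fallback to [].
const candidates = text.match(/\b(\w+\.)+\w+\b/g) || [];

// Keep only candidates whose last label is a known TLD.
const plausible = candidates.filter((domain) => {
  const tld = domain.slice(domain.lastIndexOf(".") + 1).toLowerCase();
  return knownTlds.has(tld);
});

console.log(plausible); // [ 'toto.example.org' ]
```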
In ES6 there's the /u flag, but on its own it does not make \w or \b Unicode-aware (that takes Unicode property escapes such as \p{L}, added in ES2018), so this still misses ουτοπία.δπθ.gr:
"ουτοπία.δπθ.gr aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/gu)
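A sketch of a Unicode-aware variant, assuming an ES2018 engine with property escapes and lookbehind. Since \b stays ASCII-only, the word boundaries are emulated with lookarounds. The class `W` (letters, marks, numbers, underscore) is an assumption about what should count as a word character:

```javascript
// Assumed "word character": any Unicode letter, combining mark, number, or underscore.
const W = "[\\p{L}\\p{M}\\p{N}_]";
// (?<!W) and (?!W) stand in for \b, which does not understand non-ASCII letters.
const unicodeDomain = new RegExp(`(?<!${W})(?:${W}+\\.)+${W}+(?!${W})`, "gu");

const sample = "ουτοπία.δπθ.gr aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu";
const found = sample.match(unicodeDomain) || [];
console.log(found);
```

Like the ASCII pattern, this still picks up example.org from the e-mail address.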
edit:
A negative lookbehind solves it - but obviously not in JS.
Yes it will. To skip all e-mail addresses, here's a working lookbehind version of the regex:
/(?<!@)\b(\w+\.)+\w+\b/g
js> "aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/(?<!@)\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar", "f00bar.com"]
though, as with Unicode support, it needs a recent engine: lookbehind was only added to JavaScript in ES2018.
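One caveat with a lookbehind that only inspects the character just before the match: an address such as toto@mail.example.org would still yield example.org. A sketch of a stricter variant, assuming an engine with ES2018 variable-length lookbehind; the name `notInEmail` is only illustrative:

```javascript
// Reject a match if an @ followed only by word characters or dots precedes it,
// i.e. if the candidate sits anywhere inside the host part of an e-mail address.
const notInEmail = /(?<!@[\w.]*)\b(\w+\.)+\w+\b/g;

const input = "toto@mail.example.org toto.example.org foo.bar";
const domains = input.match(notInEmail) || [];
console.log(domains); // [ 'toto.example.org', 'foo.bar' ]
```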
Without lookbehind, the only way around it is to keep the @ in the match and then discard any match that contains an @:
js> "toto.net aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g).map(function (x) { if (!x.match(/@/)) return x })
["toto.net", (void 0), "toto.example", "foo.bar", "f00bar.com"]
or use filter to drop them outright (the array comprehensions once proposed for ES6 were removed from the standard and only ever shipped in Firefox):
"toto.net aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g).filter(function (x) { return !x.match(/@/) });
one final update:
/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g
> "x.y tot.toc.toc $11.00 11.com 11foo.com toto.11 toto.net aoe toto@example.org toto.example.org uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g).filter(function (x) { if (!x.match(/@/)) return x })
[ 'tot.toc.toc',
'11foo.com',
'toto.net',
'toto.example.org',
'foo.bar',
'f00bar.com' ]
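Wrapped up as a reusable function, this could look like the following sketch. `extractDomains` is a hypothetical name, and the `|| []` guard covers the case where `match` returns null because nothing matched:

```javascript
// Extract domain-like tokens from free text, skipping the host part of e-mail addresses.
function extractDomains(text) {
  // The optional leading @ is kept in the match so e-mail hosts can be recognised.
  const pattern = /@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g;
  const candidates = text.match(pattern) || [];
  // Drop every candidate that was preceded by an @.
  return candidates.filter((candidate) => !candidate.includes("@"));
}

console.log(extractDomains("mail toto@example.org or visit toto.example.org"));
// [ 'toto.example.org' ]
```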