Question

This should be simple, but it's eluding me. There are many good and bad regex methods to match a URL, with or without the protocol, with or without www. The problem I have is this (in JavaScript): if I use a regex to match URLs in a text string, and set it so that it will match just 'domain.com', it also catches the domain of an e-mail address (the part after '@'), which I don't want. A negative lookbehind solves it - but obviously not in JS.

This is my nearest success so far:

 /^(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

but that fails if the match is not at the start of the string. And I'm sure I'm tackling it the wrong way. Is there a simple answer out there anywhere?

EDIT: Revised regex to respond to a few of the comments below (it sticks with 'www' rather than allowing sub-domains):

\b(www\.)?([^@])(\w*\.)(\w{2,3})(\.\w{2,3})?(\/\S*)?$

As mentioned in the comments however, this still matches the domain after a @.

Thanks


Solution 2

After a lot of messing about, this ended up working (with a definite hat tip to @zmo's final comment):

var rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
var link = txt.match(rx);
if (link !== null) {
  for (var i = 0; i < link.length; i++) {
    if (link[i].indexOf('@') == -1) {
      // create link
    } else {
      // create mailto
    }
  }
}

I'm aware of the limitations with regard to sub-domains, TLDs, etc. (which @zmo has addressed in the other answer; if you need to catch all URLs, I'd suggest you adapt that code), but that was not the main issue in my case. The code in my answer matches URLs present in a text string without 'www.', without also catching the domain of an e-mail address.
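A minimal sketch of how the loop above might be wrapped into a reusable function. The name `splitLinks` and the shape of the returned object are assumptions made for illustration, not part of the original answer; the regex is the one from the answer, unchanged.

```javascript
// Hypothetical helper: separates URL-like matches from e-mail-like matches.
function splitLinks(txt) {
  // Same pattern as in the answer: the optional (\w*@)? keeps the "@" in the
  // match so that e-mail addresses can be told apart afterwards.
  var rx = /\b(www\.)?(\w*@)?([a-zA-Z\-]*\.)(com|org|net|edu|COM|ORG|NET|EDU)(\.au)?(\/\S*)?/g;
  var matches = txt.match(rx) || []; // match() returns null when nothing is found
  var result = { urls: [], emails: [] };
  matches.forEach(function (m) {
    if (m.indexOf('@') === -1) {
      result.urls.push(m);   // no "@": treat as a link
    } else {
      result.emails.push(m); // contains "@": treat as a mailto target
    }
  });
  return result;
}

console.log(splitLinks("see example.com or mail joe@example.org and www.foo.net/bar"));
```

The same limitations apply as in the answer itself: only the listed TLDs are recognised, and a local part containing a dot (such as first.last@example.org) is not handled.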

OTHER TIPS

that fails if the match is not at the start of the string

That's because of the ^ at the beginning of the pattern, which anchors the match to the start of the string. Without it:

/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g

js> "www.foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["www.foobar.com"]
js> "aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
[" foobar.com"]
js> "toto@aoeuaoeu toto@foobar.com".match(/(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/g)
["foobar.com"]

though it's still matching a space before the domain. And it's making wrong assumptions about the domain…
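To see why the space ends up in the match, and why the domain after an @ is still caught, it may help to look at what the `([^@])` group captures. This is a small sketch; the helper name `inspect` is made up for illustration. The group has to consume exactly one character that is not "@", so it either swallows the character before the domain or the first letter of the domain itself.

```javascript
// Same pattern as above, without the g flag so exec() exposes the groups.
var rx = /(www\.)?([^@])([a-z]*\.)(com|net|edu|org)(\.au)?(\/\S*)?$/;

// Hypothetical helper: returns the whole match and the single character
// consumed by the ([^@]) group (capture group 2).
function inspect(text) {
  var m = rx.exec(text);
  return m === null ? null : { whole: m[0], consumed: m[2] };
}

console.log(inspect("aoeuaoeu foobar.com")); // ([^@]) takes the space
console.log(inspect("toto@foobar.com"));     // ([^@]) takes the "f" right after the "@"
```

In other words, `[^@]` only says "one character here that is not @"; it says nothing about the character before the match, which is what a lookbehind would check.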

  • xyz.example.org is a valid domain not matched by your regexp ;
  • www.3x4mpl3.org is a valid domain not matched by your regexp ;
  • example.co.uk is a valid domain not matched by your regexp ;
  • ουτοπία.δπθ.gr is a valid domain not matched by your regexp.

What defines a legal domain name? Roughly, it's a sequence of labels (which may contain non-ASCII characters) separated by single dots: it can't have two dots following each other, and the minimal form is \w\.\w\w (as I don't think a one-letter TLD exists).

Though, the way I'd do it is to simply match everything that looks like a domain, by taking everything that is text with a dot separator using word boundaries (\b):

/\b(\w+\.)+\w+\b/g

js> "aoe toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar"]
js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/\b(\w+\.)+\w+\b/g)
["example.org", "toto.example.org", "foo.bar", "f00bar.com"]

and then make a second pass over the list of domains found to check whether each one really exists. The downside is that \w and \b in JavaScript regexps only know ASCII word characters, so neither will accept ουτοπία.δπθ.gr as a valid domain name.
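One possible shape for that second pass, as a sketch only: keep the candidates whose last label appears in a set of known TLDs. The function name `filterByTld` and the tiny `KNOWN_TLDS` set are assumptions for illustration; a real check would need a complete TLD list or an actual lookup.

```javascript
// Illustrative subset only, NOT a complete list of TLDs.
var KNOWN_TLDS = { com: true, org: true, net: true, edu: true, uk: true, gr: true };

// Hypothetical second pass: keep candidates whose last label is in KNOWN_TLDS.
function filterByTld(candidates) {
  return candidates.filter(function (c) {
    var tld = c.slice(c.lastIndexOf('.') + 1).toLowerCase();
    return Object.prototype.hasOwnProperty.call(KNOWN_TLDS, tld);
  });
}

// First pass: everything that looks like dot-separated words.
var found = "aoe toto.example.org  uaoeu foo.bar aoeuaoeu".match(/\b(\w+\.)+\w+\b/g) || [];
console.log(filterByTld(found)); // "foo.bar" is dropped because "bar" is not in the sample set
```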

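In ES6 there's the /u flag, but on its own it doesn't make \w or \b Unicode-aware. For that you need Unicode property escapes such as \p{L} (added in ES2018), which only work together with /u, and an explicit character class in place of \w and \b:

"ουτοπία.δπθ.gr aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu".match(/([\p{L}\p{M}\p{N}_]+\.)+[\p{L}\p{M}\p{N}_]+/gu)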

edit:

A negative lookbehind solves it - but obviously not in JS.

Lookbehind does solve it, in engines that support it (it was standardized in ES2018). For skipping all e-mail addresses, here's a lookbehind version of the regex:

/(?<!@)\b(\w+\.)+\w+\b/g

js> "aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/(?<!@)\b(\w+\.)+\w+\b/g)
["toto.example.org", "foo.bar", "f00bar.com"]

though, as with Unicode support, you can't rely on it in older JS engines.
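Since lookbehind support depends on the engine, here is a hedged sketch that feature-detects it and otherwise falls back to keeping the @ in the match and filtering those matches out. The function name `findDomains` is an assumption for illustration. The pattern is built with `new RegExp` inside a try/catch because a regex literal with unsupported syntax would fail at parse time, before any code runs.

```javascript
// Hypothetical helper: returns domain-like strings that are not e-mail domains.
function findDomains(txt) {
  var rx = null;
  try {
    // Negative lookbehind: the match must not be directly preceded by "@".
    rx = new RegExp('(?<!@)\\b(\\w+\\.)+\\w+\\b', 'g');
  } catch (e) {
    rx = null; // engine without lookbehind support
  }
  if (rx !== null) {
    return txt.match(rx) || [];
  }
  // Fallback: let the match include a leading "@", then drop those matches.
  return (txt.match(/@?\b(\w+\.)+\w+\b/g) || []).filter(function (m) {
    return m.indexOf('@') === -1;
  });
}

console.log(findDomains("aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com"));
```

Either way, an address whose local part contains a dot (first.last@example.org) would still yield "first.last", because nothing in the pattern looks at what follows the match.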

Where lookbehind isn't available, the workaround is to keep the @ in the match and then discard any match that contains an @:

js> "toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g).map(function (x) { if (!x.match(/@/)) return x })
["toto.net", (void 0), "toto.example", "foo.bar", "f00bar.com"]

The array comprehension syntax from JS1.7 was dropped from the ES6 draft and never standardized, so use filter instead of map to get rid of the undefined entries:

"toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b\w+\.+\w+\b/g).filter(function (x) { return !x.match(/@/); });

one final update:

/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g

> "x.y tot.toc.toc $11.00 11.com 11foo.com toto.11 toto.net aoe toto@example.org toto.example.org  uaoeu foo.bar aoeuaoeu f00bar.com".match(/@?\b(\w*[^\W\d]+\w*\.+)+[^\W\d_]{2,}\b/g).filter(function (x) { if (!x.match(/@/)) return x })
[ 'tot.toc.toc',
  '11foo.com',
  'toto.net',
  'toto.example.org',
  'foo.bar',
  'f00bar.com' ]
Licensed under: CC-BY-SA with attribution