Javascript Regex - unexpected behaviour on faking lookbehind

https://stackoverflow.com/questions/11278459

18-06-2021

Question

I am trying to code a widget that collates Tweets from multiple sources as an exercise (something similar exists here, but a) the list option offered there did not load any of my lists, and b) it is a useful learning exercise!). As part of this, I wanted to write a regex which replaces a Twitter handle ('@' followed by characters) with a link to the user's Twitter page. However, I did not want false positives for, for instance, an email address in a tweet.

So, for instance, the replacement should turn

Hey there @twitteruser, my email address is address@gmail.com

into

Hey there <a href="http://twitter.com/twitteruser">@twitteruser</a>, my email address is address@gmail.com

Guided by this question, which suggested that I needed some way of replicating negative look-behinds in Javascript, I wrote the following code:

tweetText = tweetText.replace(/(\S)?@([^\s,.;:]*)/ig, function($0, $1){
    return $1 ? $0 + '@' + $1 : '<a href="http://www.twitter.com/' + $0 + '">@' + $0 + '</a>'
});

However, in the cases where the final part of the ternary operator is triggered, $0 contains the '@' symbol. This was unexpected for me - since the '@' was not enclosed in parentheses, I expected $0 to match '([^\s,.;:]*)' - that is, the username of the Twitter user (after, and without, the '@'). I can get the desired functionality by using $0.substring(1), but I would like to further my understanding.

Could someone please point out what I have misunderstood? I am quite new to regexes, and have never written them in Javascript, nor have I ever used negative look-behinds.

Solution

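The first argument passed to a replace callback is always the entire match, '@' included, no matter which parts are in parentheses; the captured groups follow as the second and third arguments. So your $0 is the whole match, and your $1 is actually the optional (\S) group.

In any case, instead of trying to match an optional non-space before the @, and rejecting the match if you find one, why not just require a space (or the beginning of the string) before the @?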

tweetText = tweetText.replace(
    /(^|\s)@([^\s,.;:]*)/g,
    '$1<a href="http://www.twitter.com/$2">@$2</a>'
);

Not only is this simpler, but it's likely to be quite a bit faster too, since the regexp needs to consider far fewer potential matches.
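
As a quick check, the replacement above can be wrapped in a small function and run against the sample tweet from the question. The function name linkHandles and the variable sample are illustrative assumptions, not part of the original answer.

```javascript
// Hypothetical helper wrapping the replacement from the answer above.
function linkHandles(tweetText) {
  return tweetText.replace(
    // Group 1: start of string or one whitespace character.
    // Group 2: the handle, up to the next whitespace or , . ; :
    /(^|\s)@([^\s,.;:]*)/g,
    '$1<a href="http://www.twitter.com/$2">@$2</a>'
  );
}

const sample = 'Hey there @twitteruser, my email address is address@gmail.com';

// The handle is linked; the email address is left alone because its '@'
// is preceded by a letter rather than whitespace.
console.log(linkHandles(sample));
```

Putting $1 back at the front of the replacement string preserves the whitespace that the pattern consumed before the '@'.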

OTHER TIPS

I think you might be complicating things too much. Try this to retrieve the usernames and then make your own helper function to create the markup.

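var getTwitter = function (str) {
  // Require the start of the string or a non-word character before the '@',
  // so a handle at the very beginning is found and email addresses are skipped.
  var re = /(?:^|\W)(@\w+)/g,
      match,
      tweets = [];
  // With the g flag, each exec() call resumes from re.lastIndex.
  while ((match = re.exec(str)) !== null)
    tweets.push(match[1]); // match[1] is the handle, including the '@'
  return tweets;
};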

Demo: http://jsfiddle.net/elclanrs/gLvX4/

As is standard behaviour in most REGEX implementations, match zero is the whole match (including, as part of it, any sub-matches - even any that are marked as non-capturing), then any subsequent matches are the captured sub-matches. Check out www.regular-expressions.info. For example:

console.log('hello, there'.match(/\w+(?:,) ?(\w+)/));

Gives you the array

["hello, there", "there"] //the first sub-match is non-capturing
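
The same rule applies to the callback form of replace: the first argument is the whole match, and the captured groups follow it. Below is a minimal sketch using the regex from the question. The helper name linkWithCallback and the parameter names match, before and name are illustrative choices, not names from the original code.

```javascript
// Hypothetical rewrite of the question's callback, with the parameters named
// for what replace() actually passes: (wholeMatch, group1, group2, ...).
function linkWithCallback(tweetText) {
  return tweetText.replace(/(\S)?@([^\s,.;:]*)/g, function (match, before, name) {
    // If a non-space character precedes the '@' (as in an email address),
    // return the matched text unchanged; otherwise build the link.
    return before
      ? match
      : '<a href="http://www.twitter.com/' + name + '">@' + name + '</a>';
  });
}

console.log(linkWithCallback('Hey there @twitteruser, my email address is address@gmail.com'));
```

Here name is the second captured group, so no substring(1) call is needed to drop the '@'.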

JavaScript engines that predate ES2018 do not support look-behinds, but there are simulations for this, none perfect, like the one I wrote. JavaScript's regex implementation has in general been weaker than that of some other languages. Some examples of omissions in those older engines include:

look-behinds
named atomic groups
most of the modifiers (though the key ones are there - global, case-insensitive and multi-line)
crucially, the ability to capture sub-groups whilst also matching globally
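
For readers on newer engines: the ES2018 edition of the language added look-behind assertions, and ES2020 added String.prototype.matchAll, which yields the capture groups for every global match. The sketch below assumes a runtime that implements both (an older engine will throw a SyntaxError on the look-behind). The function names are hypothetical.

```javascript
// Assumes ES2018 look-behind support; linkHandlesLookbehind is a made-up name.
function linkHandlesLookbehind(tweetText) {
  // (?<!\S) asserts that the '@' is not preceded by a non-space character,
  // so nothing before the '@' is consumed and nothing has to be put back.
  return tweetText.replace(/(?<!\S)@(\w+)/g, '<a href="http://twitter.com/$1">@$1</a>');
}

// Assumes ES2020 matchAll; collects group 1 (the bare username) of each match.
function listHandles(tweetText) {
  return Array.from(tweetText.matchAll(/(?<!\S)@(\w+)/g), function (m) {
    return m[1];
  });
}

console.log(linkHandlesLookbehind('Hey there @twitteruser, my email address is address@gmail.com'));
console.log(listHandles('Hey @a and @b, mail x@y.com'));
```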

You're overcomplicating this; it's not that complicated. You can do everything at once in a single line of code with the pattern \W@(\w+)

Live demo http://jsfiddle.net/Victornpb/Wugvd/

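//make twitter username links
function linkTwitterNames(elm){
    // (^|\W) also matches a handle at the very start of the text, and $1 puts
    // back the character before the '@' instead of replacing it with a space
    elm.innerHTML = elm.innerHTML.replace(/(^|\W)@(\w+)/g, '$1<a class="twitter" href="http://twitter.com/$2" target="_blank">@$2</a>');
}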
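
One thing to watch with a pattern that starts with a bare \W is that the \W consumes the character before the handle, and it cannot match a handle at the very start of the text. The sketch below compares the two forms on plain strings rather than on innerHTML. The shortened <@name> replacement and the variable names are placeholders chosen for illustration.

```javascript
const text = '@alice (@bob)';

// Bare \W: '@alice' at the start is skipped (nothing precedes it),
// and the '(' before '@bob' is swallowed by the replacement.
const bare = text.replace(/\W@(\w+)/g, ' <@$1>');

// Capturing the prefix: '^' covers the start of the string,
// and $1 restores whatever character preceded the '@'.
const kept = text.replace(/(^|\W)@(\w+)/g, '$1<@$2>');

console.log(bare);
console.log(kept);
```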

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow