JavaScript replace() regular expression too greedy

Question 1

The pattern contains several mistakes and leaves room for improvement:

/<
\s*    #  not needed (browsers don't recognize "< a" as an "a" tag)

a      #  To avoid confusing an "a" tag with the start of an "abbr" tag, add
       # a word boundary or, better, a "\s+", since at least one whitespace
       # character must follow the tag name before the href attribute.

.      #  The dot matches everything except newlines, so the pattern fails if
       # an "a" tag spans several lines. JavaScript had no "singleline" or
       # "dotall" mode before the ES2018 "s" flag, so use `[\s\S]` instead:
       # it matches any character (every space plus every non-space).

*      #  Quantifiers are greedy by default. ".*" matches up to the end of the
       # line, and "[\s\S]*" up to the end of the string!
       # The regex engine then has to backtrack a lot until it finds the last
       # "href" (which is not always the one you want).

href=  # You can add a word boundary before the "h" and allow optional spaces
       # around the equals sign to make the pattern more robust: \bhref\s*=\s*

\"     #  Doesn't need to be escaped. As Markasoftware points out, an attribute
       # value is not always enclosed in double quotes: it can use single
       # quotes or no quotes at all. (1)
(.*?)
\"     # same thing
.*     # same greediness problem: matches everything up to the last ">"
>(.*?)<\/a>/gi
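
To see the points about the tag name and the dot in action, here is a minimal sketch. The sample strings are made up for illustration, and the patterns are cut down to the part before href.

```javascript
// Hypothetical sample: an "a" tag split across two lines.
const multiLine = '<a\n href="http://example.com">text</a>';

// "." stops at the newline, so this pattern cannot reach href.
const dotMatches = /<a.*?href/.test(multiLine);

// [\s\S] matches every character, newlines included.
const classMatches = /<a[\s\S]*?href/.test(multiLine);

// Hypothetical sample: an "abbr" tag that happens to contain the text "href".
const abbr = '<abbr title="href">x</abbr>';

// Without "\s+" after the "a", the "abbr" tag is matched by mistake.
const looseMatchesAbbr = /<a.*?href/.test(abbr);
const strictMatchesAbbr = /<a\s+[\s\S]*?href/.test(abbr);

console.log(dotMatches, classMatches);             // false true
console.log(looseMatchesAbbr, strictMatchesAbbr);  // true false
```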

(1) -> About the quotes and the href attribute value:

To deal with single, double or no quotes you can use a capturing group and a backreference:

\bhref\s*=\s*(["']?)([^"'\s>]*)\1

details:

\bhref\s*=\s*
(["']?)     # capture group 1: a single quote, a double quote, or nothing
([^"'\s>]*) # capture group 2: anything that is not a quote (to stop before a
            # possible closing quote), not a whitespace character (URLs don't
            # contain spaces, though JavaScript code can) and not a ">" (to
            # stop before the end of the tag when no quotes are used).
\1          # backreference to capture group 1
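
As a quick check of this subpattern, the sketch below runs it against the three quoting styles. The sample tags and the names hrefPattern, samples and urls are made up for illustration.

```javascript
// The href subpattern on its own; group 2 holds the attribute value.
const hrefPattern = /\bhref\s*=\s*(["']?)([^"'\s>]*)\1/i;

// Hypothetical opening tags: double quotes, single quotes, no quotes.
const samples = [
  '<a href="http://example.com/a">',
  "<a href='http://example.com/b'>",
  '<a href=http://example.com/c>',
];

// Pull group 2 out of each match.
const urls = samples.map((tag) => tag.match(hrefPattern)[2]);

console.log(urls);
// [ 'http://example.com/a', 'http://example.com/b', 'http://example.com/c' ]
```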

Note that this subpattern adds a capturing group: the URL is now in capture group 2 and the content between the "a" tags is in capture group 3. Remember to change $1 to $2 and $2 to $3 in your replacement string.

Finally, you can write your pattern like this:

aString.replace(/<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
               '$3 (Link->$2)');
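
Here is a rough end-to-end check of the full pattern on a made-up input (the html string and the variable names are assumptions). The replacement reads the URL from group 2, because group 1 only holds the quote character, and the link text from group 3.

```javascript
const linkPattern =
  /<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi;

// Hypothetical input: two links on one line, with different quoting styles.
const html =
  'See <a class="x" href="http://example.com/one">one</a> and ' +
  "<a href='http://example.com/two' target=_blank>two</a>.";

// $3 is the link text, $2 is the URL.
const plain = html.replace(linkPattern, '$3 (Link->$2)');

console.log(plain);
// See one (Link->http://example.com/one) and two (Link->http://example.com/two).
```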

Question 2

Use

href="([^"]+)"

instead of

href=\"(.*?)\"

The negated character class grabs every character up to the next " and can never run past it.
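
To illustrate the difference, here is a small sketch. The tag and the trailing " id" in both patterns are made up; the suffix is only there to force the engine to backtrack.

```javascript
// Hypothetical tag: the text after the href value does not start with " id".
const tag = '<a href="a.html" class="b" id="c">';

// The lazy dot may still cross quotes when that is the only way to match.
const lazyMatch = tag.match(/href="(.*?)" id/);

// The negated class can never cross a quote, so this fails cleanly instead.
const negatedMatch = tag.match(/href="([^"]+)" id/);

console.log(lazyMatch[1]);  // a.html" class="b
console.log(negatedMatch);  // null
```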

That said, it would probably be easier to implement something like Markdown syntax. That way you would not have to worry about stripping out the wrong tags: strip all HTML from the input and replace the Markdown with its HTML counterpart when displaying the text.

For instance on SO you can make a link by just using

[link text](http://linkurl.com)

and a regex to do the replace would be

var displayText = "This is just some text [and this is a link](http://example.com) and then more text";
var linkMarkdown = /\[([^\]]+)\]\(([^\)]+)\)/g;  // "g" converts every link, not just the first
displayText = displayText.replace(linkMarkdown, '<a href="$2">$1</a>');  // replace() returns a new string
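
Putting the two steps together (strip all tags, then convert the Markdown links) could look roughly like this. The helper names stripTags and renderLinks are made up, and the tag stripper is deliberately naive: treat it as a sketch, not a sanitizer.

```javascript
// Hypothetical helper: remove anything that looks like an HTML tag.
function stripTags(text) {
  return text.replace(/<[^>]*>/g, '');
}

// Hypothetical helper: strip tags first, then turn [text](url) into a link.
function renderLinks(text) {
  return stripTags(text).replace(
    /\[([^\]]+)\]\(([^)]+)\)/g,
    '<a href="$2">$1</a>'
  );
}

const rendered = renderLinks(
  '<b>Bold</b> and [a link](http://example.com) and [another](http://example.org)'
);

console.log(rendered);
// Bold and <a href="http://example.com">a link</a> and <a href="http://example.org">another</a>
```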

Or use an existing library that does the conversion for you.

Question 3

Thank you all for the suggestions; they helped me a lot and gave me lots of ideas for improving it.

But I think I found the specific cause of the original regex failing. Casimir's answer touches on it, but I didn't understand it until I happened upon this fix.

I had been looking in the wrong place for the problem, here:

/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi
                       ^

I was able to fix my original regex by inserting a question mark after the a.* that precedes href, like this:

/<\s*a.*?href=\"(.*?)\".*>(.*?)<\/a>/gi
        ^
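
Here is a rough comparison of the two patterns on made-up inputs (the sample strings and variable names are assumptions). It shows the case the extra question mark fixes, and also that the second .* is still greedy, so two links on the same line are still merged into one match.

```javascript
const original = /<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi;
const fixed    = /<\s*a.*?href=\"(.*?)\".*>(.*?)<\/a>/gi;

// Hypothetical input: one tag with a second href-like attribute.
const oneTag = '<a href="http://a.example" data-href="http://b.example">A</a>';

// Greedy "a.*" backtracks to the LAST href= in the line.
const originalResult = oneTag.replace(original, '$2 (Link->$1)');
// Lazy "a.*?" stops at the FIRST href=.
const fixedResult = oneTag.replace(fixed, '$2 (Link->$1)');

// Hypothetical input: two links on one line. The remaining greedy ".*"
// before ">" runs on to the second link, so both are replaced as one match.
const twoLinks =
  '<a href="http://a.example">A</a> <a href="http://b.example">B</a>';
const mergedResult = twoLinks.replace(fixed, '$2 (Link->$1)');

console.log(originalResult);  // A (Link->http://b.example)
console.log(fixedResult);     // A (Link->http://a.example)
console.log(mergedResult);    // B (Link->http://a.example)
```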

I do plan on using the other suggestions here to improve my pattern further.

-- C