Question

I am trying to sanitize an HTML input field. I want to keep some of the tags, but not all of them, so I can't just use .text() when reading the element value. I am having a bit of trouble with a regular expression in JavaScript in Safari. Here's the snippet of code (I copied this bit of regex from another SO thread answer):

aString.replace (/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi, '$2 (Link->$1)' ) ;

Here is the sample input that is failing:

<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p><p class="p1"><a href="http://blar.pirates.net/black/ship.html">http://blar.pirates.net/black/ship.html</a></p>

The idea is that the href will get pulled out and output as plain text next to the text that would have been linked. So the above output should ultimately be something like:

Go here please (Link->http://blar.pirates.net/black/ship.html)
http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)

However, the regex grabs everything up to the second </a> tag on the first match, so I lose the first line of output. (In fact, it keeps grabbing for as long as the anchor elements are adjacent.) The input is one long string, not split over lines with a CR/LF or anything.
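For reference, the behaviour described above can be reproduced with a small script. This is only a sketch: the variable names (href, input, greedyPattern) are made up for illustration, and the input is the sample from the question built as one long string.

```javascript
// Sample input from the question, kept as one long string.
const href = 'http://blar.pirates.net/black/ship.html';
const input =
  '<a href="' + href + '">Go here please.</a></p><p class="p1">' +
  '<a href="' + href + '">' + href + '</a></p>';

// The pattern from the question, unchanged.
const greedyPattern = /<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi;

// The greedy ".*" after "a" runs to the end of the string and backtracks
// only as far as the LAST href, so both anchors collapse into one match.
const matches = input.match(greedyPattern);
const output = input.replace(greedyPattern, '$2 (Link->$1)');

console.log(matches.length);                    // 1
console.log(output.includes('Go here please')); // false
```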

I have tried using a non-greedy flag like this (note the 2nd question mark):

/<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/ig

But that didn't seem to change anything (at least not in the few tester/parsers I tried, one of which is here: http://refiddle.com). Have also tried the /U flag but that didn't help (or these parsers didn't recognize it).

Any suggestions?


Solution

The pattern contains several mistakes, and there is room for improvement:

/<
\s*    # not needed (browsers don't recognize "< a" as an "a" tag)

a      # to avoid confusing an "a" tag with the start of an "abbr" tag, add a
       # word boundary or, better, a "\s+", since at least one whitespace
       # character follows the tag name.

.      # The dot matches everything except newlines, so the pattern fails if an
       # "a" tag spans several lines. JavaScript engines without the "s"
       # (dotAll) flag, added in ES2018, have no "singleline" mode, so use
       # `[\s\S]` instead: it matches any character (whitespace or not).

*      # Quantifiers are greedy by default. ".*" matches up to the end of the
       # line, and "[\s\S]*" matches up to the end of the string!
       # The regex engine then has to backtrack a lot until it finds the last
       # "href" (which is not always the one you want).

href=  # You can add a word boundary before the "h" and allow optional spaces
       # around the equals sign to make the pattern more robust: \bhref\s*=\s*

\"     # The quote doesn't need to be escaped. Also, as Markasoftware points out,
       # an attribute value is not always between double quotes: it can use
       # single quotes or no quotes at all. (1)
(.*?)
\"     # same thing
.*     # same problem as above: greedy, so it matches up to the last ">"
>(.*?)<\/a>/gi
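Two of the points above (greedy versus lazy quantifiers, and the dot versus [\s\S]) can be checked in isolation. The strings below are invented test inputs, not taken from the question, and the snippet is only a minimal sketch.

```javascript
// Greedy vs lazy: ".*" takes as much as it can, ".*?" as little as it can.
const twoTags = '<b>one</b><b>two</b>';
const greedyMatch = twoTags.match(/<b>.*<\/b>/)[0];  // the whole string
const lazyMatch = twoTags.match(/<b>.*?<\/b>/)[0];   // '<b>one</b>'

// Dot vs [\s\S]: the dot stops at a newline, [\s\S] does not.
const multiLineTag = '<a class="c"\n   href="x.html">text</a>';
const dotResult = multiLineTag.match(/<a\s+.*?href="(.*?)"/);          // null
const anyCharResult = multiLineTag.match(/<a\s+[\s\S]*?href="(.*?)"/); // finds x.html

console.log(greedyMatch, lazyMatch, dotResult, anyCharResult[1]);
```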

(1) -> About the quotes and the href attribute value:

To deal with single, double or no quotes you can use a capturing group and a backreference:

\bhref\s*=\s*(["']?)([^"'\s>]*)\1

details:

\bhref\s*=\s*
(["']?)     # capture group 1: a single quote, a double quote, or nothing
([^"'\s>]*) # capture group 2: anything that is not a quote (to stop before a
            # possible closing quote), whitespace, or ">" (to stop at the first
            # space or at the end of the tag when no quotes are used). URLs
            # don't contain spaces, although JavaScript code in an href can.
\1          # backreference to capture group 1

Note that if you use this subpattern you add a capturing group: the URL is now in capture group 2 and the content between the a tags is in capture group 3. Remember to change $1 to $2 and $2 to $3 in your replacement string.
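As a quick check of this subpattern on its own, here is a small sketch. The helper name extractHref and the test tags are hypothetical, chosen only to cover double quotes, single quotes, no quotes, and spaces around the equals sign.

```javascript
// Subpattern from above: group 1 is the optional quote, group 2 the value.
const hrefPattern = /\bhref\s*=\s*(["']?)([^"'\s>]*)\1/i;

// Hypothetical helper: returns the href value of a tag string, or null.
function extractHref(tag) {
  const m = hrefPattern.exec(tag);
  return m ? m[2] : null;
}

console.log(extractHref('<a href="a.html">'));             // a.html
console.log(extractHref("<a href='b.html'>"));             // b.html
console.log(extractHref('<a href=c.html>'));               // c.html
console.log(extractHref('<a class="x" href = "d.html">')); // d.html
console.log(extractHref('<a name="top">'));                // null
```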

In the end, you can write your pattern like this (the URL is in $2, because $1 only holds the optional quote):

aString.replace(/<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
               '$3 (Link->$2)');
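Here is a sketch of that pattern applied to the sample input from the question. The variable names are illustrative. The replacement string references $2 for the URL because capture group 1 only holds the optional quote character.

```javascript
const shipUrl = 'http://blar.pirates.net/black/ship.html';
const sampleHtml =
  '<a href="' + shipUrl + '">Go here please.</a></p><p class="p1">' +
  '<a href="' + shipUrl + '">' + shipUrl + '</a></p>';

// Group 1: optional quote, group 2: URL, group 3: link text.
const anchorPattern =
  /<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi;

const flattened = sampleHtml.replace(anchorPattern, '$3 (Link->$2)');
console.log(flattened);
// Go here please. (Link->http://blar.pirates.net/black/ship.html)</p><p class="p1">http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)</p>
```

Each anchor is now replaced separately. The remaining p tags are left in place, since this pattern only targets a tags.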

OTHER TIPS

Use

href="([^"]+)"

instead of

href=\"(.*?)\"

Basically, this grabs every character up to the next " (the parentheses keep the URL available as a capture group).
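A small sketch of why a negated character class is a safe way to stop at the closing quote. The tag string is an invented example, and the greedy pattern is included only for contrast.

```javascript
const tagWithTitle = '<a href="one.html" title="t">';

// Greedy dot: runs on to the LAST double quote in the tag.
const greedyValue = tagWithTitle.match(/href="(.*)"/)[1];

// Negated class: cannot move past the first closing quote.
const negatedValue = tagWithTitle.match(/href="([^"]+)"/)[1];

console.log(greedyValue);  // one.html" title="t
console.log(negatedValue); // one.html
```

Note that this only constrains the attribute value. A greedy .* elsewhere in the pattern (for example between the a and the href) can still make a match span several tags.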

That said, it would probably be easier to implement something like Markdown syntax. That way you would not have to worry about stripping out the wrong tags: just strip all of them, and replace the Markdown with its HTML tag counterparts when displaying the text.

For instance on SO you can make a link by just using

[link text](http://linkurl.com)

and a regex to do the replace would be

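var displayText = "This is just some text [and this is a link](http://example.com) and then more text";
// The g flag replaces every link, not just the first one.
var linkMarkdown = /\[([^\]]+)\]\(([^\)]+)\)/g;
// replace() returns a new string, so keep the result.
var displayHtml = displayText.replace(linkMarkdown, '<a href="$2">$1</a>');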

Or use an existing library that does the conversion for you.

Thank you all for the suggestions; they helped me a lot and gave me lots of ideas for improving it.

But I think I found the specific cause of the original regex failing. Casimir's answer touches on it, but I didn't understand it until I happened upon this fix.

I had been looking in the wrong place for the problem, here:

/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi
                       ^

I was able to fix my query by inserting a question mark after the a.* that precedes href (while keeping the question mark before the > from my earlier attempt), like this:

/<\s*a.*?href=\"(.*?)\".*?>(.*?)<\/a>/gi
        ^
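To see which quantifiers matter, here is a rough comparison run against the sample input from the question. The variable and helper names are made up for illustration. On this particular input, making only one of the two .* lazy still yields a single match that spans both anchors, while making both lazy yields one match per anchor.

```javascript
const link = 'http://blar.pirates.net/black/ship.html';
const sample =
  '<a href="' + link + '">Go here please.</a></p><p class="p1">' +
  '<a href="' + link + '">' + link + '</a></p>';

const original        = /<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi;
const lazyBeforeClose = /<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/gi;  // first attempt
const lazyBeforeHref  = /<\s*a.*?href=\"(.*?)\".*>(.*?)<\/a>/gi;
const bothLazy        = /<\s*a.*?href=\"(.*?)\".*?>(.*?)<\/a>/gi;

// Hypothetical helper: number of separate matches found in the sample.
const countMatches = (re) => (sample.match(re) || []).length;

console.log(countMatches(original));        // 1
console.log(countMatches(lazyBeforeClose)); // 1
console.log(countMatches(lazyBeforeHref));  // 1
console.log(countMatches(bothLazy));        // 2
```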

I do plan on using the other suggestions here to improve my statement further.

-- C

Licensed under: CC-BY-SA with attribution