Question

Hello, I am trying to work out a regular expression that replaces text in an innerHTML block, to provide local text formatting similar in operation to Google IM.

Where: 
_Italics_
!Underline!
*Bold*
-Strike-

One of the conditions is that the text must be wrapped by the symbol, but if a space follows immediately after the opening symbol then the trigger condition is voided; so * bold* would not be bolded and: * notboldbut this is bold

The innerHTML will contain URLs which have already been converted to hrefs, so in order not to mess with them I have added the following to the front of my regex.

    (?!(?!.*?<a)[^<]*<\/a>)

The following JavaScript does not capture all the results, and the results vary depending on the order in which I run the replacements.

var boldPattern          = /(?!(?!.*?<a)[^<]*<\/a>)\*([^\s]+[\s\S]?[^\s]+)\*([\s_!-]?)/gi;
var italicsPattern       = /(?!(?!.*?<a)[^<]*<\/a>)_([^\s]+[\s\S]?[^\s]+)_([\s-!\*]?)/gi;
var strikethroughPattern = /(?!(?!.*?<a)[^<]*<\/a>)-([^\s]+[\s\S]?[^\s]+)-([\s_!\*]?)/gi;
var underlinePattern     = /(?!(?!.*?<a)[^<]*<\/a>)!([^\s]+[\s\S]?[^\s]+)!([\s-_\*]?)/gi;
str = str.replace(strikethroughPattern, '<span style="text-decoration:line-through;">$1</span>$2');
str = str.replace(boldPattern, '<span style="font-weight:bold;">$1</span>$2');
str = str.replace(underlinePattern, '<span style="text-decoration:underline;">$1</span>$2');
str = str.replace(italicsPattern, '<span style="font-style:italic;">$1</span>$2');

The test data, covering all 24 ordered selections of 3 of the 4 styles, looks like:

1 _-*ISB*-_ 2 _-!ISU!-_ 3 _*-IBS-*_ 4 _*!IBU!*_
5 _!-IUS-!_ 6 _!*IUB*!_ 7 -_*SIB*_- 8 -_!SIU!_-
9 -*_SBI_*- 10 -*!SBU!*- 11 -!_SUI_!- 12 -!*SUB*!-
13 *_-BIS-_* 14 *_!BIU!_* 15 *-_BSI_-* 16 *-!BSU!-*
17 *!_BUI_!* 18 *!-BUS-!* 19 !_-UIS-_! 20 !_*UIB*_!
21 !-_USI_-! 22 !-*USB*-! 23 !*_UBI_*! 24 !*-UBS-*!
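
For anyone who wants to regenerate this test data rather than type it, here is a minimal sketch. The helper names (`orderedSelections`, `buildTestCase`) are hypothetical and not from the original post; the style order I, S, B, U is simply the order the list above appears to follow.

```javascript
// Style letters in the order used by the test data, with their trigger symbols.
var STYLES = [
    { letter: "I", symbol: "_" },
    { letter: "S", symbol: "-" },
    { letter: "B", symbol: "*" },
    { letter: "U", symbol: "!" }
];

// Hypothetical helper: returns every ordered selection of `size` items.
function orderedSelections(items, size) {
    if (size === 0) return [[]];
    var result = [];
    items.forEach(function (item, index) {
        // Everything except the item just picked is still available.
        var rest = items.slice(0, index).concat(items.slice(index + 1));
        orderedSelections(rest, size - 1).forEach(function (tail) {
            result.push([item].concat(tail));
        });
    });
    return result;
}

// Wrap the joined letters in the symbols, outermost style first,
// closing them in reverse order.
function buildTestCase(selection) {
    var open = selection.map(function (s) { return s.symbol; }).join("");
    var label = selection.map(function (s) { return s.letter; }).join("");
    var close = open.split("").reverse().join("");
    return open + label + close;
}

var testCases = orderedSelections(STYLES, 3).map(buildTestCase);
```

Passing 4 instead of 3 should give the 24 fully nested cases asked about below.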

Can you even have a 4 level deep nested style span like any of the 24 permutations where all 4 modes are selected like:

    -!_*SUIB*_!-

Thanks I've been fighting this for about a week.

Bonus points for avoiding bad feedback from Mozilla for "Markup should not be passed to innerHTML dynamically." (I don't see how that might be possible when one is changing the formatting).
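
On the innerHTML warning, one possible direction is to build DOM nodes directly instead of assigning a markup string. The sketch below is only an illustration under some loud assumptions: it handles a single level of formatting (no nesting), it expects plain text rather than text that already contains anchors, and `buildFormattedFragment` and `TAGS` are made-up names. It uses only `createDocumentFragment`, `createElement`, `createTextNode`, `appendChild` and `textContent`.

```javascript
// Map each trigger symbol to the element that renders it.
var TAGS = { "*": "b", "_": "i", "-": "s", "!": "u" };

// Hypothetical helper: builds nodes for one level of formatting
// without ever assigning markup to innerHTML.
// `doc` is the document object to create nodes with.
function buildFormattedFragment(doc, text) {
    var fragment = doc.createDocumentFragment();
    // A symbol, then a non-space character, then lazily up to the same symbol.
    var rx = /([*_!-])(\S.*?)\1/g;
    var last = 0;
    var m;
    while ((m = rx.exec(text)) !== null) {
        // Plain text between the previous match and this one.
        if (m.index > last) {
            fragment.appendChild(doc.createTextNode(text.slice(last, m.index)));
        }
        var el = doc.createElement(TAGS[m[1]]);
        el.textContent = m[2];
        fragment.appendChild(el);
        last = rx.lastIndex;
    }
    // Trailing plain text after the final match.
    if (last < text.length) {
        fragment.appendChild(doc.createTextNode(text.slice(last)));
    }
    return fragment;
}
```

In a page, the returned fragment could be attached with `appendChild` on the target element, after the element's old text node has been removed.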

Thanks a million regex wizards! I am in your debt.

mwolfe.

Update

Using the same href detection as above and @talemyn's help we are now at:

var boldPattern          = /(?!(?!.*?<a)[^<]*<\/a>)\*([^\s][^\*]*)\*/gi;
var italicsPattern       = /(?!(?!.*?<a)[^<]*<\/a>)_([^\s][^_]*)_/gi;
var strikethroughPattern = /(?!(?!.*?<a)[^<]*<\/a>)-([^\s][^-]*)-/gi;
var underlinePattern     = /(?!(?!.*?<a)[^<]*<\/a>)!([^\s][^!]*)!/gi;
str = str.replace(strikethroughPattern, '<s>$1</s>');
str = str.replace(italicsPattern, '<span style="font-style:italic;">$1</span>');
str = str.replace(boldPattern, '<strong>$1</strong>');
str = str.replace(underlinePattern, '<u>$1</u>');

Which seems to cover an extreme example:

    _wow *a real* !nice *person! on -stackoverflow* figured- it out_ cool beans.

I think one could use the style spans and do a regex lookback to determine the previous unclosed span, close it, open a new span with old format plus new attribute, close when supposed and open a new span to finish the formatting .. but that could get messy or impossible to do with regular expressions as @NovaDenizen points out.
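
To make that close-and-reopen idea concrete, here is a rough sketch that avoids regex lookbehind altogether and walks the string character by character, emitting flat spans that carry the combined style of whatever is currently switched on. It is only an illustration: `styleFor` and `formatFlat` are hypothetical names, the input is assumed to be plain text (no existing anchors or tags, no HTML escaping), and a symbol is treated as an opener whenever it is followed by a non-space character and the same symbol occurs again later, so hyphenated words would still trip it.

```javascript
// Build one style string for the set of currently active symbols.
// Underline and line-through share text-decoration, so they are merged.
function styleFor(active) {
    var parts = [];
    var decorations = [];
    if (active["*"]) parts.push("font-weight:bold");
    if (active["_"]) parts.push("font-style:italic");
    if (active["-"]) decorations.push("line-through");
    if (active["!"]) decorations.push("underline");
    if (decorations.length) parts.push("text-decoration:" + decorations.join(" "));
    return parts.join(";");
}

// Hypothetical helper: toggles styles as symbols are met and wraps each
// run of text in a single span, so spans are never nested or overlapped.
function formatFlat(text) {
    var active = {};
    var out = "";
    var run = "";

    function flush() {
        if (run === "") return;
        var style = styleFor(active);
        out += style ? '<span style="' + style + ';">' + run + "</span>" : run;
        run = "";
    }

    for (var i = 0; i < text.length; i++) {
        var ch = text[i];
        var isDelimiter = "*_-!".indexOf(ch) !== -1;
        var closes = isDelimiter && active[ch];
        var opens = isDelimiter && !active[ch] &&
            i + 1 < text.length && !/\s/.test(text[i + 1]) &&
            text.indexOf(ch, i + 1) !== -1;
        if (closes || opens) {
            flush();                  // emit the run under the old style set
            active[ch] = !active[ch]; // then switch the style on or off
        } else {
            run += ch;                // ordinary character, or a voided symbol
        }
    }
    flush();
    return out;
}
```

With this shape, overlapping input such as `a *b _c* d_ e` comes out as three adjacent spans (bold, bold plus italic, italic) rather than as crossed tags.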

Thank you for all your help. If there are any improvements please let me know. NB: I was unable to use and as the CSS on the site would not render it. Can that be overloaded? [This is for a firefox/greasemonkey/chrome plugin]

UPDATE (almost) FINAL

Using my 'broken' test phrase as an example, as @MikeM correctly stated, it would render correctly (minus the underline) in Google IM whether nested properly or not. Looking at the HTML output for that text in Google IM, I noticed that it did not preformat the string but simply did a substitution as required.

So after looking at the site code, which was using a reset CSS to remove the default styling, I needed to insert the CSS formatting via JavaScript. Stack Overflow to the rescue: https://stackoverflow.com/questions/707565/how-do-you-add-css-with-javascript and https://stackoverflow.com/questions/20107/yui-reset-css-makes-strongemthis-not-work-em-strong

So my solution now looks like:

....
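var css = document.createElement("style");
css.type = "text/css";
// Restore the bold/italic rendering that the site's reset stylesheet removes.
// textContent is enough here, since the rules are plain text, not markup.
css.textContent = "strong, b, strong *, b * { font-weight: bold !important; } " +
                  "em, i, em *, i * { font-style: italic !important; }";
document.body.appendChild(css);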
 ....
var boldPattern          = /(?!(?!.*?<a)[^<]*<\/a>)\*([^\s][^\*]*)\*/gi;
var italicsPattern       = /(?!(?!.*?<a)[^<]*<\/a>)_([^\s][^_]*)_/gi;
var strikethroughPattern = /(?!(?!.*?<a)[^<]*<\/a>)-([^\s][^-]*)-/gi;
var underlinePattern     = /(?!(?!.*?<a)[^<]*<\/a>)!([^\s][^!]*)!/gi;
str = str.replace(strikethroughPattern, '<s>$1</s>');
str = str.replace(italicsPattern, '<i>$1</i>');
str = str.replace(boldPattern, '<b>$1</b>');
str = str.replace(underlinePattern, '<u>$1</u>');
.....

And tada it mostly works!

UPDATE FINAL SOLUTION

After a last-minute simplification of the anchor element check from @MikeM, and combining the conditions from another Stack Overflow post, we have arrived at a complete working solution.

I also needed to add in a check for a one char style with closing symbol, since we were replacing trigger tokens side by side.

As @acheong87 reminded me, be careful with \w as it includes the _ (so \W does not match it). The _ was therefore added explicitly to the wrapping conditionals for all but the italicsPattern.

var boldPattern          = /(?![^<]*<\/a>)(^|<.>|[\s\W_])\*(\S.*?\S)\*($|<\/.>|[\s\W_])/g;
var italicsPattern       = /(?![^<]*<\/a>)(^|<.>|[\s\W])_(\S.*?\S)_($|<\/.>|[\s\W])/g;
var strikethroughPattern = /(?![^<]*<\/a>)(^|<.>|[\s\W_])-(\S.*?\S)-($|<\/.>|[\s\W_])/gi;
var underlinePattern     = /(?![^<]*<\/a>)(^|<.>|[\s\W_])!(\S.*?\S)!($|<\/.>|[\s\W_])/gi;
str = str.replace(strikethroughPattern, '$1<s>$2</s>$3');
str = str.replace(italicsPattern, '$1<i>$2</i>$3');
str = str.replace(boldPattern, '$1<b>$2</b>$3');
str = str.replace(underlinePattern, '$1<u>$2</u>$3');
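
As a usage sketch, the four replacements can be wrapped in a single function and tried on a few inputs. The name `formatMessage` is hypothetical; the patterns and the replacement order are copied unchanged from the lines above.

```javascript
// Hypothetical wrapper around the final patterns; returns the formatted string.
function formatMessage(str) {
    var boldPattern          = /(?![^<]*<\/a>)(^|<.>|[\s\W_])\*(\S.*?\S)\*($|<\/.>|[\s\W_])/g;
    var italicsPattern       = /(?![^<]*<\/a>)(^|<.>|[\s\W])_(\S.*?\S)_($|<\/.>|[\s\W])/g;
    var strikethroughPattern = /(?![^<]*<\/a>)(^|<.>|[\s\W_])-(\S.*?\S)-($|<\/.>|[\s\W_])/gi;
    var underlinePattern     = /(?![^<]*<\/a>)(^|<.>|[\s\W_])!(\S.*?\S)!($|<\/.>|[\s\W_])/gi;

    // Strikethrough goes first, before any inserted markup exists.
    str = str.replace(strikethroughPattern, '$1<s>$2</s>$3');
    str = str.replace(italicsPattern, '$1<i>$2</i>$3');
    str = str.replace(boldPattern, '$1<b>$2</b>$3');
    str = str.replace(underlinePattern, '$1<u>$2</u>$3');
    return str;
}

var simple = formatMessage("say *hello world* now");
// "say <b>hello world</b> now"

var voided = formatMessage("* not bold*");
// unchanged, because a space follows the opening symbol

var nested = formatMessage("*_both_*");
// "<b><i>both</i></b>"

var linked = formatMessage('<a href="#">_in link_</a> and _out here_');
// only the text outside the anchor is italicised
```

Note that `(\S.*?\S)` needs at least two characters between the symbols, so a one-character run such as `*a*` is left alone by these patterns as written.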

Thank you so much everyone (@MikeM, @talemyn, @acheong87, et al.)

mwolfe.


Solution 2

I recommend that you remove the inner negative look-aheads from your negative look-aheads:

/(?!(?!.*?<a)[^<]*<\/a>)_it_/.test( ' _it_ <a></a>' );         // true  (correct)
/(?!(?!.*?<a)[^<]*<\/a>)_it_/.test( '<a> _it_ </a>' );         // false (correct)
/(?!(?!.*?<a)[^<]*<\/a>)_it_/.test( '<a> _it_ </a> <a></a>' ); // true  (wrong)

/(?![^<]*<\/a>)_it_/.test( ' _it_ <a></a>' );                  // true  (correct)
/(?![^<]*<\/a>)_it_/.test( '<a> _it_ </a>' );                  // false (correct)
/(?![^<]*<\/a>)_it_/.test( '<a> _it_ </a> <a></a>' );          // false (correct)
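
The same simplified look-ahead can be tried in an actual replace. This is just a small sketch with made-up variable names; it also shows one limitation worth knowing about: because `[^<]*` stops at the first `<`, text wrapped in another tag inside the anchor is not recognised as being inside it.

```javascript
// Simplified look-ahead: refuse a match if a closing </a> follows
// before any other tag starts.
var outsideAnchor = /(?![^<]*<\/a>)_it_/g;

var input = '<a> _it_ </a> _it_ <a></a>';
var output = input.replace(outsideAnchor, '<i>it</i>');
// '<a> _it_ </a> <i>it</i> <a></a>'

// Limitation: the </b> is met before the </a>, so the look-ahead
// does not see the anchor and the pattern matches inside the link.
var matchesInsideNestedTag = /(?![^<]*<\/a>)_it_/.test('<a><b>_it_</b></a>');
// true
```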

OTHER TIPS

Try these:

var boldPattern          = /\*([^\s][^\*]*)\*/gi;
var italicsPattern       = /_([^\s][^_]*)_/gi;
var strikethroughPattern = /-([^\s][^-]*)-/gi;
var underlinePattern     = /!([^\s][^!]*)!/gi;

Though, in the replace, don't use $2, as there is no second capture group in those regex patterns.
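
A minimal sketch of how two of these patterns behave with `$1` only (the sample strings are made up for illustration). The second call shows a side effect to watch for: with no condition on what surrounds the symbols, ordinary hyphenated words are treated as strikethrough.

```javascript
var boldPattern          = /\*([^\s][^\*]*)\*/gi;
var strikethroughPattern = /-([^\s][^-]*)-/gi;

// The space after the second opening * voids that trigger.
var bolded = "a *bold* and * not bold*".replace(boldPattern, "<b>$1</b>");
// "a <b>bold</b> and * not bold*"

// Hyphens inside words are picked up as delimiters too.
var struck = "well-known and so-called".replace(strikethroughPattern, "<s>$1</s>");
// "well<s>known and so</s>called"
```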

The following shouldn't create incorrectly nested spans:

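var old;
// The look-ahead skips delimiters inside a tag or inside an anchor.
// $1 is the delimiter; $2 is the text up to its next occurrence, which must
// not start with whitespace or with the same delimiter.
var rx = /(?![^<]*(?:>|<\/a>))([!*_-])((?!\1)[^<>\s][^<>]*?)\1/g;

// Repeat until a pass changes nothing, so inner delimiters are picked up on later passes.
while ( old != str ) {
    old = str;
    str = str.replace( rx, function ( $0, $1, $2 ) {
        var style = $1 == '!' ? "text-decoration:underline"
                  : $1 == '*' ? "font-weight:bold"
                  : $1 == '_' ? "font-style:italic"
                              : "text-decoration:line-through";

        return  '<span style="' + style + ';">' + $2 + '</span>';
    } );
}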

Because it replaces the outer delimiters first, there should never be any spans inserted inside delimiters.

Further explanation on request.
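
As a usage sketch, the loop can be wrapped in a function and run on the four-level example from the question. The name `applyStyles` is hypothetical; the regex and the style mapping are the ones from this answer.

```javascript
// Hypothetical wrapper: repeats the replacement until a pass changes nothing.
function applyStyles(str) {
    var rx = /(?![^<]*(?:>|<\/a>))([!*_-])((?!\1)[^<>\s][^<>]*?)\1/g;
    var old;
    while (old != str) {
        old = str;
        str = str.replace(rx, function ($0, $1, $2) {
            var style = $1 == '!' ? "text-decoration:underline"
                      : $1 == '*' ? "font-weight:bold"
                      : $1 == '_' ? "font-style:italic"
                                  : "text-decoration:line-through";
            return '<span style="' + style + ';">' + $2 + '</span>';
        });
    }
    return str;
}

var fourLevels = applyStyles("-!_*SUIB*_!-");
// Four nested spans, outermost first: line-through, underline, italic, bold.

var voided = applyStyles("* not bold*");
// unchanged, because a space follows the opening symbol
```

Each pass wraps only the outermost remaining pair, and the look-ahead keeps the hyphens in the inserted `style` attributes from being treated as strikethrough triggers on the next pass.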

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow