JavaScript RegEx matching a string inside HTML tags

Question 1

Possibly not the most efficient, but should do the job:

str.replace(/<([^>]+)>/g, function(m){ return m.replace(/&nbsp;/gi, ' '); });

Which should only touch the &nbsp; inside of <>

Question 2

First off, let's state again that when parsing (X)HTML with regex is the right answer, it's probably because the situation is seriously messed up. In this case you should find the guy who generated the corrupted HTML, rub his nose in it, and then make him fix the mess.

Otherwise, among other things, it will become your work, and you'll accept responsibility for any further mess.

That said, maybe the safest approach would be to look for

<([^<>]*)&nbsp;([^<>]*)>

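and replace it with <$1 $2> (JavaScript's replacement syntax; other regex flavors write it as <\1 \2>). The downside of this approach is that each pass fixes only one &nbsp; per tag, so you will have to do this repeatedly (if you have a tag with eight &nbsp;'s inside, you'll have to iterate the replacement eight times).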

So you'll also need a loop that performs the replace, and if the replaced text is identical to what it was before, then you're done and may exit the loop.
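
The loop described above could be sketched roughly as follows. This is only an illustration: the function name replaceNbspInTags and the sample string are made up for the example, and the /g flag is an assumption so that every tag gets one fix per pass.

```javascript
// Hypothetical helper (name chosen for this sketch, not taken from the answer).
function replaceNbspInTags(html) {
    // Assumption: /g so that each pass handles one &nbsp; in every tag.
    var pattern = /<([^<>]*)&nbsp;([^<>]*)>/g;
    var previous;
    do {
        previous = html;
        // The first group is greedy, so each pass replaces the LAST &nbsp; of a tag.
        html = html.replace(pattern, '<$1 $2>');
    } while (html !== previous); // nothing changed: no &nbsp; left inside any tag
    return html;
}

var sample = '<a&nbsp;href="x"&nbsp;title="y">text&nbsp;here</a>';
console.log(replaceNbspInTags(sample));
// -> <a href="x" title="y">text&nbsp;here</a>
```

The &nbsp; between the tags is left alone, and the loop always runs one extra pass just to confirm that nothing changed.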

This is not the most efficient way in terms of replacement speed, but it's more straightforward and simpler to handle. Also it helps in remembering that this is a kludgy fix :-)

The problem described in RoToRa's comment may be fixed in this particular case by modifying the outer expression:

<(\w[^<>]*)&nbsp;([^<>]*)>

so that it only accepts tags starting with a word character (\w matches a letter, a digit, or an underscore). 1 < 2 &nbsp; > 3 would then be rejected.
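
To see the difference between the two expressions, here is a small check. The variable names and sample strings are invented for the illustration. Because \w also matches digits, a string such as 1 <2&nbsp;> 3 (no space after the <) would still slip through, so this narrows the problem rather than eliminating it.

```javascript
// Illustrative comparison; names and inputs are assumptions made for this sketch.
var loose  = /<([^<>]*)&nbsp;([^<>]*)>/;
var strict = /<(\w[^<>]*)&nbsp;([^<>]*)>/;

var notATag  = '1 < 2 &nbsp; > 3';
var realTag  = '<td&nbsp;class="x">';
var digitTag = '1 <2&nbsp;> 3';

var results = {
    looseOnText:   loose.test(notATag),   // true: the false positive
    strictOnText:  strict.test(notATag),  // false: a space follows the <
    strictOnTag:   strict.test(realTag),  // true: a real tag still matches
    strictOnDigit: strict.test(digitTag)  // true: \w accepts the digit 2
};
console.log(results);
```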

The same "fix" applies to Ross McLellan's solution:

str.replace(/<(\w[^>]+)>/g, function(m){ return m.replace(/&nbsp;/gi, ' '); });

Performance-wise, Ross's solution is faster on small HTML chunks, and falls behind mine when the number of tags grows. That's because the search overhead is marginally larger for my solution, but then mine finds far fewer matches and fewer calls to replace() are actually made.

This modification might get the best of both worlds, but I haven't tested it:

str.replace(/<(\w[^<>]*&nbsp;[^<>]*)>/g,
    function(m) {
        return m.replace(/&nbsp;/gi, ' ');
    }
);
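
A quick way to try this variant is to wrap it in a function and feed it a string that mixes a real tag, a stray comparison, and an entity in the text. The wrapper name fixTagNbsp and the sample input are assumptions made for this sketch, and a single sample like this is no substitute for testing against real markup.

```javascript
// Hypothetical wrapper around the combined expression above.
function fixTagNbsp(str) {
    return str.replace(/<(\w[^<>]*&nbsp;[^<>]*)>/g, function (m) {
        // Only tags that actually contain &nbsp; reach this callback.
        return m.replace(/&nbsp;/gi, ' ');
    });
}

var input = '<a&nbsp;href="x"&nbsp;title="y">1 < 2 &nbsp; > 3&nbsp;</a>';
console.log(fixTagNbsp(input));
// -> <a href="x" title="y">1 < 2 &nbsp; > 3&nbsp;</a>
```

Both entities inside the opening tag are replaced in a single pass, while the ones in the text content are left as they were.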