Javascript Regex Subgroups

https://stackoverflow.com/questions/9357289

28-10-2019
|

Question

First off, don't link to the "Don't parse HTML with Regex" post :)

I've got the following HTML, which is used to display prices in various currencies, inc and ex tax:

<span id="price_break_12345" name="1">
    <span class="price">
        <span class="inc" >
            <span class="GBP">£25.00</span>
            <span class="USD" style="display:none;">$34.31</span>
            <span class="EUR" style="display:none;">27.92&nbsp;€</span>
        </span>
        <span class="ex"  style="display:none;">
            <span class="GBP">£20.83</span>
            <span class="USD" style="display:none;">$34.31</span>
            <span class="EUR" style="display:none;">23.27&nbsp;€</span>
        </span>
    </span>
    <span style="display:none" class="raw_price">25.000</span>
</span>

An AJAX call returns a single string of HTML, containing multiple copies of the above HTML, with the prices varying. What I'm trying to match with regex is:

Each block of the above HTML (as mentioned, it occurs multiple times in the return string)
The value of the name attribute on the outermost span

What I have so far is this:

var price_regex = new RegExp(/(<span([\s\S]*?)><span([\s\S]*?)>([\s\S]*?)<\/span><\/span\>)/gm);
console && console.log(price_regex.exec(product_price));

It matches the first price break once for each price break that occurs (so if there's name=1, name=5 and name=15 it matches name=1 3 times.

Whereabouts am I going wrong?

Solution 2

Thanks in large part to jfriend for making me realise why my regex was matching in a strange way (while (price_break = regex.exec(string)) instead of just exec'ing it once), I've got it working:

var price_regex = new RegExp(/<span[\s\S]*?name="([0-9]+)"[\s\S]*?><span[\s\S]*?>[\s\S]*?<\/span><\/span\>/gm);
var price_break;
while (price_break = price_regex.exec(strProductPrice))
{
    console && console.log(price_break);
}

I had a ton of useless () which were just clogging up the result set, so stripping them out made things a lot simpler.

The other thing, as mentioned above was that originally I was just doing

price_break = price_regex.exec(strProductPrice)

which runs the regex once, and returns the first match only (which I mistook for returning 3 copies of the first match, due to the ()s). By looping over them, it keeps evaluating the regex until all the matches have been exhausted, which I assumed it did normally, similar to PHP's preg_match.

OTHER TIPS

So, if you can count on the format of that first span in each block like this:

<span id="price_break_12345" name="1">

Then, how about you use code like this to cycle through all the matches. This code identifies the price_break_xxxx id value in that first span and then picks out the following name attribute:

var re = /id="price_break_\d+"\s+name="([^"]+)"/gm;
var match;
while (match = re.exec(str)) {
    console.log(match[1]);
}

You can see it work here: http://jsfiddle.net/jfriend00/G39ne/.

I used a converter to make three of your blocks of HTML into a single javascript string (to simulate what you get back from your ajax call) so I could run the code on it.

A more robust way to do this is to just use the browser's HTML parser to do all the work for you. Assuming you have the HTML in a string variable named `str', you can use the browser's parser like this:

function getElementChildren(parent) {
    var elements = [];
    var children = parent.childNodes;
    for (var i = 0, len = children.length; i < len; i++) {
        // collect element nodes only
        if (children[i].nodeType == 1) {
            elements.push(children[i]);
        }
    }
    return(elements);
}

var div = document.createElement("div");
div.innerHTML = str;
var priceBlocks = getElementChildren(div);
for (i = 0; i < priceBlocks.length; i++) {
    console.log(priceBlocks[i].id + ", " + priceBlocks[i].getAttribute("name") + "<br>");
}

Demo here: http://jsfiddle.net/jfriend00/F6D8d/

This will leave you with all the DOM traversal functions for these elements rather than using (the somewhat brittle) regular expressions on HTML.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow