JavaScript Regex Subgroups
28-10-2019
Question
First off, don't link to the "Don't parse HTML with Regex" post :)
I've got the following HTML, which is used to display prices in various currencies, inc and ex tax:
<span id="price_break_12345" name="1">
<span class="price">
<span class="inc" >
<span class="GBP">£25.00</span>
<span class="USD" style="display:none;">$34.31</span>
<span class="EUR" style="display:none;">27.92 €</span>
</span>
<span class="ex" style="display:none;">
<span class="GBP">£20.83</span>
<span class="USD" style="display:none;">$34.31</span>
<span class="EUR" style="display:none;">23.27 €</span>
</span>
</span>
<span style="display:none" class="raw_price">25.000</span>
</span>
An AJAX call returns a single string of HTML, containing multiple copies of the above HTML, with the prices varying. What I'm trying to match with regex is:
- Each block of the above HTML (as mentioned, it occurs multiple times in the return string)
- The value of the name attribute on the outermost span
What I have so far is this:
var price_regex = new RegExp(/(<span([\s\S]*?)><span([\s\S]*?)>([\s\S]*?)<\/span><\/span\>)/gm);
console && console.log(price_regex.exec(product_price));
It matches the first price break once for each price break that occurs (so if there's name=1, name=5 and name=15, it matches name=1 three times).
Whereabouts am I going wrong?
Solution 2
Thanks in large part to jfriend for making me realise why my regex was matching in a strange way (the fix being to loop with while (price_break = regex.exec(string)) instead of just exec'ing it once), I've got it working:
var price_regex = new RegExp(/<span[\s\S]*?name="([0-9]+)"[\s\S]*?><span[\s\S]*?>[\s\S]*?<\/span><\/span\>/gm);
var price_break;
while (price_break = price_regex.exec(strProductPrice))
{
console && console.log(price_break);
}
I had a ton of useless () groups which were just clogging up the result set, so stripping them out made things a lot simpler.
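To illustrate how extra parentheses inflate the result array, here is a minimal sketch on a made-up one-attribute string (not the real price HTML). It assumes nothing beyond standard RegExp behaviour: every capturing () adds one slot to the array exec returns, while a non-capturing (?:) group still groups but adds nothing.

```javascript
// Made-up input for illustration only.
var text = 'name="15"';

// Three capturing groups: the result holds the full match plus three slots.
var capturing = /(name)=("(\d+)")/.exec(text);

// Same grouping, but only the digits are captured.
var lean = /(?:name)=(?:"(\d+)")/.exec(text);

console.log(capturing.length); // 4: full match + 3 groups
console.log(lean.length);      // 2: full match + 1 group
console.log(lean[1]);          // "15"
```

If a group is only needed for grouping or alternation, writing it as (?:...) keeps the indexes of the remaining captures stable.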
The other thing, as mentioned above, was that originally I was just doing
price_break = price_regex.exec(strProductPrice)
which runs the regex once and returns the first match only (which I mistook for returning 3 copies of the first match, due to the ()s). Looping keeps evaluating the regex until all the matches have been exhausted, which I assumed it did normally, similar to PHP's preg_match_all.
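The behaviour described above comes from the lastIndex property of a regex with the g flag: each exec call starts searching at lastIndex and moves it past the match it found. A rough sketch, using two simplified blocks as a stand-in for the real response (the sample string and the slightly simplified pattern are assumptions for illustration):

```javascript
// Simplified, made-up stand-in for the AJAX response.
var sample =
  '<span id="price_break_1" name="1"><span class="price">25.00</span></span>' +
  '<span id="price_break_2" name="5"><span class="price">20.00</span></span>';

var re = /<span[^>]*name="(\d+)"[^>]*>[\s\S]*?<\/span><\/span>/g;

// One call returns one match: [full match, group 1].
var first = re.exec(sample);
var afterFirst = re.lastIndex; // now points just past the first block

// The next call resumes from lastIndex and finds the second block.
var second = re.exec(sample);

// With nothing left to match, exec returns null and resets lastIndex to 0.
var third = re.exec(sample);

console.log(first[1], second[1], third); // "1" "5" null
```

This is why the while loop terminates: the null returned after the last match is falsy, and the regex is left ready to be reused from the start.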
OTHER TIPS
So, if you can count on the format of that first span in each block like this:
<span id="price_break_12345" name="1">
Then you could use code like this to cycle through all the matches. It matches the price_break_xxxx id in that first span and captures the value of the name attribute that follows it:
var re = /id="price_break_\d+"\s+name="([^"]+)"/gm;
var match;
while (match = re.exec(str)) {
console.log(match[1]);
}
You can see it work here: http://jsfiddle.net/jfriend00/G39ne/.
I used a converter to turn three of your blocks of HTML into a single JavaScript string (to simulate what you get back from your Ajax call) so I could run the code on it.
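For a version that runs without the fiddle, here is a small sketch along the same lines. The helper name extractPriceBreaks and the three stripped-down sample blocks are hypothetical, and the pattern adds a second capture for the numeric part of the id, which the code above does not do.

```javascript
// Hypothetical helper: collects { id, name } pairs from the outermost spans.
function extractPriceBreaks(html) {
  // A fresh regex per call, so lastIndex always starts at 0.
  var re = /id="price_break_(\d+)"\s+name="([^"]+)"/g;
  var results = [];
  var match;
  while ((match = re.exec(html)) !== null) {
    results.push({ id: match[1], name: match[2] });
  }
  return results;
}

// Made-up sample standing in for the string returned by the Ajax call.
var sampleHtml =
  '<span id="price_break_12345" name="1"><span class="price"></span></span>' +
  '<span id="price_break_12346" name="5"><span class="price"></span></span>' +
  '<span id="price_break_12347" name="15"><span class="price"></span></span>';

var breaks = extractPriceBreaks(sampleHtml);
console.log(breaks.map(function (b) { return b.name; }).join(",")); // 1,5,15
```

Like the original, this only works as long as id and name stay adjacent and in that order in the first span.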
A more robust way to do this is to let the browser's HTML parser do all the work for you. Assuming you have the HTML in a string variable named str, you can use the browser's parser like this:
// Returns the direct children of parent that are element nodes.
function getElementChildren(parent) {
    var elements = [];
    var children = parent.childNodes;
    for (var i = 0, len = children.length; i < len; i++) {
        // collect element nodes only (nodeType 1), skipping text and comments
        if (children[i].nodeType === 1) {
            elements.push(children[i]);
        }
    }
    return elements;
}
// Let the browser parse the string inside a detached div.
var div = document.createElement("div");
div.innerHTML = str;
var priceBlocks = getElementChildren(div);
for (var i = 0; i < priceBlocks.length; i++) {
    console.log(priceBlocks[i].id + ", " + priceBlocks[i].getAttribute("name"));
}
Demo here: http://jsfiddle.net/jfriend00/F6D8d/
This will leave you with all the DOM traversal functions for these elements rather than using (the somewhat brittle) regular expressions on HTML.