Question

I wrote a regular expression to grab all href links from HTML:

QRegExp bodylinksrx("(<a\\s+href\\s*=\\s*[^<>]*\\s*>[^<>]*</a>)");
QStringList bodylinks;
int pos = 0;
while ((pos = bodylinksrx.indexIn(htmlcode, pos)) != -1)
{
    bodylinks << bodylinksrx.cap(1);
    pos += bodylinksrx.matchedLength();
}

I receive this list as the result:

("<a href="http://somehref" class="someclass">href text...</a>", "<a href="http://somehref" class="someclass">href text...</a>", ......")

But I need the list to contain only the URLs and link texts: "http://somehref" "href text..." "http://somehref" "href text..." ....

Solution

First off, have you read this? Secondly, if you are sure you know what you are doing and definitely want to do it this way, try using lookahead and lookbehind assertions for your anchor tags:

((?<=<a\\s+href\\s*=\\s*[^<>]*\\s*>)[^<>]*(?=</a>))

EDIT: Unfortunately this will not work (at least with Qt 4.8), as QRegExp does not support lookbehind assertions. You could instead iterate through the list you already created and match the desired part of each entry with:

[^<>]*(?=<)

and use that. Alternatively, wrap the part you want in parentheses and extract it with the captured-texts functions (cap() or capturedTexts()), like so:

<a\\s+href\\s*=\\s*[^<>]*\\s*>([^<>]*)</a>
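
Since the question asks for both the URL and the link text, the capture-group idea can be extended to two groups. The following is only a minimal sketch, not the original poster's code: it uses the standard library's std::regex as a stand-in for QRegExp so that it compiles without Qt, the helper name extractLinks is hypothetical, and it assumes href is the first attribute and its value is double-quoted. With QRegExp the same pattern should expose the two groups through cap(1) and cap(2) inside the existing indexIn() loop.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): returns (href, text) pairs.
// Assumes href is the first attribute of <a> and its value is double-quoted.
std::vector<std::pair<std::string, std::string>> extractLinks(const std::string& html)
{
    // Group 1: the quoted href value. Group 2: the text between <a ...> and </a>.
    static const std::regex linkRx(
        "<a\\s+href\\s*=\\s*\"([^\"<>]*)\"[^<>]*>([^<>]*)</a>");

    std::vector<std::pair<std::string, std::string>> links;
    // Walk over every non-overlapping match in the input.
    for (std::sregex_iterator it(html.begin(), html.end(), linkRx), end; it != end; ++it) {
        links.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return links;
}
```

Calling extractLinks on the sample markup from the question would give one pair per anchor, with the URL in first and the link text in second. Anchors whose text contains nested tags, or whose href is unquoted or single-quoted, are not matched by this pattern.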
Licensed under: CC-BY-SA with attribution