QRegExp for HTML Image Tags

https://stackoverflow.com/questions/17689906

03-06-2022

Question

First off, I understand that it is a bad idea to use regexes for HTML. I'm only using one to grab <img> tag info, so I don't care about nesting, etc.

That being said, I'm trying to get the src URLs for all images in a web page. However, I seem to only be getting the first result. Is it my regex, or is it the way I'm using it? My regex skills are a bit rusty, so I might be missing something obvious.

QRegExp imgTagRegex("(<img.*>)+", Qt::CaseInsensitive); //Grab the entire <img> tag
imgTagRegex.setMinimal(true);
imgTagRegex.indexIn(pDocument);
QStringList imgTagList = imgTagRegex.capturedTexts();
imgTagList.removeFirst();   //the first is always the total captured text

foreach (QString imgTag, imgTagList) //now we want to get the source URL
{
    QRegExp urlRegex("src=\"(.*)\"", Qt::CaseInsensitive);
    urlRegex.setMinimal(true);
    urlRegex.indexIn(imgTag);
    QStringList resultList = urlRegex.capturedTexts();
    resultList.removeFirst();
    imageUrls.append(resultList.first());
}

By the time I hit the foreach loop, imgTagList contains only one string. For the "Cats in Ancient Egypt" Wikipedia page, it contains:

<img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/13/Egypte_louvre_058.jpg/220px-Egypte_louvre_058.jpg" width="220" height="407" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/13/Egypte_louvre_058.jpg/330px-Egypte_louvre_058.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/1/13/Egypte_louvre_058.jpg 2x" />

That is what I want, but I know there are more image tags on the page. Any ideas why I'm only getting the first one back?

Update

With help from Sebastian Lange, I was able to get this far:

QRegExp imgTagRegex("<img.*src=\"(.*)\".*>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlMatches;
QStringList imgMatches;
int offset = 0;
while(offset >= 0)
{
    offset = imgTagRegex.indexIn(pDocument, offset);
    offset += imgTagRegex.matchedLength();

    QString imgTag = imgTagRegex.cap(0);
    if (!imgTag.isEmpty())
        imgMatches.append(imgTag); // Should hold complete img tag

    QString url = imgTagRegex.cap(1);
    if (!url.isEmpty())
    {
        url = url.split("\"").first(); //ehhh....
        if (!urlMatches.contains(url))
            urlMatches.append(url); // Should hold only src property
    }
}

The split at the end is a hacky way of getting rid of the non-src attributes in the <img> tag, since cap(1) apparently does not return just the data inside the src="..." segment. It works, but only as a workaround, because I couldn't get the proper approach working. I also added some stuff to standardize the

Solution

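A single call to QRegExp::indexIn() finds only one match, and capturedTexts() returns the capture groups of that one match, not every match in the document. (One regex can contain more than one capture group, which is why it returns a list.) To collect every tag, call indexIn() in a loop and advance the offset past each match, something like: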

QRegExp imgTagRegex("\\<img[^\\>]*src\\s*=\\s*\"([^\"]*)\"[^\\>]*\\>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlmatches;
QStringList imgmatches;
int offset = 0;
while( (offset = imgTagRegex.indexIn(pDocument, offset)) != -1){
    offset += imgTagRegex.matchedLength();
    imgmatches.append(imgTagRegex.cap(0)); // Should hold complete img tag
    urlmatches.append(imgTagRegex.cap(1)); // Should hold only src property
}

EDIT: changed the capture expression to "\\<img[^\\>]*src=\"([^\"]*)\"[^\\>]*\\>"

EDIT2: allowed optional whitespace around the = in the src attribute: "\\<img[^\\>]*src\\s*=\\s*\"([^\"]*)\"[^\\>]*\\>"
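
For anyone who wants to try the loop-over-every-match idea without Qt, here is a minimal sketch using the standard library's std::regex in place of QRegExp. It is a swapped-in technique, not the code from the answer: extractImgSrcs is a hypothetical helper name, the lazy [^>]*? is assumed to stand in for setMinimal(true), and the backslashes before < and > are dropped because they are not needed.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original answer): returns the src
// value of every <img ...> tag in `html`, in document order.
std::vector<std::string> extractImgSrcs(const std::string& html)
{
    // Same structure as the answer's pattern. The lazy [^>]*? is an
    // assumption standing in for QRegExp::setMinimal(true).
    static const std::regex imgTag(
        R"re(<img[^>]*?src\s*=\s*"([^"]*)"[^>]*>)re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> urls;
    // sregex_iterator resumes the search after each match, which replaces
    // the manual offset bookkeeping of the QRegExp loop.
    for (std::sregex_iterator it(html.begin(), html.end(), imgTag), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str()); // capture group 1: the src value
    }
    return urls;
}
```

Calling extractImgSrcs on the page source returns one entry per <img> tag that has a double-quoted src attribute. Tags without src, or with single-quoted values, are skipped, just as with the QRegExp pattern.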

Licensed under: CC-BY-SA with attribution
