First off, I understand that it is a bad idea to use regexes for HTML. I'm only using one to grab <img> tag info, so I don't care about nesting, etc.
That being said, I'm trying to get the src URLs for all images in a web page. However, I seem to only be getting the first result. Is it my regex, or is it the way I'm using it? My regex skills are a bit rusty, so I might be missing something obvious.
QRegExp imgTagRegex("(<img.*>)+", Qt::CaseInsensitive); //Grab the entire <img> tag
imgTagRegex.setMinimal(true);
imgTagRegex.indexIn(pDocument);
QStringList imgTagList = imgTagRegex.capturedTexts();
imgTagList.removeFirst(); //the first is always the total captured text
foreach (QString imgTag, imgTagList) //now we want to get the source URL
{
    QRegExp urlRegex("src=\"(.*)\"", Qt::CaseInsensitive);
    urlRegex.setMinimal(true);
    urlRegex.indexIn(imgTag);
    QStringList resultList = urlRegex.capturedTexts();
    resultList.removeFirst();
    imageUrls.append(resultList.first());
}
By the time I hit the foreach loop, imgTagList contains only 1 string. For the "Cats in Ancient Egypt" Wikipedia page, it contains:
<img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/13/Egypte_louvre_058.jpg/220px-Egypte_louvre_058.jpg" width="220" height="407" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/13/Egypte_louvre_058.jpg/330px-Egypte_louvre_058.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/1/13/Egypte_louvre_058.jpg 2x" />
Which is what I want, but I know there are more image tags on the page. Any ideas why I'm only getting the first one back?
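A likely explanation, offered as a sketch rather than a definitive diagnosis: a single indexIn() call locates one match, and capturedTexts() lists the whole match plus the capture groups of that one match, so the list never grows beyond the first tag. Collecting every tag means searching repeatedly. The example below shows that idea with std::regex from the C++ standard library instead of QRegExp, purely so that it is self-contained. findImgTags is a hypothetical helper name, and the [^>]* pattern assumes no attribute value contains a literal '>'.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every <img ...> tag found in html.
std::vector<std::string> findImgTags(const std::string& html)
{
    // [^>]* cannot run past the tag's closing '>', so no lazy quantifier is needed.
    static const std::regex imgTag("<img[^>]*>", std::regex::icase);

    std::vector<std::string> tags;
    // sregex_iterator resumes the search after each match until none remain.
    for (std::sregex_iterator it(html.begin(), html.end(), imgTag), end; it != end; ++it)
        tags.push_back(it->str());
    return tags;
}
```

With QRegExp, the equivalent would presumably be a loop that calls indexIn() with an advancing offset until it returns -1.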
Update
With help from Sebastian Lange, I was able to get this far:
QRegExp imgTagRegex("<img.*src=\"(.*)\".*>", Qt::CaseInsensitive);
imgTagRegex.setMinimal(true);
QStringList urlMatches;
QStringList imgMatches;
int offset = 0;
while (offset >= 0)
{
    offset = imgTagRegex.indexIn(pDocument, offset);
    offset += imgTagRegex.matchedLength();
    QString imgTag = imgTagRegex.cap(0);
    if (!imgTag.isEmpty())
        imgMatches.append(imgTag); // Should hold complete img tag
    QString url = imgTagRegex.cap(1);
    if (!url.isEmpty())
    {
        url = url.split("\"").first(); //ehhh....
        if (!urlMatches.contains(url))
            urlMatches.append(url); // Should hold only src property
    }
}
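As a possible alternative to the split() step in the loop above, the capture group can be narrowed to [^"]*, which cannot run past the closing quote whether matching is greedy or minimal. The sketch below again uses std::regex rather than QRegExp so that it stands alone. extractImageUrls is a hypothetical name, and the pattern assumes double-quoted src attributes preceded by whitespace (which also keeps srcset and data-src from matching).

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the distinct, non-empty src values of all <img> tags.
std::vector<std::string> extractImageUrls(const std::string& html)
{
    // Group 1 is limited to non-quote characters, so it ends at the closing quote.
    static const std::regex imgSrc(R"re(<img[^>]*\ssrc="([^"]*)")re", std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), imgSrc), end; it != end; ++it) {
        const std::string url = (*it)[1].str();
        // Skip empty values and duplicates, keeping first-seen order.
        if (!url.empty() && std::find(urls.begin(), urls.end(), url) == urls.end())
            urls.push_back(url);
    }
    return urls;
}
```

The same character class should carry over to the QRegExp pattern as src=\"([^\"]*)\", in which case cap(1) would hold only the URL.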
The split at the end is a hacky way of getting rid of the non-src elements in the <img> tag, since it looks like I'm unable to get just the data inside the src="..." segment. It works, but I'm only using it because I couldn't get the proper approach to work. I also added some stuff to standardize the