Question

I am trying to find a specific string that contains a keyword inside a title tag in HTML, e.g.

<title>Bla bla bla String bla bla</title>

I am unsure how to construct that beyond the starting:

\<title\>(Word Keyword)\<\/title\>

I also want to make sure that, if I use any wildcards, the wildcard between the keyword and the closing tag doesn't inadvertently run all the way to the end of another title block later in the HTML.

Lastly I'm trying to find a way to then

  • extract the Word Keyword only, even though I've captured the entire match
  • extract/keep the tag name separately.

This is because I'll have several types of tags to capture from, and I want to extract both the 'Word Keyword' and the tag name it came from. Is this possible? I've looked into named groups but am not sure if/how to extract them after matching, e.g.

(?P<TAG>(\<title\>|\<head\>)(?P<TERM>(Word Keyword))\<\/title\>

Naturally this needs whatever wildcards make the above work, but assuming it does, I'd then want to be able to extract, after matching the string:

  • title
  • Bla Keyword

or

  • head
  • Yada Keyword

Solution

<(title|head).*?>(.*?)<\/\1>


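This regex captures the tag name in its first capture group and the inner HTML of the tag in its second. Because `.*?` is lazy, the match stops at the first matching closing tag and does not run on into a later title block. Still, consider going with XPath or any HTML/XML parser, because regexes cannot reliably parse arbitrary HTML (the well-known "Zalgo" warning).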
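For the parser route, here is a minimal sketch using Python's standard-library `html.parser`. The class name `TagTextCollector`, the helper `find_keyword`, and the default keyword are made up for illustration, and the sketch assumes the tags of interest are not nested inside each other (if they are, the outer one wins).

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for chosen tags whose text contains a keyword."""

    def __init__(self, tags, keyword):
        super().__init__()
        self.tags = set(tags)
        self.keyword = keyword
        self.current = None  # tag currently being read, if any
        self.buffer = []     # text chunks seen inside the current tag
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only if we are not already inside a tag of interest.
        if tag in self.tags and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            text = "".join(self.buffer).strip()
            if self.keyword in text:
                self.found.append((tag, text))
            self.current = None


def find_keyword(html, tags=("title", "head"), keyword="Keyword"):
    """Hypothetical helper: return (tag, text) pairs that contain the keyword."""
    collector = TagTextCollector(tags, keyword)
    collector.feed(html)
    collector.close()
    return collector.found


print(find_keyword("<title>Bla Keyword</title>"))
# [('title', 'Bla Keyword')]
```

The parser reports tag names in lowercase and copes with attributes and line breaks inside the tag, which the regex only handles with extra flags.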

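You need PCRE or another Perl-style regex flavor (Python's re, JavaScript, Java, .NET) to use this expression, because of the non-greedy `.*?` wildcards. POSIX basic and extended regular expressions, e.g. grep without -P, do not support them.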
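Since the question uses Python-style named groups, here is a rough sketch of how the extraction could look with Python's `re` module. It is a variant of the expression above, not the exact same one: `\b[^>]*` replaces `.*?` after the tag name (my assumption, so that `<header>` is not picked up as `<head>`), and `re.DOTALL` is added so the content may span lines. The function name `extract` and the default keyword are illustrative only.

```python
import re

# TAG holds the element name, TERM the text between the tags.
# (?P=TAG) is a backreference: the closing tag must match the opening one.
PATTERN = re.compile(
    r"<(?P<TAG>title|head)\b[^>]*>(?P<TERM>.*?)</(?P=TAG)>",
    re.IGNORECASE | re.DOTALL,
)


def extract(html, keyword="Keyword"):
    """Return (tag, text) pairs for matched elements whose text contains keyword."""
    results = []
    for match in PATTERN.finditer(html):
        term = match.group("TERM").strip()
        if keyword in term:
            # Each named group is read back individually from the match object.
            results.append((match.group("TAG").lower(), term))
    return results


html = "<title>Bla Keyword</title> <title>Other</title> <head>Yada Keyword</head>"
print(extract(html))
# [('title', 'Bla Keyword'), ('head', 'Yada Keyword')]
```

`match.group("TAG")` and `match.group("TERM")` give the two pieces separately even though the whole element was matched; `match.groupdict()` returns both at once as a dictionary.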

Licensed under: CC-BY-SA with attribution