How to use a regex to test if a string contains an HTML page

https://stackoverflow.com/questions/20976251

25-09-2022

Question

I have a script which submits a POST request via cURL to an external site and expects to receive a file in response. However, on an error the site returns an HTML error page instead of the expected file.

I have the response stored in a string and would like to check whether the string contains an HTML page. If it does not, we can assume the string contains the requested file data.

I am having trouble creating a regex to test if the string is an HTML page. I would like to test the following:

The data has a leading opening HTML tag: <\s*html.*>
The data has a subsequent opening body tag: <\s*body.*>
The data has a subsequent closing body tag: <\/\s*body.*>
The data has a subsequent closing HTML tag: <\/\s*html.*>

I tried the following:

function isHTMLPage($data) {
  $html_file_regex = '/<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*.<\/\s*html.*>/';
  return preg_match($html_file_regex, strtolower($data)) === 1;
}

The function returns false (fails to match) on the following test data:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test Page</title>
</head>
<body>
<div>test Content</div>
</body>
</html>

What is wrong with my regex?

/<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*.<\/\s*html.*>/

Solution

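The dot (.) does not match newlines unless you use the "dotall" modifier: s. Your test data spans several lines, so the .* between the tags cannot cross the line breaks and the match fails.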
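To see the effect in isolation, here is a minimal sketch. The subject string and variable names are made up for illustration; only preg_match and the s modifier come from the answer.

```php
<?php
// "a", a newline, then "b": the dot has to match the newline for the pattern to succeed.
$subject = "a\nb";

// Without the s modifier, "." refuses to match the newline.
$withoutDotall = preg_match('/a.b/', $subject);

// With the s modifier, "." matches any character, newline included.
$withDotall = preg_match('/a.b/s', $subject);

var_dump($withoutDotall); // int(0)
var_dump($withDotall);    // int(1)
```

The same thing happens to every .* in the question's pattern: each one stops at the end of its line.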

That said, you shouldn't be doing this. Check the HTTP status code of the response instead: a code such as 404 indicates that the file wasn't found. After all, what if the file you are expecting to get is an HTML file itself?
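A rough sketch of the status-code check with PHP's cURL extension might look like the following. The helper names isSuccessStatus and fetchFile are hypothetical, and the sketch assumes the external site actually sends a non-2xx status along with its HTML error page, which the question does not confirm.

```php
<?php
// Hypothetical helper: treat any 2xx status as "the file was returned".
function isSuccessStatus(int $status): bool
{
    return $status >= 200 && $status < 300;
}

// Hypothetical helper: POST $fields to $url and return the response body,
// or null when the transfer fails or the status is not 2xx.
function fetchFile(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $fields,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    ]);

    $body   = curl_exec($ch);                        // false on transport failure
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE); // status of the last response

    if ($body === false || !isSuccessStatus($status)) {
        return null;
    }
    return $body;
}
```

If the site turns out to answer with 200 even for its error pages, this check alone will not help and some inspection of the response is still needed.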

OTHER TIPS

Use the s (PCRE_DOTALL) modifier:

$html_file_regex = '/<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*.<\/\s*html.*>/s';

According to the PHP manual, “If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded.”
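Putting it together, the function from the question could be sketched as below and run against the question's test data. Two details are my own changes rather than part of the answer: the i modifier replaces the strtolower() call, and the extra . before the closing html tag is dropped, since it would require at least one character between </body> and </html>.

```php
<?php
function isHTMLPage(string $data): bool
{
    // s: "." also matches newlines; i: case-insensitive match (assumption, replaces strtolower()).
    $pattern = '/<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*<\/\s*html.*>/si';
    return preg_match($pattern, $data) === 1;
}

// The multi-line test page from the question (nowdoc, so nothing is interpolated).
$page = <<<'HTML'
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test Page</title>
</head>
<body>
<div>test Content</div>
</body>
</html>
HTML;

var_dump(isHTMLPage($page));             // bool(true)
var_dump(isHTMLPage("%PDF-1.4 binary")); // bool(false)
```

Note that the pattern chains several greedy .* groups, so on a very large response preg_match may hit the PCRE backtrack limit and return false; the === 1 comparison then reports "not HTML" rather than raising an error.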

Licensed under: CC-BY-SA with attribution
