
I have a script that submits a POST request via cURL to an external site and expects to receive a file in response. On an error, however, the site returns an HTML error page instead of the expected file.

I have the response stored in a string, and I would like to check whether the string contains an HTML page. If it does not, we can assume the string contains the requested file data.

I am having trouble creating a regex to test if the string is an HTML page. I would like to test the following:

  • The data has a leading opening HTML tag: <\s*html.*>

  • The data has a subsequent opening body tag: <\s*body.*>

  • The data has a subsequent closing body tag: <\/\s*body.*>

  • The data has a subsequent closing HTML tag: <\/\s*html.*>

I tried the following:

function isHTMLPage($data) {
  $html_file_regex = '/<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*.<\/\s*html.*>/';
  return preg_match($html_file_regex, strtolower($data)) === 1;
}

The function returns false (fails to match) on the following test data:

<!DOCTYPE html>
<html xmlns="">
<title>Test Page</title>
<div>test Content</div>

What is wrong with my regex?




The dot (.) does not match newlines unless you add the "dotall" modifier, s, after the pattern's closing delimiter.

That said, you shouldn't be doing this. What you should do instead is check the HTTP status code of the response; a 404, for example, indicates that the file wasn't found. After all, what if the file you are expecting to get is an HTML file itself?
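As a rough sketch of the status-code approach, the request code could read the HTTP status from the cURL handle and reject anything outside the 2xx range. This assumes the external site really does send a non-2xx status with its error page, which the question does not confirm. The names isSuccessStatus and fetchFile, and the parameters $url and $postFields, are hypothetical.

```php
<?php
// Hypothetical helper: treat any 2xx status as success.
function isSuccessStatus(int $statusCode): bool
{
    return $statusCode >= 200 && $statusCode < 300;
}

// Hypothetical wrapper around the POST request; $url and $postFields
// are placeholders for whatever the real script sends.
function fetchFile(string $url, array $postFields): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postFields,
        CURLOPT_RETURNTRANSFER => true, // return the body as a string
    ]);
    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // curl_exec() returns false on a transport error (DNS failure, timeout, ...).
    if ($body === false || !isSuccessStatus($status)) {
        return null;
    }
    return $body;
}
```

With this shape, the caller only needs to check for null and never has to inspect the body to guess whether it is an error page.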


Use the s (PCRE_DOTALL) modifier:

$html_file_regex = '/<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*.<\/\s*html.*>/s';

According to the PHP manual, “If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded.”
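To see the effect of the modifier, here is a small sketch that runs the same pattern with and without s against a multi-line page. The sample page in $page is a made-up example with each tag on its own line, not the asker's exact data.

```php
<?php
// Hypothetical sample page; every tag sits on its own line.
$page = "<!DOCTYPE html>\n<html xmlns=\"\">\n<body>\n<div>test content</div>\n</body>\n</html>";

// The pattern from the question, without delimiters or modifiers.
$pattern = '<\s*html.*>.*<\s*body.*>.*<\/\s*body.*>.*.<\/\s*html.*>';

// Without s, the dot stops at each newline, so the tags can never be joined.
$withoutDotall = preg_match('/' . $pattern . '/', $page);

// With s, the dot also matches newlines and the pattern spans the whole page.
$withDotall = preg_match('/' . $pattern . '/s', $page);

var_dump($withoutDotall); // int(0)
var_dump($withDotall);    // int(1)
```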

Licensed under: CC-BY-SA with attribution