Question

I am using simple_html_dom to parse a website. Is there a way to extract the doctype?

Solution

You can use the file_get_contents function to fetch the page's raw HTML and then pull the doctype out with a regular expression. For example:

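<?php
   // Fetch the raw HTML; file_get_contents returns false on failure.
   $html = file_get_contents("http://google.com");
   $doctype = null;
   // Match the first <!DOCTYPE ...> declaration, case-insensitively.
   // [^>]* also spans newlines, so no newline stripping is needed.
   if ($html !== false && preg_match('/<!DOCTYPE[^>]*>/i', $html, $matches)) {
       $doctype = $matches[0];
   }
?>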
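
To try the pattern without a network request, here is a minimal sketch that wraps the same idea in a function and runs it on in-memory strings. The name extract_doctype and the sample documents are made up for illustration, and the nullable return type assumes PHP 7.1 or later. It also does not handle a doctype with an internal subset that contains '>'.

```php
<?php
// Hypothetical helper: returns the first doctype declaration found in
// $html, or null when there is none.
function extract_doctype(string $html): ?string
{
    // [^>]* also matches newlines, so multi-line doctypes are handled
    // without stripping line breaks first.
    if (preg_match('/<!DOCTYPE[^>]*>/i', $html, $matches)) {
        return $matches[0];
    }
    return null;
}

// Sample inputs (illustrative only).
$html5 = "<!DOCTYPE html>\n<html><head></head><body></body></html>";
$xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
       . "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html></html>";

var_dump(extract_doctype($html5));
var_dump(extract_doctype($xhtml));
var_dump(extract_doctype("<html></html>"));
```

Returning null rather than an empty string lets the caller tell "no doctype found" apart from any matched value.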

OTHER TIPS

You can use $html->find('unknown'). This works in version 1.11 of the simplehtmldom library, at least. I use it as follows:

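// Returns the doctype node of a parsed document, or NULL if none is found.
function get_doctype($doc)
{
    // The doctype is found by searching for elements tagged 'unknown'.
    $els = $doc->find('unknown');

    foreach ($els as $el) {
        // Only accept 'unknown' elements that sit directly under the root.
        if ($el->parent()->tag == 'root') {
            return $el;
        }
    }

    return NULL;
}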

The parent check is just there to skip any other 'unknown' elements that might be found deeper in the document; I'm assuming the first one directly under the root will be the doctype. If you want to be sure, you can explicitly inspect ->innertext and check that it starts with '!DOCTYPE '.
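
The explicit ->innertext check mentioned above could be sketched as a small string helper. The name looks_like_doctype is hypothetical. Whether the text comes back with a leading '<' is an assumption that may vary between library versions, so the helper tolerates both forms.

```php
<?php
// Hypothetical helper: true when $text looks like a doctype declaration.
// Accepts the text with or without the leading '<' (an assumption, since
// the exact form returned by ->innertext may differ between versions).
function looks_like_doctype(string $text): bool
{
    $text = ltrim($text);        // drop leading whitespace
    $text = ltrim($text, '<');   // drop a leading '<' if present
    // Case-insensitive prefix comparison against '!DOCTYPE'.
    return strncasecmp($text, '!DOCTYPE', 8) === 0;
}

var_dump(looks_like_doctype('<!DOCTYPE html>'));
var_dump(looks_like_doctype('<?xml version="1.0"?>'));
```

Inside get_doctype, the condition could then become $el->parent()->tag == 'root' && looks_like_doctype($el->innertext).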

Licensed under: CC-BY-SA with attribution