Extracting the body text of an HTML document using PHP

https://stackoverflow.com/questions/4910975

29-10-2019

Question

I know it is better to use DOM for this purpose, but let's try to extract the text this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

The result can be seen here: http://ideone.com/vh2fz

As you can see, I get more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you all very much. It was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

or, equivalently:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
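To make the arithmetic concrete, here is a minimal sketch of the same approach with the corrected length. The function name extract_body and the variable names are my own and are not from the code above. It assumes the sample document uses plain "\n" line endings, in which case the opening tag sits at offset 22 and is 6 characters long, and '</body>' sits at offset 46. The correct length is then 46 - (22 + 6) = 18, whereas the original expression gives 46 - 22 + 6 = 30, which is too long by twice the tag length.

```php
<?php
// Hypothetical helper (not part of the original post): returns everything
// between the opening <body ...> tag and the next </body>, or null if either
// tag cannot be found.
function extract_body(string $html): ?string
{
    // Same regex as in the question; PREG_OFFSET_CAPTURE yields [text, offset].
    if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    $start_tag = $matches[0][0];
    $tag_index = $matches[0][1];

    // The body content begins right after the opening tag.
    $content_start = $tag_index + strlen($start_tag);

    // Search for the closing tag only after the content start.
    $end_index = strpos($html, '</body>', $content_start);
    if ($end_index === false) {
        return null;
    }

    // substr() takes a length, so subtract the whole start position:
    // end - (tag_index + strlen(tag)), not end - tag_index + strlen(tag).
    return substr($html, $content_start, $end_index - $content_start);
}

// Same sample document as above, written with explicit "\n" line endings.
$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";
$body = extract_body($html);
echo $body;
```

Computing the content start once and reusing it for both the offset and the length keeps the two arguments of substr() consistent, which makes this kind of sign mistake harder to make.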

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow