Extracting the body text of an HTML document using PHP

https://stackoverflow.com/questions/4910975

29-10-2019

Question

I know it is better to use DOM for this, but let's try to extract the text this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

if (empty($matches))
    exit;

$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];

$index_of_body_end_tag = strpos($html, '</body>');

$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);

echo $body;

The result can be seen here: http://ideone.com/vh2fz

As you can see, I get more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I use:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could someone kindly point out where the problem is?

Many thanks to you all.

EDIT:

Thank you all very much. The bug was in my brain. After reading your answers, I now understand the problem. The length should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
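To make the arithmetic concrete: assuming the sample document uses single-byte "\n" line endings, the opening <body> tag starts at offset 22 and is 6 bytes long, so the content starts at offset 28, and </body> sits at offset 46. The original formula gives 46 - 22 + 6 = 30, while the corrected one gives 46 - (22 + 6) = 18. A minimal sketch of the corrected extraction might look like the following. The function name extract_body is a hypothetical helper introduced here for illustration and is not part of the original post.

```php
<?php

// Hypothetical helper (not from the original post): returns the text between
// the opening <body ...> tag and the closing </body> tag, or null if either
// tag is missing.
function extract_body(string $html): ?string
{
    // Locate the opening tag and capture its byte offset.
    if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    // Offset of the first byte after the opening tag.
    $content_start = $matches[0][1] + strlen($matches[0][0]);

    // Offset of the closing tag, searched from the content start onwards.
    $content_end = strpos($html, '</body>', $content_start);
    if ($content_end === false) {
        return null;
    }

    // substr() takes a length, so subtract the start offset from the end offset.
    return substr($html, $content_start, $content_end - $content_start);
}

$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

echo extract_body($html);
```

Computing the content start once and reusing it for both the start and length arguments avoids repeating the offset arithmetic, which is where the sign error crept in.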

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow