Extracting the body text of an HTML document using PHP

https://stackoverflow.com/questions/4910975

29-10-2019

Question

I know it's better to use DOM for this purpose, but let's try to extract the text this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

if (empty($matches))
    exit;

$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];

$index_of_body_end_tag = strpos($html, '</body>');

$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);

echo $body;

The result can be seen here: http://ideone.com/vh2fz

As you can see, I am getting more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could anyone suggest where the problem is?

Many thanks to you all.

EDIT:

Many thanks to all of you. It was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either of these two equivalent expressions:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
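For reference, the original expression is parsed as (end - start) + strlen(tag), because - and + have the same precedence and associate left to right, so the tag length is added instead of subtracted. Assuming the sample document uses \n line endings, the opening tag sits at offset 22 and is 6 characters long, and </body> sits at offset 46: the original formula yields 46 - 22 + 6 = 30, while the corrected one yields 46 - (22 + 6) = 18. A minimal sketch of the corrected extraction follows; the function name extract_body and the variable $content_start are hypothetical and not part of the original post.

```php
<?php

// Hypothetical helper (not from the original post): returns the text between
// the opening <body ...> tag and the next </body>, or null if either is missing.
function extract_body(string $html): ?string
{
    if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    $matched_body_start_tag = $matches[0][0];
    $index_of_body_start_tag = $matches[0][1];

    // Offset of the first character after the opening tag.
    $content_start = $index_of_body_start_tag + strlen($matched_body_start_tag);

    // Search for the closing tag only after the opening tag.
    $index_of_body_end_tag = strpos($html, '</body>', $content_start);
    if ($index_of_body_end_tag === false) {
        return null;
    }

    // Length is the end offset minus the offset where the content starts.
    return substr($html, $content_start, $index_of_body_end_tag - $content_start);
}

$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

echo extract_body($html);
```

Computing the content start once and reusing it for both the $start and $length arguments avoids repeating the sum, which is where the sign slipped in the original.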

No accepted answer

Licensed under: CC-BY-SA with attribution
