Extracting the body text of an HTML document using PHP
29-10-2019
Question
I know it's better to use the DOM for this purpose, but let's try to extract the text this way:
<?php
$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;
preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
if (empty($matches))
    exit;
$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];
$index_of_body_end_tag = strpos($html, '</body>');
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);
echo $body;
The result can be seen here: http://ideone.com/vH2FZ
As you can see, I am getting more text than expected.
There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
I don't see anything wrong with this formula.
Could somebody kindly suggest where the problem is?
Many thanks to you all.
EDIT:
Thank you very much to all of you. The bug was in my own reasoning. After reading your answers, I now understand the problem: the length should be either:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));
Or:
$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
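For anyone landing here later, this is a minimal sketch of the whole extraction with the corrected length. The original expression adds the length of the opening tag instead of subtracting it, so it overshoots by twice that length. The function name extract_body and the variable $content_start are my own, not part of the code above, and the ?string return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (name is illustrative, not from the original post).
// Returns everything between the opening <body ...> tag and the next
// </body>, or null when either tag cannot be found.
function extract_body(string $html): ?string
{
    // Same pattern as in the question; PREG_OFFSET_CAPTURE also yields the offset.
    if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    // Offset of the first character after the opening tag.
    $content_start = $matches[0][1] + strlen($matches[0][0]);

    // Search for the closing tag only after the opening tag.
    $index_of_body_end_tag = strpos($html, '</body>', $content_start);
    if ($index_of_body_end_tag === false) {
        return null;
    }

    // Length is the end offset minus the offset where the content starts,
    // i.e. end - (start + strlen(tag)), not end - start + strlen(tag).
    return substr($html, $content_start, $index_of_body_end_tag - $content_start);
}

$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";
echo extract_body($html);
```

Computing the content start once and reusing it for both the $start and $length arguments makes it harder to get the sign wrong a second time.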
No correct solution