Question

I am scraping the following markup from a website. Unfortunately, the content contains font tags and may contain other inline formatting in the future. I want to remove those tags but keep the text inside them. I'm using phpQuery, but a PHP-only solution would also work.

<p>
    <font color="#cc0000">
        <font color="#000000">Content</font>
    </font>
</p>
<p>Content</p>
<p>
    <font color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font color="#000000">Content I wish to keep but font should be removed</font>
</p>
<p>Content</p>
</div>

Solution

Use strip_tags():

strip_tags($str, '<p><div>');

This call returns a copy of $str with every tag removed except <p> and <div>; it does not modify $str itself, so assign or echo the result. To keep more tags, add them to the second argument.

Example from php.net

 <?php
 $text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
 echo strip_tags($text);
 echo "\n";

 // Allow <p> and <a>
 echo strip_tags($text, '<p><a>');
 ?>

The above example will output:

Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>
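
Applied to markup like the one in the question, the same call might look like the sketch below. The variable names $html and $clean are assumptions for illustration, and the sample string is a shortened version of the question's markup. Note that strip_tags() also drops any other inline tags (for example <b>) unless they are listed in the second argument, and that it leaves the attributes of allowed tags untouched.

```php
<?php
// Shortened sample modeled on the question's markup ($html is a hypothetical name).
$html = '<p><font color="#cc0000"><font color="#000000">Content</font></font></p>'
      . '<p><font color="#cc0000">Content I wish to keep</font></p>';

// Keep only <p> and <div>; every other tag, including nested <font> tags,
// is removed while the text inside the removed tags is preserved.
$clean = strip_tags($html, '<p><div>');

echo $clean; // <p>Content</p><p>Content I wish to keep</p>
```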

OTHER TIPS

Assuming the content is in a variable:

$content = strip_tags( $str, '<p><div>' );

The phpQuery way might look like this:

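// $dom is assumed to be a phpQuery document holding the scraped markup.
while ($font = $dom->find('font')->eq(0)) {
  // eq(0) returns a phpQuery object even when nothing matched, so the
  // loop condition is always truthy; exit once no <font> elements remain.
  if (0 === $font->size()) break;
  // Replace the <font> element with its text content (nested tags are dropped).
  $font->replaceWith($font->text());
}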
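
Since the question also allows a PHP-only solution, here is a minimal sketch using PHP's bundled DOM extension instead of phpQuery. Unlike strip_tags(), it removes only the <font> elements and keeps any other markup nested inside them. The function name unwrapFontTags and the wrapper id "wrap" are hypothetical, chosen for illustration. The sketch assumes the fragment is reasonably well-formed and contains no stray closing </div>; non-ASCII input may also need an encoding hint before it is passed to loadHTML().

```php
<?php
// Hypothetical helper (not from the original answers): removes every <font>
// element but keeps its children, so nested inline markup survives.
function unwrapFontTags(string $html): string
{
    libxml_use_internal_errors(true); // tolerate sloppy scraped markup
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be found again later.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array so the tree can be edited while looping.
    foreach (iterator_to_array($xpath->query('//font')) as $font) {
        while ($font->firstChild !== null) {
            // Move each child in front of the <font> element, preserving order.
            $font->parentNode->insertBefore($font->firstChild, $font);
        }
        $font->parentNode->removeChild($font);
    }

    // Serialize only the children of the wrapper, not the whole document.
    $wrapper = $xpath->query('//div[@id="wrap"]')->item(0);
    $out = '';
    foreach (iterator_to_array($wrapper->childNodes) as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage on a shortened sample modeled on the question's markup.
$html = '<p><font color="#cc0000"><font color="#000000">Content</font></font></p>'
      . '<p><font color="#cc0000">Keep <b>this</b></font></p>';
$result = unwrapFontTags($html);
echo $result;
```

The serializer may insert line breaks between block elements, so compare results by content rather than byte for byte.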
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow