Question

I have users submit some text (including arbitrary HTML image tags), and I am trying to convert the images in that text into basic BBCode [img][/img] tags.

This is how I am currently testing it:

String (taken from a random forum):

After a fair few years of doing the usual lowering, fitting wheels etc,when it comes to car modifying, we spent a couple of years doing Minimoto racing all round the country in the Southern British Minimoto Championship winning the 2006 Production Privateer Championship.<br /> <br /> <img src="http://i2.photobucket.com/albums/y18/moo0484/scan0001.jpg" border="0" class="tcattdimglink" onload="NcodeImageResizer.createOn(this);" alt="" /><br /> <br /> <img src="http://i2.photobucket.com/albums/y18/moo0484/01072007065.jpg" border="0" class="tcattdimglink" onload="NcodeImageResizer.createOn(this);" alt="" /><br />

I then strip the image attributes and convert the image tags to BBCode using a function:

function convert($text) {
  $text = preg_replace('/class=".*?"/', '', $text);
  $text = preg_replace('/alt=".*?"/', '', $text);
  $text = preg_replace('/src="/', '', $text);
  $text = preg_replace('/border=".*?"/', '', $text);
  $text = preg_replace('/onload=".*?"/', '', $text);
  $text = str_replace("<img", "[img]", "$text");
  $text = str_replace('">', "[/img]", "$text");
  return nl2br($text);
}

This works perfectly fine if the tag is not closed with a trailing slash. I could add another rule:

  $text = str_replace('"/>', "[/img]", "$text");

That would work, but the white space left behind where I removed the attributes is still there.

So my question is: can I remove just the white space inside the img tags?

  <img />

For example, in the preg_replace calls the .*? matches the content between the double quotes.

Can I do a similar thing for img tags, removing only the white space inside them?

I obviously can't just run:

  $text = preg_replace('/\s+/', '', $text);

because I need the white space in the rest of the text.

Thanks!

Solution

You should remove any white space and rogue attributes, which means pretty much all attributes, especially the on* event attributes such as onclick and onblur. There are too many ways to inject an XSS attack into HTML, and a hand-rolled filter that tries to strip them all would not be maintainable, so if you want to let users input HTML, use htmlpurifier. It is easy to initialize in your code and has lots of options.

A simple alternative would be to extract the src of each img, use strip_tags() to remove all HTML from the text, rebuild clean img tags that contain only the src, and concatenate them onto the end of the text. The images lose their position in the text, though.
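
A minimal sketch of that simple alternative, for illustration only. The function name textWithImagesAppended and the http(s)-only check are assumptions added here, not part of the original answer, and the pattern assumes double-quoted src attributes like those in the sample text:

```php
<?php
// Hypothetical helper: collect every img src, drop all markup,
// then append clean <img> tags after the text.
function textWithImagesAppended(string $html): string
{
    // Capture the src value of each <img>; assumes src="..." with double quotes.
    preg_match_all('/<img\b[^>]*?\ssrc="([^"]*)"[^>]*>/i', $html, $matches);

    // strip_tags() removes every tag, including the original <img> elements.
    $text = trim(strip_tags($html));

    $images = '';
    foreach ($matches[1] as $src) {
        // Assumption: only http(s) URLs are acceptable; anything else is skipped.
        if (preg_match('~^https?://~i', $src)) {
            $images .= "\n" . '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '">';
        }
    }

    return $text . $images;
}

$sample = 'Some <b>text</b><br /> <img src="http://example.com/a.jpg" border="0" onload="evil();" alt="" /> more text';
echo textWithImagesAppended($sample);
```

Note that strip_tags() removes the tags but keeps the text between them, so the body of a script element survives as plain text and the result should still be escaped on output.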

A DOM-based cleanup that keeps the images in place would look something like:

<?php 
$html = <<<DEMO
After a fair <script>alert('XSS');</script>few ...
winning the 2006 Production Privateer Championship.<br /> 
<div style="background-image: url(javascript:alert('XSS'))"></div>
<br /> 
<img src="http://i2.photobucket.com/albums/y18/moo0484/scan0001.jpg" border="0" class="tcattdimglink" onload="NcodeImageResizer.createOn(this);" alt="" /><br /> 
<br /> 
text here
<img src="http://i2.photobucket.com/albums/y18/moo0484/01072007065.jpg" border="0" class="tcattdimglink" onload="NcodeImageResizer.createOn(this);" alt="" /><br />
more txt here
DEMO;

$dom = new DOMDocument;
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

if (false === ($elements = $xpath->query("//*"))) die('Error');

foreach ($elements as $element) {

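    //remove script tags
    if($element->nodeName=='script'){
        $element->parentNode->removeChild($element);
        continue; //node is detached, nothing more to do with it
    }

    //remove empty tags but not images
    //note: a wrapper whose only content is an <img> has an empty nodeValue, so it is dropped too
    if (!$element->hasChildNodes() || $element->nodeValue == '') {
        if($element->nodeName != 'img'){
            $element->parentNode->removeChild($element);
            continue;
        }
    }

    //remove all attributes except a[href] and img[src] (the URL values themselves are not validated)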
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if (('img' === $element->nodeName && 'src' === $name) || ('a' === $element->nodeName && 'href' === $name)){
            continue;
        }
        $element->removeAttribute($name);
    }
}

//serialize the DOM and strip the doctype, html and body wrappers
echo preg_replace('~<(?:!DOCTYPE|/?(?:html|body))[^>]*>\s*~i', '', $dom->saveHTML());

/*
<p>After a fair few ...
winning the 2006 Production Privateer Championship.</p>
<img src="http://i2.photobucket.com/albums/y18/moo0484/scan0001.jpg"> 
text here
<img src="http://i2.photobucket.com/albums/y18/moo0484/01072007065.jpg">
more txt here
*/
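
If BBCode output is still what you need, as in the question, the same DOM approach can emit [img] tags directly instead of cleaning up attributes one by one. The following is only a sketch under some assumptions: the function name htmlToBbcode is made up for illustration, only http(s) image sources are kept, br elements become newlines, and the input is treated as ASCII/Latin-1 (loadHTML needs an encoding hint for UTF-8 input).

```php
<?php
// Hypothetical helper, not part of the original answer: turn each <img>
// into "[img]src[/img]" and drop all other markup.
function htmlToBbcode(string $html): string
{
    $dom = new DOMDocument();
    // @ suppresses libxml warnings about sloppy user markup.
    @$dom->loadHTML('<div>' . $html . '</div>');
    $xpath = new DOMXPath($dom);

    // Remove script and style elements together with their contents.
    foreach ($xpath->query('//script | //style') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Replace every image with a plain text node holding the BBCode.
    foreach ($xpath->query('//img') as $img) {
        $src = trim($img->getAttribute('src'));
        // Assumption: keep only http(s) sources; anything else becomes empty.
        $bbcode = preg_match('~^https?://~i', $src) ? '[img]' . $src . '[/img]' : '';
        $img->parentNode->replaceChild($dom->createTextNode($bbcode), $img);
    }

    // Turn <br> into newlines so the caller can apply nl2br() afterwards.
    foreach ($xpath->query('//br') as $br) {
        $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
    }

    // textContent concatenates the remaining text nodes, so no tags survive.
    return trim($dom->documentElement->textContent);
}

$sample = 'Intro.<br /> <img src="http://example.com/a.jpg" border="0" class="x" onload="evil();" alt="" /><br />Outro';
echo htmlToBbcode($sample);
```

Because the attributes are never copied, there is no leftover white space inside the tag to clean up, and a trailing slash on the img makes no difference. The returned string is plain text, so escape it with htmlspecialchars() before printing it into a page.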

Still, just look into using htmlpurifier. Also, the 1990s are calling and they want their BBCode back; use Markdown instead. ;p

Good luck

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow