Question

I am using HTML Purifier to remove all unnecessary or malicious HTML tags.

$html = 'dirty html provided by user';
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,a[href], ... other tags');
$purifier = new HTMLPurifier($config);
$output = $purifier->purify($html);

It works really well, but I want to do a little more. I want to change every <a href='link'>...</a> into something like <a href='somefunc(link)' rel="nofollow" target="_blank">...</a>.

After searching for a while, I found a relevant link, but the approach it describes requires patching a complex library, which is not a good idea, and the solution itself is rather complicated.

Reading through their forum, it looks like there is a solution for adding the nofollow attribute, $config->set("HTML.Nofollow", true);, but I still cannot find out how to modify every link.

My current solution is to parse the purified HTML myself and modify each link, but I suspect there is a way to do this through HTML Purifier itself.


Solution

Actually, I found a partial solution through one of the links on the forum.

This is what I need to do:

$config->set('HTML.Nofollow', true);
$config->set('HTML.TargetBlank', true);

So the full thing looks like this:

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Nofollow', true);
$config->set('HTML.TargetBlank', true);
$config->set('HTML.Allowed', 'a[href],b,strong,i,em,u'); // href must be listed, otherwise it is stripped
$purifier = new HTMLPurifier($config);
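
The two directives above cover rel and target, but not the href itself. If somefunc(link) amounts to sending visitors through a redirect URL, HTML Purifier's URI.Munge directive may already be enough: %s in its value is replaced with the URL-encoded original address. A minimal sketch, assuming $config is the configuration object built above and using a made-up example.com endpoint as a placeholder:

```php
// Set this before calling new HTMLPurifier($config).
// Hypothetical redirect endpoint; %s receives the URL-encoded original link.
$config->set('URI.Munge', 'https://example.com/redirect?u=%s');
```

This only helps when the rewritten link can be expressed as a URL template; anything more involved needs a custom URI filter.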

OTHER TIPS

HTML Purifier offers an API for URL mangling.

See http://htmlpurifier.org/docs/enduser-uri-filter.html

Basically, you create a filter class like this:

class HTMLPurifier_URIFilter_MyPostFilter extends HTMLPurifier_URIFilter
{
    public $name = 'MyPostFilter';
    public $post = true;
    public function prepare($config) {}
    public function filter(&$uri, $config, $context) {
        // ... extra code here
        return true; // keep the URI; returning false removes it
    }
}

You do your magic in the filter function. Have a look at the documentation for the semantics of the URI object that gets passed in.

You can then activate the filter with

$uri = $config->getDefinition('URI');
$uri->addFilter(new HTMLPurifier_URIFilter_MyPostFilter(), $config);
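
A fuller sketch of such a filter, assuming the goal is to route every link through a redirect URL. The my_wrap_link() helper and the example.com endpoint are hypothetical stand-ins for somefunc(link). The class is wrapped in a class_exists() check only so that the snippet can be loaded without HTML Purifier present; in a real project the check is unnecessary.

```php
<?php
// Hypothetical stand-in for somefunc(link): wraps a URL in a redirect address.
function my_wrap_link(string $url): string
{
    return 'https://example.com/out?u=' . rawurlencode($url);
}

// Declared only when HTML Purifier is loaded (assumption: via its autoloader).
if (class_exists('HTMLPurifier_URIFilter')) {
    class HTMLPurifier_URIFilter_MyPostFilter extends HTMLPurifier_URIFilter
    {
        public $name = 'MyPostFilter';
        public $post = true; // run after HTML Purifier's own URI checks

        public function filter(&$uri, $config, $context)
        {
            // Leave embedded resources such as <img src> untouched.
            if ($context->get('EmbeddedURI', true)) {
                return true;
            }
            // Replace the URI object with the parsed, rewritten address.
            $parser = new HTMLPurifier_URIParser();
            $uri = $parser->parse(my_wrap_link($uri->toString()));
            return true; // returning false would drop the URI entirely
        }
    }
}
```

The filter is registered with addFilter() exactly as shown above, before the first call to purify().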

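Alternatively, you can post-process the HTML with preg_replace(). Note that this pattern only matches links whose href is single-quoted and is the only attribute. The regex would be: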

/<a href='(\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|])'>([a-zA-Z0-9\s._\-]*)<\/a>/

So the function would be:

$pattern = "/<a href='(\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|])'>([a-zA-Z0-9\s._\-]*)<\/a>/";
$replacement = "<a href='$1' rel='nofollow' target='_blank'>$2</a>";
$html = preg_replace($pattern, $replacement, $html);
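
To see what the pattern does and does not catch, here is a small runnable check. The pattern and replacement are copied from above; the two sample inputs are made up for illustration.

```php
<?php
$pattern = "/<a href='(\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|])'>([a-zA-Z0-9\s._\-]*)<\/a>/";
$replacement = "<a href='$1' rel='nofollow' target='_blank'>$2</a>";

// Made-up sample inputs: one single-quoted href, one double-quoted href.
$single = "<a href='http://example.com/page'>Example</a>";
$double = '<a href="http://example.com/page">Example</a>';

$singleOut = preg_replace($pattern, $replacement, $single);
$doubleOut = preg_replace($pattern, $replacement, $double);

echo $singleOut, "\n"; // the link gains rel and target
echo $doubleOut, "\n"; // unchanged: the pattern expects single quotes
```

Because the second input passes through untouched, check how your HTML is quoted before relying on this pattern.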

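If you also want to do something with the URL, note that somefunction("$1") would be called once with the literal string "$1" while the replacement string is being built, not once per match. Use preg_replace_callback() so that the function receives each matched URL:

$html = preg_replace_callback($pattern, function ($m) {
    return "<a href='" . somefunction($m[1]) . "' rel='nofollow' target='_blank'>" . $m[2] . "</a>";
}, $html);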

An explanation of the regex, with examples.

Edit: Adding attributes to links in HTML Purifier:

$def = $config->getHTMLDefinition(true);
$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');

More about adding attributes in HTML Purifier

Licensed under: CC-BY-SA with attribution