Question

  

It hardly seems like HTML that needs purifying.

Why does HTML Purifier turn that string into a question mark when it should obviously be a space?

My exact HTML purification code is:

//purify the html input
include_once('inc/htmlpurifier-4.4.0/library/HTMLPurifier.auto.php');

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');

if (defined('PURIFIER_CACHE')) {
    $config->set('Cache.SerializerPath', PURIFIER_CACHE);
} else {
    # Disable the cache entirely
    $config->set('Cache.DefinitionImpl', null);
}

$input = $_POST["about_me"];

# Help out the Purifier a bit, until it develops this functionality
while (($cleaner = preg_replace('!<(em|strong)>(\s*)</\1>!', '$2', $input)) != $input) {
    $input = $cleaner;
}

$filter = new HTMLPurifier($config);
$htmlpurified_output = $filter->purify($input);

I have UTF-8 enabled in my PHP page headers and also for MySQL when saving the information.

I am able to write, save to the DB, and re-display other UTF-8 characters inside other textareas on the same page. The culprit is definitely HTML Purifier returning question marks in place of the actual characters.

I will answer any other questions I can.


Solution

And the answer is...

Always make sure your encoding is properly set in all areas.

I had the "about_me" column of the table set to accept only ASCII characters. Duh.
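
For anyone who lands here with the same symptom, this is a rough sketch of why an ASCII-only destination produces a question mark. It does not touch MySQL or HTML Purifier at all; it only uses PHP's mbstring extension to imitate a lossy conversion to ASCII. The sample text and the choice of a no-break space (U+00A0) are assumptions made for illustration, not something taken from the original input.

```php
<?php
// Illustration only: imitate what an ASCII-only destination does to
// non-ASCII text. Requires the mbstring extension and PHP 7+ (for "\u{...}").

// Assumed sample input: a no-break space (U+00A0) between two words.
$nbsp  = "\u{00A0}";
$input = "Hello" . $nbsp . "world";

// Characters the target encoding cannot represent are replaced by the
// mbstring substitute character; set it to "?" (0x3F) explicitly here.
mb_substitute_character(0x3F);

// Convert from UTF-8 to plain ASCII, the same kind of lossy step that
// happens when UTF-8 text is stored in an ASCII-only column.
$stored = mb_convert_encoding($input, 'ASCII', 'UTF-8');

echo $stored, PHP_EOL; // prints "Hello?world"
```

The purified string is still fine at this point in the real code; the damage happens only when it is written somewhere that cannot hold the character. So the thing to check is the character set of the column itself, not just the page headers and the database connection.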

Sorry for wasting everybody's time.

Licensed under: CC-BY-SA with attribution