Quite easily:
$testinput = "<script>alert('p0wned');</script >\n
<a href='http://example.org' onclick=\"alert('p0Wned again!')\">Click me!</a>";
var_export(cleanInput($testinput));
Also, htmlescape is almost always the wrong thing to use: it will mangle utf8 input. Also, you should not be storing html-escaped data in your DB. I'm not even sure why you use it here at all. Won't you have to unescape the html to display it?
However, you are going about this the wrong way.
- Do not parse/sanitize html with regexes. Use a real html parser such as DOMDocument or html5lib or even tidylib. Unfortunately PHP doesn't seem to have anything as wonderful as Bleach on Python, so you will have to roll your own. An XSLT stylesheet with a whitelist seems like it might be a good way to handle this particular sanitization condition. Update: another user pointed out HTML Purifier, which is also a whitelist-based html sanitizer. I've never used it but it looks like "Bleach in PHP". You should definitely investigate.
- Prefer escaping to sanitization. PHP culture has an obsession with sanitization which is really just plain wrong. Escape data at the boundaries of your application (output and database). In the core of your application your data should be in its native form without any escaping.
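As a rough illustration of the "real parser plus whitelist" idea, here is a minimal sketch built on DOMDocument. The whitelist contents and the names ALLOWED_TAGS, stripDisallowed and sanitizeFragment are made up for this example, and it simply drops any disallowed element together with its contents. Treat it as a starting point under those assumptions, not a vetted sanitizer; a maintained library such as HTML Purifier is likely the safer choice.

```php
<?php
// Hypothetical whitelist: allowed tag name => allowed attribute names.
const ALLOWED_TAGS = [
    'p' => [], 'br' => [], 'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'a' => ['href'],
];

// Remove every child that is not text or a whitelisted element, then recurse.
function stripDisallowed(DOMNode $node): void
{
    // Copy the live node list first, because nodes are removed while iterating.
    foreach (iterator_to_array($node->childNodes, false) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue;
        }
        $tag = $child instanceof DOMElement ? strtolower($child->tagName) : null;
        if ($tag === null || !array_key_exists($tag, ALLOWED_TAGS)) {
            $node->removeChild($child); // drops the node and everything inside it
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            $allowed = in_array($name, ALLOWED_TAGS[$tag], true);
            // Only plain http(s) links survive; this rejects javascript: URLs.
            $safeUrl = $name !== 'href' || preg_match('#^https?://#i', $attr->nodeValue);
            if (!$allowed || !$safeUrl) {
                $child->removeAttributeNode($attr);
            }
        }
        stripDisallowed($child);
    }
}

function sanitizeFragment(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // The XML declaration is a common hint to make libxml read the input as UTF-8;
    // the wrapper <div> keeps the whole fragment under one known node.
    $doc->loadHTML('<?xml encoding="utf-8"?><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    stripDisallowed($root);

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$dirty = "<script>alert('p0wned');</script>"
       . "<a href='http://example.org' onclick=\"alert('again')\">Click me!</a>";
$clean = sanitizeFragment($dirty);
echo $clean, "\n";
```

With the input above, the script element and the onclick attribute should be gone while the link text and its href remain.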
A general outline of processing is like so:
Input
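- Turn off magic quotes in your php settings. Include code at the top of your app to fail hard if it's on. (Magic quotes were removed in PHP 5.4 and get_magic_quotes_gpc() itself in PHP 8.0, so the check below only calls the function where it still exists.)
if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()) die('TURN OFF MAGIC QUOTES!!!!');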
- Validate and normalize/sanitize specific fields of your input according to the expected type of each field. For example, a "dollar amount" has different validation criteria than a whitelisted html fragment field. (Probably you should find and use a validation library.)
- If there are errors, send them back to the user with an appropriate HTTP response code.
- Save your data to the database using a database library that supports parameter binding, such as the PDO library with prepared statements. This way you do not need to remember to escape data by hand.
- On success, redirect (code 303) to a page displaying the created or modified record.
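The input steps above can be sketched roughly as follows. Everything named here is hypothetical: the donations table, the field names, the helpers parseDollarAmount and handleCreate, the /donations/ URL and the choice of status 400 for validation errors. It uses an in-memory SQLite database (assuming the pdo_sqlite extension is available) only so the example is self-contained. The handler returns a description of the response instead of sending headers, which keeps it easy to test.

```php
<?php
// Validate a "dollar amount" field; returns integer cents, or null if invalid.
function parseDollarAmount(string $raw): ?int
{
    $raw = trim($raw);
    if (!preg_match('/^\d{1,7}(\.\d{2})?$/D', $raw)) {
        return null;
    }
    // Work in integer cents to avoid float rounding.
    [$dollars, $cents] = array_pad(explode('.', $raw), 2, '00');
    return (int)$dollars * 100 + (int)$cents;
}

// Validate, save with bound parameters, and describe the HTTP response.
function handleCreate(PDO $db, array $post): array
{
    $errors = [];
    $name = trim((string)($post['name'] ?? ''));
    if ($name === '') {
        $errors['name'] = 'Name is required.';
    }
    $cents = parseDollarAmount((string)($post['amount'] ?? ''));
    if ($cents === null) {
        $errors['amount'] = 'Amount must look like 12 or 12.50.';
    }
    if ($errors) {
        return ['status' => 400, 'errors' => $errors];
    }

    // No manual escaping: the driver binds the values for us.
    $stmt = $db->prepare(
        'INSERT INTO donations (name, amount_cents) VALUES (:name, :cents)'
    );
    $stmt->execute([':name' => $name, ':cents' => $cents]);

    return ['status' => 303, 'location' => '/donations/' . $db->lastInsertId()];
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE donations (id INTEGER PRIMARY KEY, name TEXT, amount_cents INTEGER)');

// The name is stored exactly as typed, quotes and all.
$result = handleCreate($db, ['name' => "O'Brien", 'amount' => '12.50']);

// In a real request handler you would then do something like:
// header('Location: ' . $result['location'], true, 303);
```

Note that the name goes into the database in its native form; it only gets escaped later, when it is written into html.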
Output
- Retrieve data from the database.
- Feed the data to a template which is PHP code that only deals with html display of data structures. It should not know details of how that data is retrieved or contain any "application-driving" behavior. Treat a template like a function that accepts a data structure and returns a string.
- Escape your data inside your template. Individual fields of your data will need to be escaped differently. You almost always need to run it through htmlspecialchars before output; the only case you would not do that is when the data you need to display is already html (i.e. your whitelist-sanitized html fields). Define a helper function like this and use it in your templates:
function h($str) { return htmlspecialchars($str, ENT_QUOTES, 'utf-8'); }
Even better, try to use a template library that automatically escapes strings for you and that requires you to turn off escaping explicitly. (The common case should be simple to avoid errors, and having to escape is the common case!)
- Your html page is the string returned from your template. You may now display it to the user.
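To make the "template as a function" idea concrete, here is a small sketch. The record fields (title, author, body_html) and the function name renderRecord are invented for illustration; the h helper is the one defined above, with a string cast added. The template only turns a data structure into a string and escapes every field except the one that is assumed to hold already-sanitized html.

```php
<?php
// Escape a value for use in html text or attribute values.
function h($str)
{
    return htmlspecialchars((string)$str, ENT_QUOTES, 'utf-8');
}

// A template is just a function: data structure in, html string out.
function renderRecord(array $record): string
{
    return '<article>'
        . '<h1>' . h($record['title']) . '</h1>'
        . '<p class="author">By ' . h($record['author']) . '</p>'
        // Assumed to be whitelist-sanitized html already, so it is NOT escaped.
        . '<div class="body">' . $record['body_html'] . '</div>'
        . '</article>';
}

$page = renderRecord([
    'title'     => '<script>alert(1)</script> & friends',
    'author'    => "O'Brien",
    'body_html' => '<p>Some <b>bold</b> text.</p>',
]);
echo $page, "\n";
```

The script tag in the title comes out as inert text (&lt;script&gt;...), while the sanitized body keeps its markup.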