Validating HTML input using libxml

This post was originally made on the Assanka blog. Assanka was acquired by the Financial Times in January 2012, and became what is now FT Labs.

As web developers, we often need to ‘clean’ input from end users to ensure it does not contain any nasty formatting or script that we don’t want to allow on our sites. Forums in particular often either suffer from security holes that allow cross-site scripting (XSS) attacks, or are so restrictive in what they accept that they become a nuisance to the user (for example, disallowing all HTML but allowing BBCode instead).

This problem is often solved with complex classes or functions in PHP that are designed to strip out the nasty stuff while allowing as much useful formatting as possible. We realised that these functions are pretty much just reinventing the wheel, because there is already a pretty good mechanism for parsing and validating XML syntax: libxml, which has PHP bindings and can be accessed using SimpleXML.

What’s more, libxml can validate an XML document against a DTD, so if you reference the XHTML Transitional DTD in a DOCTYPE declaration at the start of your XML string, you can check that the markup is valid XHTML.

Here’s the PHP to do this. This is tested on PHP 5.3 with libxml2-2.6.26-2.1.2.8.

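function isXML($str) {
	// Collect libxml errors internally instead of emitting PHP warnings
	libxml_use_internal_errors(true);
	libxml_clear_errors();
	// Only load and validate against a DTD if the string declares one
	$options = (strpos($str, '<!DOCTYPE') !== false) ? (LIBXML_DTDLOAD | LIBXML_DTDVALID) : 0;
	simplexml_load_string($str, 'SimpleXMLElement', $options);
	$errors = libxml_get_errors();
	libxml_clear_errors();
	// Warnings are tolerated; any error or fatal error fails the check
	foreach ($errors as $error) {
		if ($error->level != LIBXML_ERR_WARNING) return false;
	}
	return true;
}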

You could of course use the contents of $errors to feed back to the user, or potentially deal with a validation failure more intelligently, but for now true or false will do.
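As a rough sketch of what that feedback might look like, the helper below turns libxml’s error objects into readable messages. The function name getXMLErrors is hypothetical and not part of the original code; it relies only on the standard line, column, message and level properties of LibXMLError.

```php
<?php
// Hypothetical helper (not from the original post): returns one readable
// message per libxml error, or an empty array if the string parses cleanly.
function getXMLErrors($str) {
	libxml_use_internal_errors(true);
	libxml_clear_errors();
	simplexml_load_string($str);
	$messages = array();
	foreach (libxml_get_errors() as $error) {
		// Skip warnings, mirroring the leniency of isXML
		if ($error->level == LIBXML_ERR_WARNING) {
			continue;
		}
		// libxml messages end with a newline, so trim them
		$messages[] = sprintf('Line %d, column %d: %s', $error->line, $error->column, trim($error->message));
	}
	libxml_clear_errors();
	return $messages;
}

// Example: the <em> element is never closed, so at least one error is reported
foreach (getXMLErrors('<p>An <em>unclosed</p>') as $message) {
	echo $message, "\n";
}
```

The exact wording of the messages comes from libxml and may differ between versions, so it is safer to display them than to match against them.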

So the markup submitted by a user is valid. Excellent. But just because the markup is valid doesn’t mean it’s safe to output to the browser. You’ll also want to ensure it contains no <script type="text/javascript"> sections or event handlers, and may want to restrict the set of elements available. This is where you can start getting creative with your own DTD spec. Just start with the standard you want to conform to for the whole page (say XHTML) and strip out anything you don’t like.

We’ll start by removing the HEAD tag and all its contents. Our users will not be writing entire documents, just fragments of body markup, so we don’t want a HEAD, TITLE, or any META tags, etc.

You can continue, removing things like SCRIPT, OBJECT, forms, frames, and so on. Be careful where elements are defined using presets (parameter entities in the DTD), which often contain the nasties. For example, the %events; set of attributes grants an element the ability to fire event handlers. Fortunately this is almost exclusively used as part of %attrs;, so we can just remove it from that superset.
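To illustrate, here is roughly what that one change looks like in a copy of the XHTML 1.0 Transitional DTD. This is a sketch of the idea, and not necessarily identical to the DTD written for this article.

```xml
<!-- Stock definition in XHTML 1.0 Transitional:
     ENTITY % attrs "%coreattrs; %i18n; %events;" -->

<!-- Restricted definition: no event handler attributes -->
<!ENTITY % attrs "%coreattrs; %i18n;">
```

A few elements declare event attributes directly instead of through this entity (body, for instance, has its own onload), so it is worth searching the DTD for any remaining attributes of type %Script; after making the change.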

We’ll also define a new root element fragment_under_test to ensure that we don’t cause any confusion and lead anyone to believe that they’re writing a normal <html> or <body>.
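The approach can be tried out without hosting a DTD at all by putting a tiny one in an internal subset. The example below is a minimal sketch under that assumption: the function name isAllowedFragment and the miniature DTD (which permits only p, em and strong, plus a class attribute on p) are invented for illustration and are far more restrictive than a trimmed XHTML DTD would be. Note that parameter entities such as %attrs; cannot be used inside declarations in an internal subset, so the attributes are spelled out.

```php
<?php
// Hypothetical example: validate a fragment against a miniature DTD held in
// an internal subset, so no external DTD file or HTTP request is involved.
function isAllowedFragment($str) {
	$doctype = '<!DOCTYPE fragment_under_test ['
		. '<!ELEMENT fragment_under_test (p)*>'
		. '<!ELEMENT p (#PCDATA | em | strong)*>'
		. '<!ELEMENT em (#PCDATA)>'
		. '<!ELEMENT strong (#PCDATA)>'
		. '<!ATTLIST p class CDATA #IMPLIED>'
		. ']>';
	libxml_use_internal_errors(true);
	libxml_clear_errors();
	$xml = $doctype . '<fragment_under_test>' . $str . '</fragment_under_test>';
	simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_DTDVALID);
	$errors = libxml_get_errors();
	libxml_clear_errors();
	// Any validity error, parse error or fatal error rejects the fragment
	foreach ($errors as $error) {
		if ($error->level != LIBXML_ERR_WARNING) {
			return false;
		}
	}
	return true;
}

var_dump(isAllowedFragment('<p class="intro">Hello <em>world</em></p>')); // declared elements and attributes only
var_dump(isAllowedFragment('<p onclick="alert(1)">Hi</p>'));              // onclick is not declared for p
var_dump(isAllowedFragment('<script>alert(1)</script>'));                 // script is not declared at all
```

Because the DTD is a whitelist, anything it does not declare fails validation, which is why the second and third fragments are rejected without any script-specific checks.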

Once we’re done, we can then wrap the isXML function in a convenience function that adds our new custom DTD.

function isXHTMLFragment($str) {
	// Single-quoted PHP strings let the DOCTYPE keep its double-quoted system identifier
	return isXML('<!DOCTYPE fragment_under_test SYSTEM "http://www.example.com/dtds/xhtml-content-restrictive.dtd"><fragment_under_test>'.$str.'</fragment_under_test>');
}

If you want, feel free to download the DTD I created for this article.

Now you can use the fast libxml to validate user input in a fairly bulletproof way.

Finally, and very importantly, make sure you keep local copies of the DTDs on your server and map them in an XML catalog file. If you don’t do this, libxml will make an external HTTP request for the DTD every time you call the function. In fact, since most web documents cite W3C DTDs, the W3C has had enormous problems with software making repeated requests for the standard XHTML, HTML 4 and other DTDs, which haven’t changed in years. Be a good net citizen, and cache your schemas. In this case we’re writing and hosting our own anyway, but if you’re using a public schema you may as well save yourself the pointless HTTP traffic, and it’ll speed up the validation as well.
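For reference, a catalog entry for the DTD used above might look something like the following. This is a sketch only: the local file path is a made-up example, and it assumes libxml2’s usual catalog lookup, which reads /etc/xml/catalog by default or the files listed in the XML_CATALOG_FILES environment variable.

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Map the public URL of the DTD to a local copy (hypothetical path) -->
  <system systemId="http://www.example.com/dtds/xhtml-content-restrictive.dtd"
          uri="file:///var/www/dtds/xhtml-content-restrictive.dtd"/>
</catalog>
```

It is worth confirming on your own server that the catalog is actually being consulted, for example by watching for outgoing requests while the validation runs.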