Question
I am looking for good methods of manipulating HTML in PHP. For example, the problem I currently have is dealing with malformed HTML.
I am getting input that looks something like this:
<div>This is some <b>text
As you noticed, the HTML is missing closing tags. I could use regex or an XML Parser to solve this problem. However, it is likely that I will have to do other DOM manipulation in the future. I wonder if there are any good PHP libraries that handle DOM manipulation similar to how Javascript deals with DOM manipulation.
Solution
PHP has a PECL extension that gives you access to the features of HTML Tidy. Tidy is a pretty powerful library that should be able to take code like that and close tags in an intelligent manner.
I use it to clean up malformed XML and HTML sent to me by a classified ad system prior to import.
OTHER TIPS
I've found PHP Simple HTML DOM to be the most useful and straight forward library yet. Better than PECL I would say.
I've written an article on how to use it to scrape myspace artist tour dates (just an example.) Here's a link to the php simple html dom parser.
The DOM library which is now built-in can solve this problem easily. The loadHTML method will accept malformed XML while the load method will not.
$d = new DOMDocument;
$d->loadHTML('<div>This is some <b>text');
$d->saveHTML();
The output will be:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<div>This is some <b>text</b></div>
</body>
</html>
For manipulating the DOM i think that what you're looking for is this. I've used to parse HTML documents from the web and it worked fine for me.