Question

I am looking for good methods of manipulating HTML in PHP. For example, the problem I currently have is dealing with malformed HTML.

I am getting input that looks something like this:

<div>This is some <b>text

As you can see, the HTML is missing its closing tags. I could use regex or an XML parser to solve this problem. However, it is likely that I will have to do other DOM manipulation in the future. Are there any good PHP libraries that handle DOM manipulation in a way similar to how JavaScript does?


Solution

PHP has a Tidy extension (originally distributed through PECL and bundled with PHP since version 5) that gives you access to the features of HTML Tidy. Tidy is a fairly powerful library that should be able to take markup like that and close the tags intelligently.

I use it to clean up malformed XML and HTML sent to me by a classified ad system prior to import.
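As a rough sketch of what that can look like, assuming the tidy extension is enabled in your PHP build: the helper name repairHtml is made up for this example, and the option values are just one reasonable choice rather than a recommended configuration.

```php
<?php
// Hypothetical helper: returns repaired markup, or null when tidy is unavailable.
function repairHtml(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // the tidy extension is not enabled in this PHP build
    }

    $config = [
        'show-body-only' => true, // return only the body contents, no <html>/<head> wrapper
        'wrap'           => 0,    // do not wrap long lines
    ];

    // tidy_repair_string() parses the input, repairs it, and returns the result
    $result = tidy_repair_string($html, $config, 'utf8');

    return $result === false ? null : $result;
}

$repaired = repairHtml('<div>This is some <b>text');
echo $repaired ?? "tidy extension not available\n";
```

With show-body-only turned off, Tidy returns a complete document instead of just the repaired fragment.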

OTHER TIPS

I've found PHP Simple HTML DOM to be the most useful and straightforward library yet. Better than the PECL Tidy extension, I would say.

I've written an article on how to use it to scrape MySpace artist tour dates (just an example). Here's a link to the PHP Simple HTML DOM Parser.

The DOM library, which is now built in, can solve this problem easily. The loadHTML method will accept malformed HTML, while the load method expects well-formed XML and will not.

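$d = new DOMDocument;
// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$d->loadHTML('<div>This is some <b>text');
// saveHTML() only returns the markup as a string, so echo it
echo $d->saveHTML();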

The output will be roughly the following (indented here for readability):

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <div>This is some <b>text</b></div>
  </body>
</html>
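Since the question also asks about JavaScript-style DOM manipulation, here is a minimal sketch of how the same built-in extension can be used for that once the markup is loaded. The class name, the appended paragraph, and the variable names are invented for illustration.

```php
<?php
$doc = new DOMDocument();

// Keep parser warnings about the malformed input out of the output
libxml_use_internal_errors(true);
$doc->loadHTML('<div>This is some <b>text');
libxml_clear_errors();

// Comparable to document.getElementsByTagName('b')[0] in JavaScript
$bold = $doc->getElementsByTagName('b')->item(0);
$bold->setAttribute('class', 'highlight');

// createElement() and appendChild() work much like their JavaScript counterparts
$div  = $doc->getElementsByTagName('div')->item(0);
$para = $doc->createElement('p', 'An appended paragraph');
$div->appendChild($para);

// DOMXPath allows more selective queries than tag-name lookups
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div/b[@class="highlight"]');

// Passing a node to saveHTML() serializes only that subtree
$fragment = $doc->saveHTML($div);
echo $fragment, "\n";
```

Serializing just the div avoids the doctype and html/body wrapper that loadHTML adds around a fragment.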

For manipulating the DOM, I think this is what you're looking for. I've used it to parse HTML documents from the web, and it worked fine for me.

Licensed under: CC-BY-SA with attribution