As a preamble: I will indeed look at HtmlUnit, as suggested by @Sage.
In the meantime, I have come up with the following solution:
a) HtmlCleaner actually has a DomSerializer that converts the cleaned HTML into a W3C DOM Document (effectively XHTML):
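    import javax.xml.parsers.ParserConfigurationException;
    import org.htmlcleaner.*;
    import org.w3c.dom.Document;

    // Cleans arbitrary (possibly malformed) HTML and returns it as a well-formed DOM tree.
    public static Document toXhtml(String html) throws ParserConfigurationException {
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode tagNode = cleaner.clean(html);
        DomSerializer domSerializer = new DomSerializer(new CleanerProperties());
        return domSerializer.createDOM(tagNode);
    }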
b) Once we have XHtml as a DOM Document, we have plenty of options for processing it: just use Xalan, for example.
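
For step b), here is a minimal sketch of what "just use Xalan" could look like. It assumes the standard JAXP `javax.xml.transform` API, which Xalan implements (the JDK also ships its own XSLT processor behind the same API). The class name `XsltOnDom`, the helpers `transform` and `parse`, and the sample stylesheet are made up for illustration. `parse` only stands in for `toXhtml` so the example runs without HtmlCleaner on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XsltOnDom {

    // Applies an XSLT stylesheet (passed as a string) to a DOM document
    // and returns the transformation result as text.
    static String transform(Document doc, String xslt) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical stand-in for toXhtml(html): parses markup that is
    // already well formed, so the sketch has no HtmlCleaner dependency.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><title>Hello</title></head><body><p>text</p></body></html>");
        // Sample stylesheet: outputs the page title as plain text.
        String xslt =
                "<xsl:stylesheet version=\"1.0\""
              + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
              + "<xsl:output method=\"text\"/>"
              + "<xsl:template match=\"/\">"
              + "<xsl:value-of select=\"/html/head/title\"/>"
              + "</xsl:template>"
              + "</xsl:stylesheet>";
        System.out.println(transform(doc, xslt));
    }
}
```

In real use you would pass the Document returned by `toXhtml(html)` instead of calling `parse`. If the elements in that DOM turn out to be in the XHTML namespace, the XPath expressions in the stylesheet would need a matching namespace prefix.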