Question

My application downloads a certain website as an HTML file the first time it is started. The HTML file is very messy, of course, so I want to clean it with HtmlCleaner so that I can then parse it with Jsoup. But how do I get a new, cleaned HTML file after it has been cleaned?

I did some research and this is all I could find:

HtmlCleaner htmlCleaner = new HtmlCleaner();

TagNode root = htmlCleaner.clean(url);

HtmlCleaner.getInnerHtml(root);

String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";

But I can't see where this code writes to a new file. If it doesn't, how do I implement it so that the old file is deleted and a new, cleaned HTML file is created?


Solution

You can do something like the following:

HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.themoscowtimes.com/";

TagNode node = cleaner.clean(new URL(siteUrl));


// serialize to xml file
new PrettyXmlSerializer(cleaner.getProperties()).writeToFile(
    node, "cleaned.xml", "utf-8"
);

or

// serialize to html file
SimpleHtmlSerializer serializer = new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/cleaned.html");
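
If the old file should be replaced rather than kept next to the cleaned one, one possible approach is to write the cleaned markup to a temporary file and then move it over the original. The sketch below uses only java.nio.file and is not part of HtmlCleaner. The class name CleanedFileWriter, the method replaceWithCleaned, and the file name page.html are hypothetical, and the cleanedHtml string stands in for whatever the cleaner produced (for example the html string built in the question).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CleanedFileWriter {

    // Hypothetical helper: overwrites `original` with the cleaned markup.
    public static void replaceWithCleaned(Path original, String cleanedHtml) throws IOException {
        // Create the temp file next to the original so the move stays on one file system.
        Path dir = original.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "cleaned", ".html");
        Files.write(tmp, cleanedHtml.getBytes(StandardCharsets.UTF_8));
        // REPLACE_EXISTING removes the old file and puts the new one in its place.
        Files.move(tmp, original, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the downloaded, messy page.
        Path page = Files.createTempDirectory("demo").resolve("page.html");
        Files.write(page, "<html><p>messy".getBytes(StandardCharsets.UTF_8));

        // Stand-in for the output of the cleaner.
        replaceWithCleaned(page, "<html><body><p>messy</p></body></html>");

        System.out.println(new String(Files.readAllBytes(page), StandardCharsets.UTF_8));
    }
}
```

Writing to a temporary file first means the original is only replaced once the cleaned version has been written completely. If overwriting in place is acceptable, passing the original path directly to serializer.writeToFile is simpler.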
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow