Question

//Parse an HTML file into text while preserving carriage returns

StringBuffer temp = new StringBuffer(html);
final StringBuffer sb = new StringBuffer(); // this will be my output
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    public boolean readyForNewline;

    @Override
    public void handleText(final char[] data, final int pos) {
        String s = new String(data);
        sb.append(s.trim() + " ");
        readyForNewline = true;
    }

    @Override
    public void handleStartTag(final HTML.Tag t,
                               final MutableAttributeSet a,
                               final int pos) {
        if (readyForNewline &&
                (t == HTML.Tag.DIV || t == HTML.Tag.BR ||
                 t == HTML.Tag.P || t == HTML.Tag.TR)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override
    public void handleSimpleTag(final HTML.Tag t,
                                final MutableAttributeSet a,
                                final int pos) {
        handleStartTag(t, a, pos);
    }
};
try {
    new ParserDelegator().parse(new StringReader(temp.toString()),
            parserCallback, false);
} catch (IOException e) {
    return null;
}

This code works fine on small HTML files, but when I try to parse a ~4 MB HTML file that has been converted to a string, it throws an IOException and I have no idea why. The exception comes from inside that try block; it took me a while to find because the catch block swallows it and nothing is printed to the console.

Basically, this code is meant to take HTML files and strip away the tags while preserving line breaks. I found it on SO and am borrowing it. Alternative solutions are fine too, but out of JSoup and the many others I tried, this is the only one that achieves what I want (on small files, anyway). Is there any reason this code would throw an IOException when the file is too big? Is there a way to fix that?

Thanks a ton!

EDIT: Here's the stack trace:

javax.swing.text.ChangedCharSetException
    at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(Unknown Source)
    at javax.swing.text.html.parser.Parser.startTag(Unknown Source)
    at javax.swing.text.html.parser.Parser.parseTag(Unknown Source)
    at javax.swing.text.html.parser.Parser.parseContent(Unknown Source)
    at javax.swing.text.html.parser.Parser.parse(Unknown Source)
    at javax.swing.text.html.parser.DocumentParser.parse(Unknown Source)
    at javax.swing.text.html.parser.ParserDelegator.parse(Unknown Source)
    at org.SmartTable.SmartTable.htmlToText(SmartTable.java:293)
    at org.SmartTable.SmartTable.<init>(SmartTable.java:35)

Solution

new ParserDelegator().parse(new StringReader(temp.toString()), parserCallback, true);

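// Change the last argument from false to true so the parser ignores charset declarations.

The third argument of ParserDelegator.parse is ignoreCharSet. When it is false, the parser throws a ChangedCharSetException (a subclass of IOException) as soon as it reaches a <meta> tag that declares a charset, which is what the stack trace shows. The exception is tied to that tag, so the size of the file is probably not the cause.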
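To make the effect of the flag easier to see, here is a minimal, self-contained sketch that wraps the callback from the question in a method and runs it with both values of the flag. The class name HtmlToText, the method htmlToText, the sample HTML string, and the choice to rethrow as UncheckedIOException are all assumptions made for illustration; they are not part of the original code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlToText {

    // Hypothetical helper: strips tags and starts a new line at block-level tags.
    // ignoreCharSet is passed straight through to ParserDelegator.parse.
    static String htmlToText(String html, boolean ignoreCharSet) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean readyForNewline;

            @Override
            public void handleText(char[] data, int pos) {
                sb.append(new String(data).trim()).append(' ');
                readyForNewline = true;
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (readyForNewline
                        && (t == HTML.Tag.DIV || t == HTML.Tag.BR
                            || t == HTML.Tag.P || t == HTML.Tag.TR)) {
                    sb.append('\n');
                    readyForNewline = false;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                handleStartTag(t, a, pos);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, ignoreCharSet);
        } catch (IOException e) {
            // Surface the failure instead of silently returning null.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input (an assumption): a page whose <meta> tag declares a charset.
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
                + "</head><body><p>Hello</p><p>World</p></body></html>";

        // With ignoreCharSet = true the <meta> tag is skipped and parsing continues.
        System.out.println(htmlToText(html, true));

        // With ignoreCharSet = false the same input is expected to fail.
        try {
            htmlToText(html, false);
        } catch (UncheckedIOException e) {
            System.err.println("Parse failed: " + e.getCause());
        }
    }
}
```

Rethrowing (or at least logging) the exception in the catch block would also have shown the ChangedCharSetException right away, instead of the method returning null with nothing on the console.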

License: CC-BY-SA with attribution
Not affiliated with StackOverflow