Is Jsoup causing repeated full GCs because of too many ParseError objects?
Problem
Jsoup is a very convenient tool for parsing HTML, and we use it as a basic utility in our crawler project. Recently, though, I noticed that our crawler kept running full GCs.
After dumping the object histogram with jmap, I was amazed to find a huge number of ParseError objects. From reading the source code, ParseError is not an exception but a plain object. When an HTML document is malformed, it is likely to produce a lot of errors, so the number of recorded errors should be limited to keep objects from being created without bound.
Some details follow; I hope they help in finding a solution.
java.lang.Thread.State: RUNNABLE
    at org.jsoup.parser.Tokeniser.error(Tokeniser.java:211)
    at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1170)
    at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
    at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
    at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
    at org.jsoup.parser.Parser.parse(Parser.java:24)
    at org.jsoup.Jsoup.parse(Jsoup.java:44)
 num     #instances         #bytes  class name
----------------------------------------------
   1:      30110820     1204432800  org.jsoup.parser.ParseError
   2:         33076      156025088  [Ljava.lang.Object;
   3:         68836       98796360  [C
   4:         65808        9778264  <constMethodKlass>
   5:         65808        8959520  <methodKlass>
   6:         12044        8524088  [B
   7:          6424        7447912  <constantPoolKlass>
   8:        102203        5494560  <symbolKlass>
   9:          6424        4909064  <instanceKlassKlass>
  10:          5271        4171032  <constantPoolCacheKlass>
  11:        105257        3368224  java.lang.String
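As a quick sanity check on the histogram, here is a small sketch that parses one row and works out the average size per instance. The HistoRow class and its method names are made up for illustration; it only assumes the four whitespace-separated columns shown above (rank, instance count, byte count, class name).

```java
/** Hypothetical helper: one row of a class histogram (rank, #instances, #bytes, class name). */
final class HistoRow {
    final long instances;
    final long bytes;
    final String className;

    private HistoRow(long instances, long bytes, String className) {
        this.instances = instances;
        this.bytes = bytes;
        this.className = className;
    }

    /** Parses a row such as "1: 30110820 1204432800 org.jsoup.parser.ParseError". */
    static HistoRow parse(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length < 4) {
            throw new IllegalArgumentException("Not a histogram row: " + line);
        }
        // parts[0] is the rank ("1:"), which is not needed here.
        return new HistoRow(Long.parseLong(parts[1]), Long.parseLong(parts[2]), parts[3]);
    }

    /** Average shallow size of one instance, in bytes. */
    long bytesPerInstance() {
        return bytes / instances;
    }

    public static void main(String[] args) {
        HistoRow row = HistoRow.parse("1: 30110820 1204432800 org.jsoup.parser.ParseError");
        // Prints: org.jsoup.parser.ParseError: 40 bytes per instance
        System.out.println(row.className + ": " + row.bytesPerInstance() + " bytes per instance");
    }
}
```

So each ParseError is small (40 bytes), and it is the roughly 30 million instances that add up to about 1.2 GB of heap.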
Solution
@BalusC thanks for your hint!
After reading the source code carefully, I found that trackErrors is always enabled and there is no API to set it to false. What's more, trackErrors appears to serve no purpose. I fixed this and republished the package, but I still find it strange. Is this a mistake?
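Code 1:
    // Initialized to true, and nothing exposes a way to change it.
    private boolean trackErrors = true;
Code 2:
    void error(TokeniserState state) {
        // Since trackErrors is always true, every unexpected character allocates a ParseError.
        if (trackErrors)
            errors.add(new ParseError("Unexpected character in input", reader.current(), state, reader.pos()));
    }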
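For anyone who wants to keep some error information instead of switching tracking off entirely, one option is to cap how many errors are stored. The following is only a minimal sketch of that idea, not the actual jsoup code or my patch: ErrorNote, BoundedErrorList, and maxErrors are hypothetical names, and ErrorNote is a simplified stand-in for ParseError.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for ParseError, holding only what this sketch needs. */
final class ErrorNote {
    final String message;
    final int position;

    ErrorNote(String message, int position) {
        this.message = message;
        this.position = position;
    }
}

/** Stores at most maxErrors notes; anything beyond the cap is counted but never allocated. */
final class BoundedErrorList {
    private final int maxErrors;
    private final List<ErrorNote> errors;
    private long dropped;

    BoundedErrorList(int maxErrors) {
        if (maxErrors < 0) {
            throw new IllegalArgumentException("maxErrors must be >= 0");
        }
        this.maxErrors = maxErrors;
        this.errors = new ArrayList<>(Math.min(maxErrors, 16));
    }

    /** True while there is still room under the cap. */
    boolean canAddError() {
        return errors.size() < maxErrors;
    }

    /** Records an error only if the cap allows it; the check happens before any allocation. */
    void error(String message, int position) {
        if (canAddError()) {
            errors.add(new ErrorNote(message, position));
        } else {
            dropped++;
        }
    }

    List<ErrorNote> errors() {
        return Collections.unmodifiableList(errors);
    }

    long dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        BoundedErrorList tracker = new BoundedErrorList(100);
        for (int pos = 0; pos < 1_000_000; pos++) {
            tracker.error("Unexpected character in input", pos);
        }
        // Prints: 100 stored, 999900 dropped
        System.out.println(tracker.errors().size() + " stored, " + tracker.dropped() + " dropped");
    }
}
```

With a cap of 0 this behaves like trackErrors = false, so a single setting covers both "off" and "limited" tracking, and a badly broken page can no longer fill the heap with error objects.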