Question

Jsoup is a very convenient tool for parsing HTML, and we use it as a basic utility in our crawler project. Recently, though, I found that our crawler would sometimes get stuck running full GC over and over.

After dumping the heap histogram with jmap, I was amazed to find a huge number of ParseError objects: about 30 million instances occupying roughly 1.2 GB. From reading the source code, ParseError is not an exception but a plain object. When an HTML document is malformed, it is likely to produce a lot of these errors, so their creation should be bounded to keep the parser from allocating objects without limit.

Some details follow; I hope they help in finding a solution.

   java.lang.Thread.State: RUNNABLE
        at org.jsoup.parser.Tokeniser.error(Tokeniser.java:211)
        at org.jsoup.parser.TokeniserState$47.read(TokeniserState.java:1170)
        at org.jsoup.parser.Tokeniser.read(Tokeniser.java:42)
        at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:101)
        at org.jsoup.parser.TreeBuilder.parse(TreeBuilder.java:53)
        at org.jsoup.parser.Parser.parse(Parser.java:24)
        at org.jsoup.Jsoup.parse(Jsoup.java:44)

 num     #instances         #bytes  class name
----------------------------------------------
   1:      30110820     1204432800  org.jsoup.parser.ParseError
   2:         33076      156025088  [Ljava.lang.Object;
   3:         68836       98796360  [C
   4:         65808        9778264  <constMethodKlass>
   5:         65808        8959520  <methodKlass>
   6:         12044        8524088  [B
   7:          6424        7447912  <constantPoolKlass>
   8:        102203        5494560  <symbolKlass>
   9:          6424        4909064  <instanceKlassKlass>
  10:          5271        4171032  <constantPoolCacheKlass>
  11:        105257        3368224  java.lang.String

Solution

@BalusC thanks for your hint!

After reading the source code carefully, I found that trackErrors is always enabled and there is no API to set it to false. What's more, trackErrors seems to serve no purpose. I fixed this and republished the package, but I still find it odd. Is it a mistake?

code1:
    private boolean trackErrors = true;

code2:
    void error(TokeniserState state) {
        if (trackErrors)
            errors.add(new ParseError("Unexpected character in input", reader.current(), state, reader.pos()));
    }
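
For illustration, here is a minimal sketch of one way to keep the error list under control. It is not jsoup's actual code: BoundedErrorList, maxErrors and canAddError are hypothetical names, and a plain String stands in for ParseError. The idea is simply to check a cap before allocating, so a badly broken page can produce at most maxErrors objects, and a cap of 0 turns tracking off.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokeniser's error list; not part of jsoup.
class BoundedErrorList {
    private final int maxErrors;        // 0 disables tracking entirely
    private final List<String> errors;  // a String stands in for ParseError here

    BoundedErrorList(int maxErrors) {
        this.maxErrors = Math.max(0, maxErrors);
        this.errors = new ArrayList<>(Math.min(this.maxErrors, 16));
    }

    // True while there is still room under the cap.
    boolean canAddError() {
        return errors.size() < maxErrors;
    }

    // Check the cap first, so nothing is allocated once the list is full.
    void error(String message, int pos) {
        if (canAddError()) {
            errors.add(pos + ": " + message);
        }
    }

    int size() {
        return errors.size();
    }
}
```

With a cap of 3, for example, a thousand calls to error(...) leave only three entries in the list, and the other 997 calls allocate nothing.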
License: CC-BY-SA with attribution
Not affiliated with StackOverflow