Question

I use the Jericho parser in my application to build a lighter version of a web page by extracting some parts of it. So, for instance, when I get this code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN/" "http://www.w3.org/TR/html4/loose.dtd"><html> <head> </head> <body> <b> <span class="articletitletext">Happy New Year!</span></b> <br> <span class="postedstamp">Posted By <script language="JavaScript" type="text/javascript"> <!-- document.write('<a href="&#32;&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#99;&#104;&#114;&#105;&#115;&#46;&#119;&#121;&#109;&#97;&#110;&#64;&#118;&#101;&#114;&#105;&#122;&#111;&#110;&#46;&#110;&#101;&#116;">'); // --> </script>Chris</a> on January 1, 2012</span><br> <br> <span id="intelliTXT">

From all of us here at TheForce.net, we wish you and your family a safe and Happy New Year. May the Force be with you in 2012!

</span></body> </html>

I'd like to parse it once again with the Jericho parser, but when I run

ArrayList<Element> centerElems=(ArrayList<Element>) pageSource.getAllElements(HTMLElementName.CENTER);

I get this exception:

01-01 10:46:37.518: ERROR/AndroidRuntime(648): java.lang.RuntimeException: Unable to start activity ComponentInfo{net.test.theforce/net.test.theforce.NewsListActivity}: java.lang.RuntimeException: java.lang.ClassCastException: java.util.Collections$EmptyList

and the application crashes. So, what's wrong with the lighter page?


Solution

It looks to me like the Jericho parser can parse the HTML you gave it. The error you're getting arises because you've made an incorrect assumption about what the getAllElements() method returns.

I admit I could only find the Javadoc for the zero-argument overload of this method, not the one-argument overload that you're using, so I'll have to assume that both return the same type, List<Element>. Your example HTML contains no center elements, so getAllElements() should return an empty List<Element>. It doesn't have to return an ArrayList<Element> here; any implementation of List<Element> will do. In this case it returns Collections.emptyList(), which is the java.util.Collections$EmptyList named in your exception. That isn't an ArrayList<Element>, so the cast fails with a ClassCastException.
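
To see the failure without Jericho on the classpath, here is a minimal sketch. The findElements method is a hypothetical stand-in for pageSource.getAllElements(...), not Jericho's code, and it assumes the method hands back an ArrayList when something matches and Collections.emptyList() when nothing does. Strings stand in for Element objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyListCastDemo {

    // Hypothetical stand-in for pageSource.getAllElements(name); not Jericho code.
    // Assumption: an ArrayList when something matches, Collections.emptyList() otherwise.
    static List<String> findElements(boolean anyMatch) {
        if (anyMatch) {
            return new ArrayList<String>(Arrays.asList("<center>"));
        }
        return Collections.emptyList();
    }

    // Returns true if the downcast to ArrayList works, false if it throws.
    static boolean castSucceeds(List<String> list) {
        try {
            ArrayList<String> cast = (ArrayList<String>) list;
            return cast != null;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The cast only works when the runtime class really is ArrayList.
        System.out.println(castSucceeds(findElements(true)));   // true
        System.out.println(castSucceeds(findElements(false)));  // false
        // Prints the runtime class of the empty result.
        System.out.println(findElements(false).getClass().getName());
    }
}
```

The declared return type is only List, so the cast compiles either way; whether it works at run time depends on which List implementation comes back.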

As far as I can see, you have two options:

  • Firstly, you might not need the returned list to be an ArrayList<Element>. It might be sufficient to use List<Element> instead. In this case, you should replace the line

    ArrayList<Element> centerElems=(ArrayList<Element>) pageSource.getAllElements(HTMLElementName.CENTER);
    

    with

    List<Element> centerElems = pageSource.getAllElements(HTMLElementName.CENTER);
    
  • Secondly, if you really do need the list to be an ArrayList<Element>, then you can create an ArrayList<Element> from the results:

    ArrayList<Element> centerElems = new ArrayList<Element>(pageSource.getAllElements(HTMLElementName.CENTER));
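
Both options can be tried in isolation with the sketch below. Again, noMatches is a hypothetical stand-in for a getAllElements call that finds nothing, and Strings replace Element so that the example runs without Jericho.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptyListOptionsDemo {

    // Hypothetical stand-in for a getAllElements(...) call with no matches.
    static List<String> noMatches() {
        return Collections.emptyList();
    }

    // Option 1: keep the declared type List; no cast, so nothing can fail.
    static int countViaInterface() {
        List<String> elems = noMatches();
        return elems.size();
    }

    // Option 2: copy into a new ArrayList. The copy is independent of the
    // returned list and can be modified, unlike Collections.emptyList().
    static ArrayList<String> copyToArrayList() {
        ArrayList<String> elems = new ArrayList<String>(noMatches());
        elems.add("added later");
        return elems;
    }

    public static void main(String[] args) {
        System.out.println(countViaInterface());        // 0
        System.out.println(copyToArrayList().size());   // 1
    }
}
```

The first option is usually enough if you only read or iterate over the result. The copy is worth its small cost when you need to add or remove elements, because the list from Collections.emptyList() is immutable and add() on it throws UnsupportedOperationException.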
    
Licensed under: CC-BY-SA with attribution