I see two solutions to your problem: a generic one and an ad-hoc one.
1 Generic
To get the content of a web page you can strip the boilerplate (navigation, ads, footers and so on) with a tool such as boilerpipe. The library gives you back the extracted text. However, you have pretty much no control over what boilerpipe decides to keep or discard.
2 Ad-Hoc
You can use Jsoup to remove the unwanted nodes from the tree. To do so, first parse the HTML into a Jsoup document:
Document doc = Jsoup.parse(html);
Then use Jsoup selectors to pick the nodes you want to remove from the page. See the documentation here: Jsoup selectors. Once the nodes are selected, call the remove method on them. Element inherits remove() from Node, and Elements.remove() removes every matched element at once.
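
Put together, the ad-hoc approach could look roughly like the sketch below. The class name StripNodes, the helper stripAndGetText, the sample HTML and the selector "nav, footer, .ads" are all made up for illustration; you would have to adapt the selector to the pages you are actually scraping.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StripNodes {

    // Hypothetical helper: parses the HTML, removes every element matching
    // the given CSS selector, and returns the text left in the body.
    static String stripAndGetText(String html, String unwantedSelector) {
        Document doc = Jsoup.parse(html);
        // select() returns an Elements collection; remove() detaches each match from the DOM
        doc.select(unwantedSelector).remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Sample page, invented for this example
        String html = "<html><body>"
                + "<nav>Menu</nav>"
                + "<div class=\"ads\">Buy now</div>"
                + "<p>Real content</p>"
                + "<footer>Copyright</footer>"
                + "</body></html>";

        // Assumed selector: drop navigation, footer and anything with class "ads"
        System.out.println(stripAndGetText(html, "nav, footer, .ads"));
    }
}
```

Compared with boilerpipe, this gives you full control over what is dropped, at the price of maintaining a selector per site layout.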