Question

I am trying to crawl the archives of a local newspaper and am getting the desired result. Is there any way to program the crawler so that static elements such as the Home button and the footers, which are the same on every page, are not included in the crawl?

This is the code I am using to display the crawled data:

System.out.println(Jsoup.parse(html).body().text());

Solution

I see two solutions to your problem: a generic one and an ad-hoc one.

1 Generic

To get the content of a web page, you can strip the boilerplate using a tool such as boilerpipe. The library returns the text it extracts; however, you have very little control over what happens inside boilerpipe.

2 Ad-Hoc

You can use Jsoup to remove the unwanted nodes from the tree. To do this, first get the document parsed by Jsoup:

Document doc = Jsoup.parse(html);

Then use Jsoup selectors to pick the nodes you want to remove from the pages (see the Jsoup selectors documentation). Once the nodes are selected, call the remove method of the Element class on them.
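The steps above could look roughly like the following. This is only a minimal sketch: the class name BoilerplateStripper, the helper mainText, and the selector "nav, header, footer, .menu" are assumptions made for illustration, since the right selector depends entirely on the markup of the site you crawl.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BoilerplateStripper {

    // Hypothetical selector: adjust it to the markup of the site being crawled.
    static final String UNWANTED = "nav, header, footer, .menu";

    /** Returns the body text of the page after dropping the nodes matched by the selector. */
    public static String mainText(String html, String unwantedSelector) {
        Document doc = Jsoup.parse(html);
        // Elements.remove() detaches every matched element from the document tree.
        doc.select(unwantedSelector).remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Made-up sample page with a navigation bar and a footer around the story.
        String html = "<html><body>"
                + "<nav><a href=\"/\">Home</a></nav>"
                + "<div id=\"story\"><p>Council approves new bridge.</p></div>"
                + "<footer>Contact us</footer>"
                + "</body></html>";
        System.out.println(mainText(html, UNWANTED));
    }
}
```

Inspect a few archive pages in the browser first to find the ids or classes of the navigation and footer blocks, then put those in the selector.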

Other suggestions

What about the shouldVisit method? You can add conditions based on URL patterns, for example:

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !href.contains("static/button/url/");
    }

That works for me.
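If there is more than one URL pattern to skip, the check can be moved into a small helper that shouldVisit calls. The sketch below uses only the JDK; the class name UrlFilter, the method isCrawlable, and the extra fragments "/about/" and "/contact/" are hypothetical and should be replaced with whatever the target site actually uses.

```java
import java.util.Locale;

public class UrlFilter {

    // Hypothetical path fragments; replace them with the ones the target site uses.
    private static final String[] EXCLUDED_FRAGMENTS = {
        "static/button/url/", "/about/", "/contact/"
    };

    /** Returns true when the URL contains none of the excluded fragments. */
    public static boolean isCrawlable(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        for (String fragment : EXCLUDED_FRAGMENTS) {
            if (href.contains(fragment)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up URLs for illustration only.
        System.out.println(isCrawlable("http://example.com/archive/2014/05/story.html"));
        System.out.println(isCrawlable("http://example.com/static/button/url/home"));
    }
}
```

Inside the crawler, shouldVisit would then simply return UrlFilter.isCrawlable(url.getURL()).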

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow