Question

I'm trying to get both the categories assigned to a given Wikipedia article and the href values of the links to those categories.

Example:

Given this article:

"Breads" and "Breakfast foods" are the category names, and "http://en.wikipedia.org/wiki/Category:Breads" and "http://en.wikipedia.org/wiki/Category:Breakfast_foods" are the category links.

I'm doing this in Java, using Jerry from the Jodd library to get a jQuery-like API.

I've used the following code so far to get the category names:

    File file = new File(SystemUtil.getTempDir(), "temp");
    NetUtil.downloadFile(url, file);
    Jerry doc = Jerry.jerry(FileUtil.readString(file));
    String category = doc.$("div#mw-normal-catlinks").text();

This returns the plain text inside the catlinks div. Since this div contains a ul whose li elements each represent a single category, it seems cleaner to iterate over the list items to get the category names and links.

To do that I tried the following:

    doc.$("div#mw-normal-catlinks").children().each(new CategoryFinder());

The idea is to use a JerryFunction object to get the name and link for each child (each() takes a JerryFunction as its parameter). As you may notice, I call children() on the div instead of on the ul element, because I don't know how to select the ul.

How can I make this approach work? Also, is there another way to get the category names and links?


Solution

You should probably use the Wikipedia API, but anyway, here is how to do this with Jodd Jerry:

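    // Download the article to a temporary file and parse it with Jerry.
    File file = FileUtil.createTempFile();
    NetUtil.downloadFile("http://en.wikipedia.org/wiki/Toast", file);
    Jerry doc = Jerry.jerry(FileUtil.readString(file));

    // Select the category box, then visit each list item inside its ul.
    Jerry category = doc.$("div#mw-normal-catlinks");
    category.$("ul li").each(
        new JerryFunction() {
            public boolean onNode(Jerry $this, int index) {
                // Print the category name; the link target sits on the nested
                // <a> element, so $this.$("a").attr("href") should return it.
                System.out.println($this.text());
                return true; // keep iterating
            }
        });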

This would print out:

    Breads
    Breakfast foods
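
The answer recommends the Wikipedia API without showing it, so here is a minimal sketch of the request side using only the Java standard library. It assumes the standard MediaWiki endpoint at https://en.wikipedia.org/w/api.php and the `action=query` / `prop=categories` parameters. The class name `WikiCategoryQuery` and its helper methods are hypothetical names chosen for illustration. Fetching the URL and parsing the JSON response are left out, since they would need an HTTP client and a JSON library of your choice.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Jodd or of any Wikipedia client library.
final class WikiCategoryQuery {

    // Assumed endpoint: the MediaWiki API of the English Wikipedia.
    static final String API = "https://en.wikipedia.org/w/api.php";

    /** Builds a query URL that asks for the non-hidden categories of one article. */
    static String apiUrl(String articleTitle) {
        return API
            + "?action=query&prop=categories&format=json"
            + "&cllimit=max&clshow=" + encode("!hidden")
            + "&titles=" + encode(articleTitle);
    }

    /**
     * Maps a category title as returned by the API, e.g. "Category:Breakfast foods",
     * to its page URL by replacing spaces with underscores. Titles containing other
     * special characters would additionally need percent-encoding.
     */
    static String categoryPageUrl(String categoryTitle) {
        return "https://en.wikipedia.org/wiki/" + categoryTitle.trim().replace(' ', '_');
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Request URL for the article used in the answer.
        System.out.println(apiUrl("Toast"));

        // Category titles as they would appear in the API response.
        for (String title : List.of("Category:Breads", "Category:Breakfast foods")) {
            String name = title.substring("Category:".length());
            System.out.println(name + " -> " + categoryPageUrl(title));
        }
    }
}
```

The response to such a request lists each category under its full title (with the "Category:" prefix), so both the name and the link can be derived from that one string, without scraping the article HTML.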
Licensed under: CC-BY-SA with attribution