Question

I want to extract the text inside certain tags such as <dt>, <dd>, etc. from HTML files using Apache Tika.

So I am writing a custom ContentHandler that is supposed to extract the information from these tags.

My custom ContentHandler code is shown below. It is not yet complete, but it is already not working as expected:

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Extends DefaultHandler so that only the callbacks of interest need to be overridden
public class TableContentHandler extends DefaultHandler {

    // key = abbreviation
    // value = information / description for abbreviation
    private Map<String, String> abbreviations = new HashMap<String, String>();

    // current abbreviation
    private String abbreviation = null;

    // <dd> element contains abbreviation. So this boolean variable will be set when
    // a <dd> element is found
    private boolean ddElementStarted = false;

    // this method is not giving contents within <dd> and </dd> tags
    @Override
    public void characters(char[] chars, int start, int length) throws SAXException {
        if (ddElementStarted) {
            System.out.println("chars found...");
        }
    }

    // set boolean ddElementStarted to true to indicate that the content handler
    // found a <dd> element
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (localName.equalsIgnoreCase("dd")) {
            ddElementStarted = true;
        }
    }
}

My assumption is that as soon as the content handler enters the startElement() method and the element name is dd, I set ddElementStarted = true. Then, to get the contents between <dd> and </dd>, I check that flag in the characters() method.

In the characters() method I check whether ddElementStarted is true, expecting the chars array to hold the contents between <dd> and </dd>, but it is not working :(

I would like to know:

  1. Am I going in the correct direction?
  2. Is this the proper way to parse HTML using Tika, or is there another way?
  3. Should I choose another HTML parsing API such as JSoup? I just need the information from a couple of tags; I am not interested in the rest of the HTML page.
  4. Is there any way to specify XPath expressions in Apache Tika? I was not able to find this information in the Tika in Action book.

Solution

The simplest solution is Jsoup, which makes it easy to get the text inside any tag. So instead of writing a new ContentHandler, just use Jsoup to parse the HTML.
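
For comparison, if you would rather stay with a SAX-style handler, the approach from the question could be completed roughly as in the sketch below. It is only an illustration under a few assumptions: the class name DefinitionListHandler and the parse() helper are hypothetical, <dt> is assumed to hold the abbreviation and <dd> its description, and the input is fed through the JDK's own SAX parser on well-formed XHTML so that the example runs without extra libraries. The same handler should in principle be usable with a Tika parser, since Tika also reports its output as SAX events, but that is not shown here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects <dt>/<dd> pairs into a map.
// Assumption: <dt> holds the abbreviation, <dd> holds its description.
class DefinitionListHandler extends DefaultHandler {

    private final Map<String, String> abbreviations = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentTerm = null;
    private boolean capturing = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equalsIgnoreCase("dt") || name.equalsIgnoreCase("dd")) {
            text.setLength(0);   // start a fresh buffer for this element
            capturing = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times per element, so append
        // only the given slice of the array instead of reading it whole
        if (capturing) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equalsIgnoreCase("dt")) {
            currentTerm = text.toString().trim();
            capturing = false;   // reset the flag when the element closes
        } else if (name.equalsIgnoreCase("dd")) {
            if (currentTerm != null) {
                abbreviations.put(currentTerm, text.toString().trim());
            }
            capturing = false;
        }
    }

    Map<String, String> getAbbreviations() {
        return abbreviations;
    }

    // Hypothetical helper: parses well-formed XHTML with the JDK SAX parser.
    static Map<String, String> parse(String xhtml) {
        try {
            DefinitionListHandler handler = new DefinitionListHandler();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getAbbreviations();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

The two points that differ from the handler in the question are that characters() reads the (start, length) slice it is given, and that endElement() resets the flag so text outside the element is not captured.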

License: CC-BY-SA with attribution
Not affiliated with StackOverflow