Question

I have created a small Java test project locally in the NetBeans IDE (7.4 on Mac OS X) to extract content and metadata from various files.

I've tried PDF, TXT, and PPT files, and the only metadata I get back is "Content-Type". I have tried both a plain InputStream and the newer TikaInputStream, with no success so far.

I compiled version 1.4 of Tika and added tika-parsers-1.4.jar and tika-core-1.4.jar to the project.

I hope someone can spot the obvious.

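    import static java.lang.System.out;

    import java.io.File;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    public static void TikaExtract(String fileName) throws Exception {

        // try-with-resources closes the stream even if parsing throws
        try (TikaInputStream tikaStream = TikaInputStream.get(new File(fileName))) {

            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            Parser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
            parser.parse(tikaStream, textHandler, metadata, context);

            // Check if there is anything in tikaStream
            out.println("File Length: " + tikaStream.getLength());

            out.println("Title: " + metadata.get("title"));
            out.println("Content type: " + metadata.get("Content-Type"));
            out.println("Author: " + metadata.get("Author"));
            out.println("content: " + textHandler.toString());

            out.println(tikaStream.toString());
        }
    }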

Output from the above code (with data/sample.pdf as input) looks like this:

    File Length: 730808
    Title: null
    Content type: application/pdf
    Author: null
    content:
    TikaInputStream of data/sample.pdf

Solution

Found a working solution, though probably not the ideal one.

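Replace all the current libraries (added by hand, not through Maven) with the single tika-server-1.4.jar. That jar appears to bundle the third-party libraries the parsers rely on (such as PDFBox and POI), which were presumably missing when only tika-core and tika-parsers were on the classpath.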

Please feel free to comment.
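One plausible reason the swap helps is that tika-parsers-1.4.jar does not itself contain the third-party libraries its parsers need, so with only the two Tika jars on the classpath the PDF and Office parsers cannot load and nothing beyond the detected content type comes back. That is an assumption about the cause, not something confirmed in this thread. A minimal sketch for checking it, using only the JDK: the class name `ClasspathCheck` and the helper `isPresent` are made up for illustration, while the probed class names are real classes from PDFBox, POI, and Tika.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic helper, not part of Tika.
class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Label -> a representative class from each library the parsers depend on
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("Tika PDF parser", "org.apache.tika.parser.pdf.PDFParser");
        probes.put("Apache PDFBox (PDF)", "org.apache.pdfbox.pdmodel.PDDocument");
        probes.put("Apache POI (PPT/DOC/XLS)", "org.apache.poi.poifs.filesystem.POIFSFileSystem");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isPresent(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + ": " + status);
        }
    }
}
```

Run it from inside the same project so it sees the same classpath as the extraction code. If the Tika parser class is found but PDFBox or POI is reported as missing, adding those libraries (or a jar that bundles them) is the likely fix.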

Other answers

Using tika-server instead of tika-core solved the problem for me, too. I did this with Grape, which resolves the artifact from Maven repositories.

That is, simply replacing:

    @Grab(group='org.apache.tika', module='tika-core', version='1.4')

with:

    @Grab(group='org.apache.tika', module='tika-server', version='1.4')

worked.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow