Getting paragraph count from Tika for both Word and PDF

https://stackoverflow.com/questions/14990619

10-03-2022
|

Question

I have a scenario where I need to reconcile two documents, an Word (.docx) doc as well as a PDF. The two are supposed to be "indentical" to each other (the PDF is just a PDF version of the DOCX file); meaning they should contain the same text, content, etc.

Specifically, I need to make sure that both documents contain the same number of paragraphs. So I need to read the DOCX, get the paragraph count, then read the PDF and grab its paragraph count. If both numbers are the same, then I'm in business.

It looks like Apache Tika (I'm interested in 1.3) is the right tool for the job here. I see in this source file that Tika supports the notion of paragraph counting, but trying to figure out how to get the count from both documents. Here's my best attempt but I'm choking on connecting some of the final dots:

InputStream docxStream = new FileInputStream("some-doc.docx");
InputStream pdfStream = new FileInputStream("some-doc.pdf");

ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
Parser parser = new OfficeParser();
ParseContext pc = new ParseContext();

parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);

docxStream.close();
pdfStream.close();

int docxParagraphCount = docxMeta.getXXX(???);
int pdfParagraphCount = pdfMeta.getXXX(???);

if(docxParagraphCount == pdfParagraphCount)
    setInBusiness(myself, true);

So I ask: have I set this up correctly or am I way off base? If off-base, please lend me some help to get me back on track. And if I have set things up correctly, then how do I get the desired counts out of the two Metadata instances? Thanks in advance.

Solution

First up, Tika will only give you back the metadata contained within your documents. It won't compute anything for you. So, if one of your documents lacks the paragraph count metadata, you're out of luck. If one of your documents has duff data (i.e. the program that wrote the file out got it wrong), you'll be out of luck.

Otherwise, your code is nearly there, but not quite. You most likely want to use DefaultParser or AutoDetectParser - OfficeParser is for the Microsoft file formats only, while the others automatically load all the available parsers and pick the correct one.

The property you want is PARAGRAPH_COUNT, which comes from the Office metadata namespace. Your code would be something like:

TikaInputStream docxStream = TikaInputStream.get(new File("some-doc.docx"));
TikaInputStream pdfStream = TikaInputStream.get(new File("some-doc.pdf"));

ContentHandler handler = new DefaultContentHandler();
Metadata docxMeta = new Metadata();
Metadata pdfMeta = new Metadata();
ParseContext pc = new ParseContext();

Parser parser = TikaConfig.getDefaultConfig().getParser();

parser.parse(docxStream, handler, docxMeta, pc);
parser.parse(pdfStream, handler, pdfMeta, pc);

int docxParagraphCount = docxMeta.getInt(Office.PARAGRAPH_COUNT);
int pdfParagraphCount = pdfMeta.getInt(Office.PARAGRAPH_COUNT);

If you don't care about the text at all, only the metadata, pass in a dummy content handler

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow