Question

We are using Tika 1.1 to extract content from an XLSM file. We run two instances of our server. On one instance the file content is extracted correctly, but on the other the same file triggers a zip bomb exception. Both instances use the same Tika standalone JAR, and I have not been able to identify the cause.

I am not sure whether the SAX configuration is causing a problem at runtime (I am not well versed in SAX). How can I debug this issue?
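One way to check whether the two servers really see the same SAX setup is to print which JAXP parser factory each JVM resolves and where it was loaded from. The sketch below uses only JDK classes. The class name SaxEnvironmentReport and its helper methods are hypothetical names chosen for illustration, and a differing factory would only be a lead to follow, not proof of the cause.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

public class SaxEnvironmentReport {

    /** Returns the jar or directory a class was loaded from, or a marker if none is recorded. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source; likely bundled with the JDK)";
        }
        return source.getLocation().toString();
    }

    /** Builds a short report of the JAXP SAX setup visible to this JVM. */
    public static String report() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        StringBuilder out = new StringBuilder();
        out.append("java.version = ")
           .append(System.getProperty("java.version")).append('\n');
        // A non-null value here means someone overrode the default factory.
        out.append("javax.xml.parsers.SAXParserFactory = ")
           .append(System.getProperty("javax.xml.parsers.SAXParserFactory")).append('\n');
        out.append("factory class = ")
           .append(factory.getClass().getName()).append('\n');
        out.append("factory loaded from = ")
           .append(locationOf(factory.getClass())).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Run it with the same classpath and JVM options as each server and compare the output line by line. Inside the application, locationOf could also be applied to a Tika class to confirm that both instances load it from the JAR you expect.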

Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:123)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    at com.ptc.search.solr.contentReader.contentExtraction.TikaExtractor.getContent(TikaExtractor.java:36)
    ... 45 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 878 levels of XML element nesting
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:234)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:244)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:274)
    at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:313)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.extractHeaderFooter(XSSFExcelExtractorDecorator.java:145)
    at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:129)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:104)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:110)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    ... 47 more

Solution

After debugging the Tika code, I realized that I had set maxStringLength on WriteOutContentHandler, and the code was throwing the zip bomb error once that limit was reached. A more accurate error message would have helped me find this sooner. Anyway, thanks to all for the input. We definitely plan to move to the latest release.

Should we file a defect in Jira so that a proper error message is thrown?
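To illustrate the mechanism described above, here is a minimal sketch, using only JDK SAX classes, of how a write limit can abort a parse by throwing a SAXException and how a caller can tell that case apart from a genuine failure. This is a simplified model, not Tika's source. WriteLimitSketch, LimitedTextHandler and WriteLimitReachedException are hypothetical names. In Tika itself, the counterparts are, as far as I know, Tika.setMaxStringLength(int) and WriteOutContentHandler.isWriteLimitReached(Throwable).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Marker exception so callers can tell "limit reached" from real failures. */
    public static class WriteLimitReachedException extends SAXException {
        WriteLimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    /** Collects character data and aborts the parse once the limit is hit. */
    public static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep the part that still fits
                throw new WriteLimitReachedException(limit);
            }
            text.append(ch, start, length);
        }

        String text() {
            return text.toString();
        }
    }

    /** Walks the cause chain looking for the marker exception. */
    public static boolean isWriteLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof WriteLimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most `limit` characters of text; rethrows anything that is not the limit. */
    public static String extract(String xml, int limit) {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!isWriteLimitReached(e)) {
                throw new IllegalStateException("Unexpected SAX failure", e);
            }
            // Limit reached: truncated output is the expected result, not an error.
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parser setup or I/O failure", e);
        }
        return handler.text();
    }
}
```

The point of the marker type is that a write-limit abort and a security abort both surface as SAXExceptions. If a wrapper inspects the exception before the caller does and misclassifies it, the reported message can point at the wrong cause, which matches the confusion described in this answer.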

OTHER TIPS

I resolved this problem by installing unoconv:

emerge app-office/unoconv

and executing

$ unoconv -fpdf file.xlsm

This creates a .pdf file in the same directory as the input file, which you can then send to Tika.

My server runs Gentoo, so adapt the install command to your distribution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow