Mimetype check using Tika jars

https://stackoverflow.com/questions/22225813

10-06-2023

Question

I am developing a standalone Java batch process. I am trying to determine the mimetype of file attachments using the Tika jars. I am using the Tika 1.4 jar files.

My code looks like this:

Parser parser= new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler =-1;
ContentHandler contentHandler= new BodyContentHandler(writerHandler);
Metadata metadata= new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: " + fileAttachment.getName() + " MimeType is: " + mimeType);

This code does not work properly for Office 2003 and 2007 documents.

When I run it from Eclipse, I get the correct mimetypes.

When I build a jar file and run it from the command line, it gives the wrong mimetypes.

Output from the command line:
------------
File Attachment: Testpdf.pdf  MimeType is: application/pdf
File Attachment: Testpdf.tif  MimeType is: image/tiff
File Attachment: Testpdf.xlsx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xltx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.pptx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.docx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xls  MimeType is: application/zip
File Attachment: Testpdf.doc  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.dot  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.ppt  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.xlt  MimeType is: application/vnd.ms-excel

I tried OfficeParser and OOXMLParser, but neither works. I have also tried the Tika 0.9 jar files: the mimetypes come out correctly, but if any one of my file attachments is an "editable PDF", my batch process dies (as if the code called "exit(0);"). With the newer Tika jars, I get the wrong mimetypes.

Please help me with this. Thanks in advance.

CVSR Sarma

Answer

Firstly, you're using the wrong bit of Apache Tika. If all you want to know is the file type, then you should use the Detection API (javadocs) directly, eg:

TikaConfig tika = new TikaConfig();

Metadata metadata = new Metadata();
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
// detect() returns a MediaType, so convert it to get the string form
String mimetype = tika.getDetector().detect(stream, metadata).toString();

If you have only the tika-core jar on your classpath, then the detection above will use Mime Magic and Filename hints. That'll let it get most files, especially if they have the right extension, but it'll struggle with wrongly named "container formats".

Container Formats are things like zip, ole2 etc, where one file format can hold many types (eg ods, xlsx, keynote all use .zip, .doc and .xls both use ole2). If you want to do detection that looks inside containers for more accurate results, you need to also include the tika-parsers-standard jar and its dependencies.
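To make the container problem concrete, here is a minimal JDK-only sketch of magic-plus-filename detection. It is not Tika's implementation: the ContainerSniffer class, its method names and the tiny extension table are all made up for illustration. It only shows why the leading bytes alone cannot tell an xlsx from an ods, and why a wrong extension leaves you with just the generic container type.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only; this is NOT how Tika is implemented.
final class ContainerSniffer {
    static final String ZIP = "application/zip";
    static final String OLE2 = "application/x-tika-msoffice";
    static final String UNKNOWN = "application/octet-stream";

    // ZIP local file header signature "PK\003\004" (xlsx, docx, ods, ...).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound document signature (doc, xls, ppt, ...).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // Step 1: magic bytes only identify the container, not what is inside it.
    static String container(byte[] head) {
        if (startsWith(head, ZIP_MAGIC)) return ZIP;
        if (startsWith(head, OLE2_MAGIC)) return OLE2;
        return UNKNOWN;
    }

    // Step 2: the extension refines the result, but only when it is
    // consistent with the container that the magic bytes revealed.
    static String detect(byte[] head, String filename) {
        String container = container(head);
        int dot = filename.lastIndexOf('.');
        String ext = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "xlsx":
                return container.equals(ZIP)
                    ? "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                    : container;
            case "ods":
                return container.equals(ZIP)
                    ? "application/vnd.oasis.opendocument.spreadsheet"
                    : container;
            case "doc":
                return container.equals(OLE2) ? "application/msword" : container;
            case "xls":
                return container.equals(OLE2) ? "application/vnd.ms-excel" : container;
            default:
                return container;
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detect(zipHead, "report.xlsx")); // extension matches the container
        System.out.println(detect(zipHead, "report.dat"));  // no usable hint: generic zip
    }
}
```

Looking inside the container (which is what the extra parsers jar adds) removes the dependence on the filename, at the cost of reading more of the file.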

Note that, as explained in the Javadocs, your stream needs to support mark and reset for detection to work. This is so that Tika can read the first bit of your stream, look at it to work out what your file is, then return the stream to how it was, ready for other uses (eg parsing). Most streams should, but if yours doesn't, the simplest way to fix it is to wrap it in a TikaInputStream via TikaInputStream.get, which sorts all that out for you.
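The mark/reset requirement can be seen with the JDK alone. The sketch below uses BufferedInputStream as a stand-in for the wrapping that TikaInputStream.get performs; the PeekDemo class and its peek and nonMarkable helpers are hypothetical names for this example, not Tika API.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the "peek, then rewind" pattern a detector needs.
final class PeekDemo {

    // Reads up to n leading bytes, then rewinds the stream to where it was.
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(n); // remember the current position
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) break; // end of stream reached early
                total += r;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // hand the stream back untouched
        }
    }

    // A stream without mark/reset support: a bare InputStream subclass
    // inherits markSupported() == false.
    static InputStream nonMarkable(byte[] data) {
        final ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = nonMarkable("%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII));
        System.out.println("raw supports mark: " + raw.markSupported());

        // Wrapping adds the buffering needed for mark/reset.
        InputStream wrapped = new BufferedInputStream(raw);
        byte[] head = peek(wrapped, 4);
        System.out.println("peeked: " + new String(head, StandardCharsets.US_ASCII));
        System.out.println("next byte after peek: " + (char) wrapped.read());
    }
}
```

After the peek, the wrapped stream still starts at the first byte, so it can be passed on to a parser unchanged.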

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow