Question

I need to be able to identify that a given file is an ODF file based on the contents of the file, and not on the file's extension.

ODF files are really a collection of XML files in a zip container, which means that I cannot use the file's magic number as it will just indicate that it is a zip file.

So what I'm really asking is: are there any files that are required to be present in an ODF container? If so, the presence of such a file in a zip container indicates that it is likely to be an ODF file, and its absence indicates that it definitely is not one.


Solution

Why not check out the ODF Technical Specification? The mimetype file described there is probably the ideal thing to check: open the mimetype entry in the zip container and look for the vnd.oasis.opendocument string in its contents.
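As a rough illustration of that check, here is a minimal Python sketch using the standard zipfile module. The function name looks_like_odf and the constant ODF_MIME_PREFIX are made up for this example, and it assumes the media type is stored in a root entry named "mimetype" and starts with "application/vnd.oasis.opendocument". Check the specification before relying on it.

```python
import zipfile

# Assumed prefix of every ODF media type (e.g. "...opendocument.text").
ODF_MIME_PREFIX = b"application/vnd.oasis.opendocument"


def looks_like_odf(path):
    """Return True if `path` is a zip whose 'mimetype' entry names an ODF type."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        try:
            # ZipFile.read raises KeyError when the entry does not exist.
            mimetype = archive.read("mimetype")
        except KeyError:
            return False
    return mimetype.strip().startswith(ODF_MIME_PREFIX)
```

A plain zip file without a mimetype entry returns False here, which may also reject unusual packages that omit the entry, so treat a negative result with some care.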

OTHER TIPS

As I understand it, there will always be one or more .xml files in the root of the archive, and these files will always contain the string <office:document very near the beginning.

All those I have seen seem to contain a file called "content.xml" in the root, which does contain this string.

There are not so many applications writing ODF documents, and in the past, there was basically just one. So it shouldn't be too difficult to install some ancient version of OpenOffice, save a few files, and check that this rule applies as it does on current ODF files.

I would test with something like this on a batch of known ODF files, to check whether it is reliable:

$ unzip -p "$FILE" content.xml 2>/dev/null | grep -q '<office:document' && echo yes || echo NO
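If unzip and grep are not available, the same heuristic can be sketched in Python with the standard zipfile module. The function name has_odf_content and the probe_bytes parameter are invented for this example. Unlike the grep above, it only inspects the first few kilobytes of content.xml, on the assumption that the marker appears near the start.

```python
import zipfile


def has_odf_content(path, probe_bytes=4096):
    """Return True if the zip at `path` has a root content.xml whose first
    `probe_bytes` bytes contain the '<office:document' marker."""
    try:
        with zipfile.ZipFile(path) as archive:
            # ZipFile.open raises KeyError when content.xml is absent.
            with archive.open("content.xml") as member:
                head = member.read(probe_bytes)
    except (KeyError, zipfile.BadZipFile):
        # No content.xml entry, or not a zip archive at all.
        return False
    return b"<office:document" in head
```

As with the shell version, it is worth running this over a batch of known ODF files before trusting it.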

With the document open in OpenOffice, read the Build ID from a Basic macro. If it is missing, the document is not ODF.

oDoc = ThisComponent
If oDoc.BuildID = "" Then
    bIsNotODF = TRUE
Endif
Licensed under: CC-BY-SA with attribution