Question

I am using FileUtils (Apache Commons IO) to compare two identical PDFs. This is the code:

boolean comparison = FileUtils.contentEquals(pdfFile1, pdfFile2);

Although both PDF files are identical, I keep getting false. I also noticed that when I execute:

byte[] byteArray = FileUtils.readFileToByteArray(pdfFile1);
byte[] byteArrayTwo = FileUtils.readFileToByteArray(pdfFile2);
System.out.println(byteArray);
System.out.println(byteArrayTwo);

I get the following bytecode for the two pdf files:

[B@3a56f631
[B@233d28e3

So even though both PDF files are absolutely identical visually, their byte-code is different, and hence the boolean test fails. Is there any way to test whether two visually identical PDF files are in fact identical?

Solution

Unfortunately, for PDF there is a big difference between having "identical files" and having files that are "visually identical". So the first question is which of the two you are looking for.

One very simple example: information in a PDF file can be stored compressed or uncompressed, and it can be compressed with different compression filters. Taking a file where some of the content is not compressed and compressing that content with, for example, a ZIP (Flate) compression filter would give you two files that are very different on a byte level, yet look exactly the same.
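The effect is easy to reproduce outside of PDF. The following is only a sketch using the JDK's java.util.zip classes; the class name CompressionDemo and the sample content string are made up for illustration, and no real PDF file is read or written. It shows that the same content has different bytes once it is compressed, while decompressing gives back the original content.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionDemo {

    // Compress the input with the zlib/deflate algorithm.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress data produced by deflate().
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // Stop instead of looping forever on truncated or unusual input.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical page content, used only as sample data.
        byte[] raw = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] compressed = deflate(raw);

        System.out.println(Arrays.equals(raw, compressed));          // false
        System.out.println(Arrays.equals(raw, inflate(compressed))); // true
    }
}
```

A byte-level comparison only sees the first result; a viewer, which decompresses before drawing, only sees the second.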

So you can do a number of different things to compare PDF files:

1) If you want to check whether you have "the same file", read them in and calculate some sort of checksum as answered before by Peter Petrov.

2) If you want to know whether or not the files are visually identical, the most common method is some kind of rendering: render all pages to images and compare the images. In practice this is not as simple as it sounds, and there are both simple (for example callas pdfToolbox) and complex (for example Global Vision DigitalPage) applications that implement some kind of "sameness" algorithm (caution, I'm related to both of those vendors).

So define very well what exactly you need first, then choose carefully which approach would work best.
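To give an idea of the comparison step in option 2, here is a minimal sketch. It assumes each page has already been rendered to a java.awt.image.BufferedImage by some PDF renderer, which is not shown; the class and method names are made up for illustration. It counts pixels that differ exactly, which is the strictest possible criterion; the "sameness" algorithms mentioned above are more tolerant than this.

```java
import java.awt.image.BufferedImage;

public class PageImageCompare {

    // Count the pixels whose ARGB value differs between two rendered pages.
    // Both images must have the same dimensions.
    static long countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Images have different sizes");
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Stand-ins for two rendered pages; a real test would render the PDFs.
        BufferedImage page1 = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage page2 = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        page2.setRGB(1, 1, 0xFF0000); // change a single pixel to red

        System.out.println(countDifferingPixels(page1, page2)); // 1
    }
}
```

Exact pixel equality will often report differences that nobody can see (slightly different anti-aliasing, for instance), so a practical check would probably need a threshold or a per-pixel tolerance.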

OTHER TIPS

Yes, generate an MD5 sum for both files.

See if these sums are identical.

If they are, then your files are identical too, with a certainty which is practically 100%.

If the sums are not identical, then your files are different for sure.

To generate the MD5 sums, on Linux there's an md5sum command; for Windows there's a small tool called fciv.

http://www.microsoft.com/en-us/download/details.aspx?id=11533
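The same check can also be done from Java with the JDK's MessageDigest class. This is only a sketch: the class name Md5Compare and its helper methods are made up for illustration, and each file is read fully into memory, which is fine for small PDFs but not for very large files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Compare {

    // Return the MD5 digest of the data as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when both files produce the same MD5 sum.
    static boolean sameChecksum(Path first, Path second) throws IOException {
        return md5Hex(Files.readAllBytes(first))
                .equals(md5Hex(Files.readAllBytes(second)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Md5Compare <file1> <file2>");
            return;
        }
        System.out.println(sameChecksum(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

For two files that are both on the local disk, this tells you the same thing as FileUtils.contentEquals; a checksum is mainly useful when you want to store the sum or compare files that live on different machines.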

Just to note, the two identifiers you wrote

[B@3a56f631
[B@233d28e3

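are different because they belong to two different objects. They are not bytecode: printing an array gives the default Object.toString() output, which is the type code ([B for a byte array) followed by the object's identity hash code in hexadecimal. Two objects can be logically equal even if they are not the same object (e.g. they have different identifiers), so this output says nothing about whether the contents of the two arrays match.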
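A small sketch of the difference, using made-up sample bytes instead of real files (the class name ByteArrayPrintDemo is only for illustration): two separate arrays with the same content print different identifiers, and Arrays.equals is what compares the content.

```java
import java.util.Arrays;

public class ByteArrayPrintDemo {

    // Content comparison: true when both arrays have the same length and bytes.
    static boolean sameContent(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        byte[] first = {37, 80, 68, 70};  // the ASCII bytes of "%PDF"
        byte[] second = {37, 80, 68, 70}; // a separate array with the same content

        // Prints something like [B@3a56f631; the hex part varies per object and per run.
        System.out.println(first);
        System.out.println(second);

        System.out.println(first == second);         // false: two different objects
        System.out.println(first.equals(second));    // false: arrays inherit Object.equals
        System.out.println(sameContent(first, second)); // true: same bytes
        System.out.println(Arrays.toString(first));  // [37, 80, 68, 70]
    }
}
```

Applied to the question, Arrays.equals(byteArray, byteArrayTwo) would be the way to compare the two arrays that were read from the files.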

Otherwise, calculating an MD5 checksum as peter.petrov wrote is a good idea.

Licensed under: CC-BY-SA with attribution