Forum Discussion

rwendt
Regular Visitor
8 years ago

How to compare PDF files with Files.Compare() and the HashValue it returns?

I am trying to test a web application that generates a PDF download, specifically to verify that the PDF is generated correctly. I would like the test to compare the generated PDF to a reference file.

 

This appears to be possible using the Files.Compare method. As expected, the method did not treat two PDFs with the same content as identical, because PDFs store information such as the creation time, which differs from file to file. However, the method also lets you specify a hash value, which essentially allows the documents to differ by a certain amount. I was hoping to establish a hash value that accounts for small file-to-file changes, but I have been unable to do so.

 

I compared two PDFs that are identical except for their creation and last-modified properties. Files.Compare(file1, file2) returns false, as expected, but it also reports a hash value of about 2 billion in magnitude (the log reads "HashValue = -1913419433"). These PDFs are only a couple of pages long, nothing enormous. When I compare two completely different PDFs, the same method again returns false, as expected, but reports a significantly smaller hash value (about 3 million). The hash values are also returned as negative numbers, which seemed odd.

 

After reading SmartBear's explanation of the Compare method, I thought that a larger hash value meant more differences between the two files. But I seem to be observing the opposite.
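A small sketch may help with the sign and magnitude question. It assumes, without confirmation from the TestComplete documentation, that the logged HashValue behaves like a generic 32-bit checksum. `zlib.crc32` is only a stand-in for whatever algorithm Files.Compare really uses, and `signed32` and `checksum` are hypothetical helper names. Under that assumption, a negative number is just a value at or above 2^31 shown as a signed integer, and the size of the number says nothing about how much two files differ.

```python
import zlib


def signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    return value - (1 << 32) if value >= (1 << 31) else value


def checksum(data: bytes) -> int:
    """CRC32 of the data, shown as a signed 32-bit integer.

    Stand-in only: this is NOT claimed to be the algorithm behind HashValue.
    """
    return signed32(zlib.crc32(data) & 0xFFFFFFFF)


# Made-up byte strings standing in for file contents.
base = b"%PDF-1.4 example body " * 100
near_copy = base[:-1] + b"!"                    # differs in one byte
unrelated = b"completely different content " * 100

for label, data in [("base", base), ("near copy", near_copy), ("unrelated", unrelated)]:
    print(f"{label:10s} {checksum(data)}")
```

With a checksum of this kind, the near copy and the unrelated data both produce values that look unrelated to the base value, so neither a large nor a small number indicates similarity. As a side note, the logged -1913419433 is the signed form of the unsigned 32-bit value 2381547863.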

 

So, is this method a viable way of comparing PDF files? Is there a possible explanation for why similar (practically identical) files return a far larger hash value than two entirely different files? And finally, I would really like to know whether this apparently large hash value of around 2 billion actually allows only small changes between two almost identical files, or whether it is simply so large that any two files would be reported as identical.

 

Thank you!

1 Reply

  • maximojo
    Frequent Contributor

    I'm not sure about the hash issue, but see this thread on comparing PDF files. PDFBox would be the way to go.

     

    https://community.smartbear.com/t5/Desktop-Testing/Compare-pdf-files/m-p/122339

     

    Comparing at the byte level with Files.Compare is useful when you know the files should be EXACTLY the same (including timestamps, etc.), but I'd guess it could return false, perhaps seemingly at random(?), when you're checking at such a fine resolution.
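If a byte-level check is still wanted, one possible workaround is to blank out the fields that are expected to change before comparing. The sketch below is plain Python rather than a TestComplete or PDFBox API, and the names `VOLATILE`, `normalize`, and `same_ignoring_metadata` are hypothetical. It assumes the creation date, modification date, and trailer `/ID` are stored uncompressed in the file. PDFs that keep metadata in compressed streams or XMP packets would need a real PDF parser such as PDFBox.

```python
import re

# Entries that commonly differ between otherwise identical PDFs.
# Assumption: they appear as plain, uncompressed text in the file bytes.
VOLATILE = [
    re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)"),
    re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
]


def normalize(pdf_bytes: bytes) -> bytes:
    """Remove the volatile entries so they do not affect the comparison."""
    for pattern in VOLATILE:
        pdf_bytes = pattern.sub(b"", pdf_bytes)
    return pdf_bytes


def same_ignoring_metadata(a: bytes, b: bytes) -> bool:
    """Byte-for-byte comparison after stripping the volatile entries."""
    return normalize(a) == normalize(b)


# Tiny made-up fragments standing in for real files; with real files the
# bytes would come from something like pathlib.Path(path).read_bytes().
first = b"%PDF-1.4\n1 0 obj\n<< /CreationDate (D:20170101120000Z) /Title (Report) >>\nendobj\n"
second = first.replace(b"20170101120000", b"20170102093000")  # only the date differs
other = first.replace(b"Report", b"Invoice")                  # real content differs

print(same_ignoring_metadata(first, second))
print(same_ignoring_metadata(first, other))
```

This keeps the strictness of an exact comparison for everything except the listed fields, at the cost of being blind to any volatile data the patterns do not cover.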