Reading PDF documents where text can't be stripped
I've attached a document that I'm trying to parse using PDFBox. I'm not having any problems with the PDFBox setup. Text stripping works fine and I can dump pages to images if I need to (for those curious, this is the article that I'm working from: https://support.smartbear.com/articles/testcomplete/testing-pdf-files-with-testcomplete).
My problem is this: only the header and footer information on each page actually strips out as text. When I scan the page using getResources(), the only objects that come back are PDXObjectForm objects (form XObjects), not images.
Now, I can go the route of doing image comparisons and such... but I'd rather not... image comparisons are so bulky and prone to problems with pixel depth, tolerance, etc. So, what I need is some way to get to the contents of the PDF that I cannot access otherwise.
Any help from someone who has done this sort of thing before would be greatly appreciated. :)
See? Even heroes need help every once in a while. ;)
OK, folks... Here's what I ended up with. Picture.Find ended up generating too many false positives. The tolerance level ended up being too much of a pain to mess with, and finding that sweet spot was not an "exact science".
So, instead of Find, I ended up using Picture.Difference. I took the ENTIRE page out of the PDF, converted to BMP, and compared it to a baseline BMP file using a mask for those parts of the document that may vary from run to run. The result was supreme.
Note that I'm using version 1.8.6 of PDFBox and version 1.8.0_191 of the Java runtime. I've copied the code for loadDocument and convertPageToPicture from the article linked in the OP. There are a couple of assumed directories where we're storing the actual PDF file, so you may need to edit for your particular file structure. But here's the code. Hope this helps someone else.
function PDFRegionCheckPoint(sourceFileName, pageIndex, searchPictureFileName, fileMask) {
  var docObj, imagePage, imageTest, imageMask, result;
  // Temporary BMP export of the page; adjust the PDFS folder to your project layout
  var pagePath = Project.ConfigPath + '\\PDFS\\Page' + pageIndex + '.bmp';

  // loadDocument and convertPageToPicture come from the article linked in the OP
  docObj = loadDocument(sourceFileName);
  imagePage = convertPageToPicture(docObj, pageIndex, pagePath);

  // Load the baseline picture and the mask that marks the regions to compare
  imageTest = Utils.Picture;
  imageMask = Utils.Picture;
  imageTest.LoadFromFile(searchPictureFileName);
  imageMask.LoadFromFile(fileMask);

  // Difference returns null when no differences are found
  result = imagePage.Difference(imageTest, false, 0, true, 0, imageMask);
  if (result == null) {
    Log.Checkpoint('Image area found in page export', '', 300, null);
  } else {
    Log.Warning('Unable to find specified image in file',
      'SourceFile = ' + sourceFileName + ', Page number = ' + pageIndex +
      ', Picture to search = ' + searchPictureFileName,
      pmNormal, Log.CreateNewAttributes(), result);
  }

  // Clean up the temporary page export
  if (aqFileSystem.Exists(pagePath)) {
    aqFileSystem.DeleteFile(pagePath);
  }
}
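For anyone wondering what the mask buys you here, below is a rough conceptual sketch in plain JavaScript. It is not how TestComplete implements Picture.Difference. The function name maskedDifference, the flat pixel arrays, and the boolean mask (true means "compare this pixel") are all made up for illustration. The idea is simply that pixels outside the mask are skipped, so parts of the page that change from run to run cannot cause a failure.

```javascript
// Conceptual sketch only: NOT TestComplete's implementation of Picture.Difference.
// Images are modelled as flat arrays of pixel values (hypothetical representation).
// mask[i] === true means "compare this pixel"; false means "ignore it".
function maskedDifference(actual, baseline, mask) {
  if (actual.length !== baseline.length || actual.length !== mask.length) {
    throw new Error('actual, baseline and mask must have the same length');
  }
  var diffs = [];
  for (var i = 0; i < actual.length; i++) {
    // Only pixels inside the mask can count as a difference
    if (mask[i] && actual[i] !== baseline[i]) {
      diffs.push(i);
    }
  }
  // Mirror the null-when-identical convention the posted function relies on
  return diffs.length === 0 ? null : diffs;
}

var baseline = [10, 20, 30, 40];
var actual   = [10, 99, 30, 41];
var mask     = [true, false, true, true]; // index 1 varies between runs, so it is ignored

// Index 1 differs but is masked out; index 3 differs and is reported
console.log(maskedDifference(actual, baseline, mask));
```

With a real page, the masked-out region would be something like a date or a generated ID, and everything else on the page would still have to match the baseline exactly.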