Reading PDF documents where text can't be stripped

Question

I've attached a document that I'm trying to parse using PDFBox. I'm not having any problems with the PDFBox setup: text stripping works fine, and I can dump pages to images if I need to. (For those curious, this is the article I'm working from: https://support.smartbear.com/articles/testcomplete/testing-pdf-files-with-testcomplete)

My problem is this: only the header and footer information on each page actually strips out as text. When I scan the page using getResources(), the only objects that come back are PDXObjectForm objects (form XObjects), not images.

Now, I could go the image-comparison route, but I'd rather not: image comparisons are bulky and prone to problems with pixel depth, tolerance, and so on. What I need is some way to get at the contents of the PDF that I can't access otherwise. Any help from someone who has done this sort of thing before would be greatly appreciated. :)

See? Even heroes need help every once in a while. ;)

tristaanogre · Accepted Answer

OK, folks... Here's what I ended up with. Picture.Find ended up generating too many false positives. The tolerance level was too much of a pain to tune, and finding the sweet spot was not an exact science.

So, instead of Find, I ended up using Picture.Difference. I took the ENTIRE page out of the PDF, converted it to BMP, and compared it to a baseline BMP file, using a mask for those parts of the document that may vary from run to run. The results were excellent.
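
To make the masking idea concrete, here is a minimal conceptual sketch in plain JavaScript. It is not how TestComplete implements Picture.Difference: the function name maskedDifference, the flat pixel arrays, and the boolean ignore mask are all assumptions made for illustration. Like the check in the code further down, it returns null when nothing outside the masked area differs.

```javascript
// Conceptual sketch only (hypothetical helper, not a TestComplete API).
// Images are modelled as flat arrays of pixel values, and the mask as a
// flat array of booleans where true means "ignore this pixel".
function maskedDifference(actual, baseline, ignoreMask) {
    if (actual.length !== baseline.length || actual.length !== ignoreMask.length) {
        throw new Error('All three arrays must have the same length');
    }
    var differing = [];
    for (var i = 0; i < actual.length; i++) {
        // Skip regions that are expected to vary from run to run.
        if (ignoreMask[i]) {
            continue;
        }
        if (actual[i] !== baseline[i]) {
            differing.push(i);
        }
    }
    // null means "no difference found outside the masked area".
    return differing.length === 0 ? null : differing;
}
```

With a mask that covers, say, a date stamp, a change inside the stamp is ignored while a change anywhere else is still reported.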

Note that I'm using version 1.8.6 of PDFBox and version 1.8.0_191 of Java. I've copied the code for loadDocument and convertPageToPicture from the article linked in the original post. There are a couple of assumed directories where we're storing the actual PDF file, so you may need to edit the paths for your particular file structure. But here's the code. Hope this helps someone else.
function PDFRegionCheckPoint(sourceFileName, pageIndex, searchPictureFileName, fileMask){
    var docObj, imagePage, imageTest, imageMask, result, pagePath;
    // Temporary BMP export of the page. Backslashes are doubled because a single
    // backslash before a letter (e.g. '\P') is dropped in a JavaScript string literal.
    pagePath = Project.ConfigPath + '\\PDFS\\Page' + pageIndex + '.bmp';
    // loadDocument and convertPageToPicture come from the article linked in the original post.
    docObj = loadDocument(sourceFileName);
    imagePage = convertPageToPicture(docObj, pageIndex, pagePath);
    // Load the baseline image and the mask that covers the regions allowed to vary.
    imageTest = Utils.Picture;
    imageMask = Utils.Picture;
    imageTest.LoadFromFile(searchPictureFileName);
    imageMask.LoadFromFile(fileMask);
    // Difference returns null when the images match; otherwise it returns a picture of the differences.
    result = imagePage.Difference(imageTest, false, 0, true, 0, imageMask);
    if (result == null){
        Log.Checkpoint('Image area found in page export', '', 300, null);
    }
    else{
        Log.Warning('Unable to find specified image in file', 'SourceFile = ' + sourceFileName + ', Page number = ' + pageIndex + ', Picture to search = ' + searchPictureFileName, pmNormal, Log.CreateNewAttributes(), result);
    }
    // Clean up the temporary page export.
    if (aqFileSystem.Exists(pagePath)){
        aqFileSystem.DeleteFile(pagePath);
    }
}
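
The function above builds the export path by concatenating onto Project.ConfigPath. As a small sketch of one way to make that less fragile, the helper below builds the same kind of path in one place. The name pageExportPath is hypothetical (it is not part of TestComplete or PDFBox), and the PDFS folder and PageN.bmp naming are taken from the code above. It adds a separator only when the base path does not already end with one, and it writes every backslash as '\\', since a lone backslash before a letter is silently dropped in a JavaScript string literal.

```javascript
// Hypothetical helper, not part of TestComplete or PDFBox.
// Builds the temporary BMP path for a given page under <configPath>\PDFS\.
function pageExportPath(configPath, pageIndex) {
    var base = String(configPath);
    // Add a separator only if the base path does not already end with one.
    if (base.length > 0 && base.charAt(base.length - 1) !== '\\') {
        base += '\\';
    }
    return base + 'PDFS\\Page' + pageIndex + '.bmp';
}
```

In a TestComplete script this would presumably be called as pageExportPath(Project.ConfigPath, pageIndex), with the result reused for the export, the existence check, and the delete.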

tristaanogre · Answer

Probably. Unfortunately, in this case, this isn't a high priority for them to investigate. It doesn't cause issues for the end user; it just makes it trickier for me to test.

I've managed a workaround using the Picture.Find method to find subsections in the PDF.

jab4743 · Answer

I have never worked with the PDF functionality. If the PDF functionality is not 'seeing' the text, have you tried using the OCR functionality in 12.6? If the OCR in 12.6 does not work, you can take a look at the freeware Tesseract, but Tesseract takes an enormous amount of setup to get working.

tristaanogre · Answer

Unfortunately, the OCR in TC 12.60 requires additional licensing, and I don't have the authority to OK the purchase of the license. So I'm using PDFBox, as it's been a tried and true tool for TC users for some time, hoping others here may have some insight into navigating the PDFBox objects to get me what I need.

I have implemented, in the short term, an image comparison workaround but, as mentioned, I'd prefer text-to-text comparison.

jab4743 · Answer

I hate when others suggest options I can't use, but just in case rolling back to 12.5 is an option for you: the OCR in 12.5 does not require additional licensing and, I am told, will be supported again in 12.7. I use the OCR in 12.5 all the time and it seems to be stable.

alexkaras · Answer

Hi Robert,

tristaanogre wrote:

My problem is this: Only the header and footer information on each page actually strips out as text.

Might it be because the document security (File | Properties (Ctrl-D) -> Security tab) has been set to not allow page extraction?

P.S. I did a quick try and got the same result as you.
