Reading PDF documents where text can't be stripped

SOLVED
tristaanogre
Esteemed Contributor

I've attached a document that I'm trying to parse using PDFBox. I'm not having any problems with the PDFBox setup: text stripping works in general, and I can dump pages to images if I need to. (For those curious, this is the article that I'm working from: https://support.smartbear.com/articles/testcomplete/testing-pdf-files-with-testcomplete)

 

My problem is this: only the header and footer information on each page actually strips out as text. When I scan the page using getResources(), the only objects that come back are PDXObjectForm objects (form XObjects), not images.

Now, I can go the route of doing image comparisons and such... but I'd rather not... image comparisons are so bulky and prone to problems with pixel depth, tolerance, etc.  So, what I need is some way to get to the contents of the PDF that I cannot access otherwise.

Any help from someone who has done this sort of thing before would be greatly appreciated. 🙂

 

See?  Even heroes need help every once in a while. 😉


Robert Martin
[Hall of Fame]
Please consider giving a Kudo if I write good stuff
----

Why automate?  I do automated testing because there's only so much a human being can do and remain healthy.  Sleep is a requirement.  So, while people sleep, automation that I create does what I've described above in order to make sure that nothing gets past the final defense of the testing group.
I love good food, good books, good friends, and good fun.

Mysterious Gremlin Master
Vegas Thrill Rider
Extensions available
8 REPLIES
jab4743
Contributor

I have never worked with the PDF functionality. If the PDF functionality is not 'seeing' the text, have you tried using the OCR functionality in 12.6? If the OCR in 12.6 does not work, you can take a look at the freeware Tesseract, but Tesseract takes an enormous amount of setup to get working.

tristaanogre
Esteemed Contributor

Unfortunately, the OCR in TC 12.60 requires additional licensing to be able to use. I don't have the authority to OK the purchase of the license. So, I'm using PDFBox as it's been a tried and true tool for TC users for some time, hoping others here may have some insight into navigating the PDFBox objects to get me what I need.

I have implemented, in the short term, an image comparison workaround but, as mentioned, I'd prefer text-to-text comparison.



I hate when others suggest options I can't use, but just in case rolling back to 12.5 is an option for you: the OCR in 12.5 does not require additional licensing and, I am told, will be supported again in 12.7. I use the OCR in 12.5 all the time and it seems to be stable.

AlexKaras
Champion Level 3

Hi Robert,

 


@tristaanogre wrote:

 

My problem is this: Only the header and footer information on each page actually strips out as text. 


Might it be because the document security (File | Properties (Ctrl-D) -> Security tab) is set to not allow page extraction?

 

P.S. I did a quick try and got the same result as you have had.

 

Regards,
  /Alex [Community Champion]
____
[Community Champions] are not employed by SmartBear Software but
are just volunteers who have some experience with the tools by SmartBear Software
and a desire to help others. Posts made by [Community Champions]
may differ from the official policies of SmartBear Software and should be treated
as the own private opinion of their authors and under no circumstances as an
official answer from SmartBear Software.
The [Community Champion] signature is assigned on quarterly basis and is used with permission by SmartBear Software.
https://community.smartbear.com/t5/Community-Champions/About-the-Community-Champions-Program/gpm-p/252662
================================
tristaanogre
Esteemed Contributor

My developers say that they aren't applying any special security to the document...  I personally call bull-funky on that... but there doesn't appear in their code anything that explicitly blocks this.
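
One way to check the security question from the script side, rather than from the PDF viewer's properties dialog, is to ask PDFBox directly. This is only a minimal sketch: it assumes docObj is the PDDocument object returned by loadDocument from the article, and it uses PDFBox's isEncrypted(), getCurrentAccessPermission() and canExtractContent() methods. The function name describeExtractionPermissions and the stand-in object are hypothetical; the stand-in exists only so the logic can be tried outside TestComplete.

```javascript
// Hypothetical helper: reports whether a PDFBox document is encrypted and
// whether its current access permission allows content extraction.
// docObj is assumed to be a PDDocument (for example, the result of loadDocument).
function describeExtractionPermissions(docObj) {
    var permission = docObj.getCurrentAccessPermission();
    return {
        encrypted: docObj.isEncrypted(),
        canExtractContent: permission.canExtractContent()
    };
}

// Stand-in for a real PDDocument, used here only to exercise the helper.
var fakeDoc = {
    isEncrypted: function () { return true; },
    getCurrentAccessPermission: function () {
        return { canExtractContent: function () { return false; } };
    }
};

var permissionInfo = describeExtractionPermissions(fakeDoc);
```

If a real document reports that it is not encrypted and that extraction is allowed, the security setting is probably not what is hiding the text.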




@tristaanogre wrote:

but there doesn't appear in their code anything that explicitly blocks this.


It might be some (default) setting of the HTML-to-PDF converter that was used to produce the sample PDF.

 

Regards,
  /Alex [Community Champion]
================================
tristaanogre
Esteemed Contributor

Probably.  Unfortunately, in this case, this isn't a high priority for them to investigate.  It doesn't cause issues for the end user, it just makes it trickier for me to test.

 

I've managed a workaround using the Picture.Find method to find subsections in the PDF.



OK, folks... Here's what I ended up with. Picture.Find ended up generating too many false positives. The tolerance level ended up being too much of a pain to mess with and it was not an "exact science" to find that sweet spot.

 

So, instead of Find, I ended up using Picture.Difference.  I took the ENTIRE page out of the PDF, converted to BMP, and compared it to a baseline BMP file using a mask for those parts of the document that may vary from run to run.  The result was supreme.
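
To illustrate the idea of a masked comparison (not how TestComplete implements it internally), here is a small self-contained sketch. It treats each picture as a 2D array of pixel values and skips any pixel the mask marks as "do not compare". The function name maskedDifference and the true/false mask convention are assumptions made for this example only; a real mask for Picture.Difference is a picture file, so check the TestComplete documentation for which mask color means "ignore".

```javascript
// Hypothetical helper, not part of TestComplete: compares two pictures given
// as 2D arrays of pixel values. A pixel is compared only where the mask holds
// true (or when no mask is given). Returns null if every compared pixel
// matches, otherwise an array of the coordinates that differ.
function maskedDifference(actual, baseline, mask) {
    var diffs = [];
    if (actual.length !== baseline.length) {
        throw new Error('Pictures differ in height');
    }
    for (var y = 0; y < actual.length; y++) {
        if (actual[y].length !== baseline[y].length) {
            throw new Error('Pictures differ in width at row ' + y);
        }
        for (var x = 0; x < actual[y].length; x++) {
            var compared = !mask || mask[y][x];
            if (compared && actual[y][x] !== baseline[y][x]) {
                diffs.push({ x: x, y: y });
            }
        }
    }
    return diffs.length === 0 ? null : diffs;
}

// The pixel at (1, 1) varies from run to run, so the mask excludes it.
var baselinePixels = [[0, 0, 0], [0, 9, 0]];
var actualPixels   = [[0, 0, 0], [0, 5, 1]];
var comparePixels  = [[true, true, true], [true, false, true]];
var maskedResult = maskedDifference(actualPixels, baselinePixels, comparePixels);
```

Here the masked-out pixel at (1, 1) is ignored even though it differs, while the unmasked pixel at (2, 1) is reported. Returning null on a match mirrors the way the checkpoint code treats a null result from Picture.Difference as a pass.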

 

Note that I'm using version 1.8.6 of PDFBox and version 1.8.0_191 of the Java runtime. I've copied the code for loadDocument and convertPageToPicture from the article linked in the OP. The code assumes a couple of directories where we store the actual PDF file, so you may need to edit it for your particular file structure. But here's the code. Hope this helps someone else.


function PDFRegionCheckPoint(sourceFileName, pageIndex, searchPictureFileName, fileMask){
    // loadDocument and convertPageToPicture come from the article linked in the OP.
    var docObj, imagePage, imageTest, imageMask, result;
    // Temporary BMP that the page is exported to; adjust for your file structure.
    var pageFileName = Project.ConfigPath + '\\PDFS\\Page' + pageIndex + '.bmp';
    docObj = loadDocument(sourceFileName);
    imagePage = convertPageToPicture(docObj, pageIndex, pageFileName);
    // Load the baseline picture and the mask that covers the parts that vary from run to run.
    imageTest = Utils.Picture;
    imageMask = Utils.Picture;
    imageTest.LoadFromFile(searchPictureFileName);
    imageMask.LoadFromFile(fileMask);
    // Difference returns null when the pictures match; otherwise it returns a picture of the differences.
    result = imagePage.Difference(imageTest, false, 0, true, 0, imageMask );
    if (result == null){
        Log.Checkpoint('Image area found in page export', '', 300, null);
    }
    else{
        Log.Warning('Unable to find specified image in file', 'SourceFile = ' + sourceFileName + ', Page number = ' + pageIndex + ', Picture to search = ' + searchPictureFileName, pmNormal, Log.CreateNewAttributes(), result );
    }
    // Clean up the temporary page export.
    if (aqFileSystem.Exists(pageFileName)){
        aqFileSystem.DeleteFile(pageFileName);
    }
}
