Forum Discussion

Esteemed Contributor

7 years ago

Solved

Reading PDF documents where text can't be stripped

I've attached a document that I'm trying to parse using PDFBox. I'm not having any problems with the PDFBox setup. Text stripping works fine and I can dump pages to images if I need to (For those...

tristaanogre

7 years ago

OK, folks... Here's what I ended up with. Picture.Find ended up generating too many false positives. The tolerance level ended up being too much of a pain to mess with and it was not an "exact science" to find that sweet spot.

So, instead of Find, I ended up using Picture.Difference. I took the ENTIRE page out of the PDF, converted to BMP, and compared it to a baseline BMP file using a mask for those parts of the document that may vary from run to run. The result was supreme.

Note that I'm using the 1.8.6 version of PDFBox and the 1.8.0_191 version of the Java JVM client. I've copied the code from the PDFBox documentation indicated in the OP for loadDocument and convertPageToPicture. There are a couple of assumed directories where we're storing the actual PDF file so you may need to edit for your particular file structure. But here's the code. Hope this helps someone else.

function PDFRegionCheckPoint(sourceFileName, pageIndex, searchPictureFileName, fileMask){
    var docObj, pageFileName, imagePage, imageTest, imageMask, result;
    // Temporary BMP export of the page; adjust the folder to your project layout.
    pageFileName = Project.ConfigPath + '\\PDFS\\Page' + pageIndex + '.bmp';
    docObj = loadDocument(sourceFileName);
    imagePage = convertPageToPicture(docObj, pageIndex, pageFileName);
    // Load the baseline image and the mask covering the regions that vary between runs.
    imageTest = Utils.Picture;
    imageMask = Utils.Picture;
    imageTest.LoadFromFile(searchPictureFileName);
    imageMask.LoadFromFile(fileMask);
    // A null result is treated as "no differences outside the masked regions".
    result = imagePage.Difference(imageTest, false, 0, true, 0, imageMask);
    if (result == null){
        Log.Checkpoint('Image area found in page export', '', 300, null);
    }
    else{
        // Attach the difference picture to the warning for review.
        Log.Warning('Unable to find specified image in file', 'SourceFile = ' + sourceFileName + ', Page number = ' + pageIndex + ', Picture to search = ' + searchPictureFileName, pmNormal, Log.CreateNewAttributes(), result);
    }
    // Clean up the temporary page export.
    if (aqFileSystem.Exists(pageFileName)){
        aqFileSystem.DeleteFile(pageFileName);
    }
}
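
For anyone unfamiliar with masked comparison, here is a minimal sketch of the idea in plain JavaScript. It is not how TestComplete implements Picture.Difference; the function name maskedDifference, the array-of-rows image format, and the mask convention (0 = ignore the pixel, 1 = compare it) are all assumptions made for illustration.

```javascript
// Conceptual sketch only: plain arrays stand in for bitmaps.
// Each image is an array of rows and each pixel is a number.
// Assumed mask convention: 0 = ignore this pixel, 1 = compare it.
function maskedDifference(actual, baseline, mask) {
  if (actual.length !== baseline.length || actual.length !== mask.length) {
    throw new Error('Images and mask must have the same height');
  }
  var diffs = [];
  for (var y = 0; y < actual.length; y++) {
    if (actual[y].length !== baseline[y].length || actual[y].length !== mask[y].length) {
      throw new Error('Images and mask must have the same width');
    }
    for (var x = 0; x < actual[y].length; x++) {
      if (mask[y][x] === 0) {
        continue; // masked out: this region may vary from run to run
      }
      if (actual[y][x] !== baseline[y][x]) {
        diffs.push({ x: x, y: y });
      }
    }
  }
  // Mirror the checkpoint code above: null means "no differences".
  return diffs.length === 0 ? null : diffs;
}

// Hypothetical 3x2 "pages" that differ only in the centre pixel of the second row.
var sampleBaseline = [[1, 1, 1], [1, 5, 1]];
var sampleActual = [[1, 1, 1], [1, 9, 1]];
var maskIgnoringCentre = [[1, 1, 1], [1, 0, 1]];
var maskComparingAll = [[1, 1, 1], [1, 1, 1]];

var maskedResult = maskedDifference(sampleActual, sampleBaseline, maskIgnoringCentre);
var unmaskedResult = maskedDifference(sampleActual, sampleBaseline, maskComparingAll);
```

With the varying pixel masked out, maskedResult is null and the page passes; with the full mask, unmaskedResult lists the one differing pixel. That is why the mask only needs to cover the parts of the page that legitimately change between runs.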

AlexKaras

Community Hero

7 years ago

Hi Robert,

tristaanogre wrote:

My problem is this: Only the header and footer information on each page actually strips out as text.

Might it be because of the document security (File | Properties (Ctrl-D) -> Security tab) being set to not allow page extraction?

P.S. I did a quick try and got the same result as you have had.
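
One way to test the security theory from script is to ask PDFBox directly. The sketch below assumes the object returned by the loadDocument helper is a PDFBox PDDocument exposed through the Java bridge, and that isEncrypted(), getCurrentAccessPermission() and canExtractContent() can be called on it as shown. The function name describePdfSecurity is made up for this example. Note also that, as far as I know, in the 1.8.x line the reported permissions may only reflect the document's restrictions after it has been decrypted, so treat an "extraction allowed" answer on an encrypted file with caution.

```javascript
// Sketch only: docObj is assumed to be a PDFBox PDDocument (or anything
// exposing the same three methods), e.g. the result of loadDocument().
function describePdfSecurity(docObj) {
  var encrypted = docObj.isEncrypted();
  var canExtract = docObj.getCurrentAccessPermission().canExtractContent();
  return 'encrypted = ' + encrypted + ', extraction allowed = ' + canExtract;
}
```

In a TestComplete script the returned string could simply be written to the log for each PDF under test.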

tristaanogre

Esteemed Contributor

7 years ago

My developers say that they aren't applying any special security to the document... I personally call bull-funky on that... but there doesn't appear in their code anything that explicitly blocks this.

AlexKaras
Community Hero
7 years ago
tristaanogre wrote:

but there doesn't appear in their code anything that explicitly blocks this.

It might be some (default) setting of the HTML-to-PDF converter that was used to produce the sample PDF.
- tristaanogre
  Esteemed Contributor
  7 years ago
  Probably. Unfortunately, in this case, this isn't a high priority for them to investigate. It doesn't cause issues for the end user, it just makes it trickier for me to test.
  
  I've managed a workaround using the Picture.Find method to find subsections in the PDF.
