Forum Discussion
Hello,
I don't know how to use OCR. Can you please guide me on it? I am using PDFBox to retrieve the text from the PDF file. I am able to retrieve it, but I am unable to compare it because it is not correctly organized, as mentioned previously.
Kindly provide step-by-step instructions to do so.
Regards,
Nimish
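One possible reason a comparison fails even when the text was retrieved is that line breaks and spacing in the extracted text differ from the baseline. The sketch below is plain Java and assumes the PDF text has already been extracted into a string. The class name `TextCompare` and its methods are hypothetical helpers for illustration, not part of PDFBox or TestComplete, and whitespace may not be the actual cause in this case.

```java
class TextCompare {
    // Collapse every run of whitespace (including line breaks and
    // non-breaking spaces) into a single space and trim the ends.
    static String normalize(String text) {
        return text.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
    }

    // Compare two texts while ignoring layout-only differences.
    static boolean sameContent(String extracted, String baseline) {
        return normalize(extracted).equals(normalize(baseline));
    }

    public static void main(String[] args) {
        // Hypothetical sample strings standing in for extracted and expected text.
        String extracted = "Total \n due:\t250.00 ";
        String baseline = "Total due: 250.00";
        System.out.println(sameContent(extracted, baseline));
    }
}
```

This only removes layout noise. If the extracted text comes out in a different reading order than the baseline, a region-based or paragraph-based extraction would still be needed.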
- nimishbhuta, 6 years ago, Frequent Contributor
Hello,
Based on the article on OCR below, my understanding is that it works by taking a picture of the window and then validating the content inside the window. In my case, I opened the PDF file in a browser window and tried to use the OCR get text method, but it is not capturing the content.
I don't know if I am missing something here. Could you please open any PDF file in IE or Chrome and provide the actual steps for capturing it?
One more question: does OCR work by providing the PDF file name as an argument to retrieve the text?
Regards,
Nimish
- Marsha_R, 6 years ago, Champion Level 3
I suggest that you contact Support directly about this. They can help you select the best way to test your PDF. Here's the link:
- tristaanogre, 6 years ago, Esteemed Contributor
I'm looking into doing PDF testing myself right now. One thing that PDFBox offers is the ability to break up the text of the PDF document into pages and, within the pages, into paragraphs. It MIGHT be possible to find the specific information you want to validate by referencing a particular paragraph ID within the page that you're testing. Investigate, based upon the documentation for TestComplete and PDFBox, whether that will work for you.
- nimishbhuta, 6 years ago, Frequent Contributor
Hello,
Thanks for your response. I went through the documentation and tried the paragraph feature, but it requires knowing the start paragraph and end paragraph. I tried entering specific text as mentioned in the text, but no luck. I am not sure how we can find the paragraph ID. If you are working on PDFs and figure out how to obtain a paragraph, please share your code.
As another approach, I was thinking of using PDFTextStripperByArea, which lets you mark an area and retrieve the text from it. But this class does not seem to be supported by TestComplete, as it requires pdfbox 2.0 and is not available in pdfbox 1.8.12.
Here is an example:
https://www.programcreek.com/java-api-examples/?api=org.apache.pdfbox.util.PDFTextStripperByArea
Regards,
Nimish
- tristaanogre, 6 years ago, Esteemed Contributor
You can probably use PDFBox 2.0... but keep in mind that you'll need to make sure you have the proper version of the JRE and that the methods and properties available may be different from what is in that article. You can give it a try... there's no "explicit" thing in PDFBox that prevents you from using a more recent version.
As for "knowing the paragraph"... you know the PDF. You have access to your "baseline" of what the PDF is. So, just write a bit of "throw-away" code to cycle through all the paragraphs in your desired page, find the ones you want, and then use those IDs in your actual test code... that's my intent with my own project at least.
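A rough sketch of that kind of throw-away code, in plain Java, might look like the following. It assumes the page text has already been extracted into a string and that a marker string separates the paragraphs. The marker `<<PARA>>`, the class `ParagraphIndexer`, and its methods are all hypothetical names chosen for illustration, and the zero-based index printed here stands in for whatever "paragraph ID" your own extraction produces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphIndexer {
    // Hypothetical marker, assumed to have been placed between
    // paragraphs when the page text was extracted.
    static final String MARKER = "<<PARA>>";

    // Split the page text on the marker, dropping empty pieces.
    static List<String> splitParagraphs(String pageText) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : pageText.split(Pattern.quote(MARKER))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    // Return the indexes of all paragraphs that contain the given text.
    static List<Integer> findParagraphs(List<String> paragraphs, String needle) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < paragraphs.size(); i++) {
            if (paragraphs.get(i).contains(needle)) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for one extracted page.
        String pageText = "<<PARA>>Invoice 1001<<PARA>>Total due: 250.00<<PARA>>Thank you";
        List<String> paragraphs = splitParagraphs(pageText);
        // Dump every paragraph with its index so the interesting ones can be noted.
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println(i + ": " + paragraphs.get(i));
        }
        System.out.println("Matches: " + findParagraphs(paragraphs, "Total due"));
    }
}
```

Once the dump shows which indexes hold the values of interest, the real test only needs to read those paragraphs and compare them with the baseline.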