Forum Discussion
Hello Marsha,
I have tried using VBScript and was able to retrieve the text from the PDF, but the way the text is retrieved does not help me compare the text values.
Example
In the PDF file there is a heading, say Supplier, and below it there is some text related to the supplier. In the same row there is a Ship to Address heading, and below it some text related to the ship-to address. When I try to extract the text, it comes out like this:
SupplierShiptoAddress
some text of supplier + some text of shiptoaddress
some text of supplier + some text of shiptoaddress
and so on ...
Please see the attached screenshot; the blue lines indicate text (I have hidden the text for confidentiality reasons).
It is difficult to verify the supplier text or the ship-to address text because the two are combined on each line.
Ideally, I would like the output in the form Supplier: all corresponding text, and the same for ShiptoAddress. I was wondering whether exporting to Excel would give the text in a format that is easier to compare, but unfortunately I do not have an option to export to Excel from the PDF file. I also tried paragraph extraction using PDFBox, but it returns the text line by line, which does not help.
I need a way to extract the text correctly for comparison purposes. Is there any way to convert the PDF into Excel programmatically, or any other idea you can think of?
Regards,
Nimish
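One possible way to untangle the two columns described above is to work from word positions instead of plain extracted lines. The sketch below is in Python purely for illustration and assumes you can already get each word together with its x and y position on the page (PDFBox exposes text positions, and it also ships a PDFTextStripperByArea class for region-based extraction that may be worth a look). The function name `split_columns`, the `boundary_x` value, and the sample coordinates are all hypothetical.

```python
from collections import defaultdict


def split_columns(words, boundary_x, row_tolerance=2.0):
    """Split positioned words into left-column and right-column lines.

    words: iterable of (x, y, text) tuples. y is assumed to grow down the page.
    boundary_x: hypothetical x position separating the two columns.
    row_tolerance: words whose y values fall in the same band of this height
                   are treated as one visual row (simple bucketing).
    """
    rows = defaultdict(list)
    for x, y, text in words:
        rows[int(y // row_tolerance)].append((x, text))

    left_lines, right_lines = [], []
    for key in sorted(rows):              # top of the page first
        row = sorted(rows[key])           # left to right within the row
        left = " ".join(t for x, t in row if x < boundary_x)
        right = " ".join(t for x, t in row if x >= boundary_x)
        if left:
            left_lines.append(left)
        if right:
            right_lines.append(right)
    return left_lines, right_lines


# Made-up positions that mimic a Supplier / Ship to Address layout.
sample_words = [
    (50, 100, "Supplier"), (300, 100, "Ship"), (330, 100, "to"), (345, 100, "Address"),
    (50, 115, "Acme"), (80, 115, "Ltd"), (300, 115, "12"), (315, 115, "High"), (345, 115, "St"),
]
supplier_lines, ship_to_lines = split_columns(sample_words, boundary_x=200)
```

With the columns separated like this, the supplier block and the ship-to block can each be compared against their expected values on their own. The boundary would have to be measured from the actual document, and the simple row bucketing may need tuning for rows that sit close together.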
There are potential issues with OCR, but have you tried using OCR to retrieve all the text in the PDF and then checking whether the text it returns contains the data from your Excel file? The most common issue with OCR is font smoothing, so be sure to turn off font smoothing on the machine if you choose to try it. You can also take a 'picture' of a section of the PDF document and get the text from just that picture, but then you may run into problems when the resolution or the location of the section changes. If you use OCR, it is best to pull all the text from the document if possible.
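The containment check suggested above could look roughly like the following sketch. It is in Python purely for illustration, and it assumes the full document text (from OCR or from PDFBox) is already available as a string and that the expected values have been read from the Excel file into a list. The helper names `normalize` and `find_missing` are hypothetical.

```python
import re


def normalize(text):
    # Lowercase and drop all whitespace so that spacing and line-break
    # differences in the extracted text do not cause false mismatches.
    return re.sub(r"\s+", "", text).lower()


def find_missing(extracted_text, expected_values):
    """Return the expected values that do not appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [value for value in expected_values if normalize(value) not in haystack]
```

Because this only asks whether each expected value appears somewhere in the document, it does not matter that the supplier and ship-to text are merged on the same line. The trade-off is that it cannot tell which column a value came from, and removing whitespace can occasionally match a value across word boundaries.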
- nimishbhuta (Frequent Contributor), 6 years ago
Hello,
I don't know how to use OCR; can you please guide me? I am using PDFBox to retrieve the text from the PDF file. I am able to retrieve it, but I cannot compare it because it is not organized correctly, as mentioned previously.
Kindly provide step-by-step instructions.
Regards,
Nimish
- jab4743 (Contributor), 6 years ago
- nimishbhuta (Frequent Contributor), 6 years ago
Hello,
Based on the article linked above, my understanding is that OCR works by taking a picture of the window and then validating the content inside it. In my case, I opened the PDF file in a browser window and tried the OCR get-text method, but it is not capturing the content.
I don't know if I am missing something here. Could you please open any PDF file in IE or Chrome and describe the actual steps for capturing the text?
One more question: can OCR retrieve the text if I provide the PDF file name as an argument?
Regards,
Nimish