I need to verify content on the first page of a PDF document. I cannot simply extract text from the entire document: it is too large, and attempting to read all of the text ends in an error. From the TestComplete documentation on PDF to Text (here), I see that I can extract text from a certain section of the PDF, but not from certain pages. Is there a way to extract text from specific pages of a PDF file, in particular from the first page?

Since you're not looking at the whole .pdf file, would it work just as well to see if the expected file exists in a particular folder and is a particular size? https://support.smartbear.com/testcomplete/docs/reference/program-objects/aqfile/methods.html
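
The existence-and-size check suggested above could be sketched roughly as follows. This is only an illustration: `fileMatchesSize` and the stand-in `fakeFileApi` are hypothetical names, and the size limits are made-up values. The helper assumes it is handed an object offering `Exists(path)` and `GetSize(path)`, which is how `aqFile` is documented to behave, so inside TestComplete one would presumably pass `aqFile` itself.

```javascript
// Hypothetical helper: true if the file exists and its size is within bounds.
// fileApi is assumed to expose Exists(path) and GetSize(path), like aqFile.
function fileMatchesSize(fileApi, path, minBytes, maxBytes) {
  if (!fileApi.Exists(path)) {
    return false;
  }
  var size = fileApi.GetSize(path);
  return size >= minBytes && size <= maxBytes;
}

// Stand-in for aqFile so the sketch runs outside TestComplete.
// The recorded size (50 MB) is an invented example value.
var fakeFiles = { "C:\\temp\\LargeFile.pdf": 52428800 };
var fakeFileApi = {
  Exists: function (p) {
    return Object.prototype.hasOwnProperty.call(fakeFiles, p);
  },
  GetSize: function (p) {
    return fakeFiles[p];
  }
};

var looksRight = fileMatchesSize(fakeFileApi, "C:\\temp\\LargeFile.pdf", 1024, 104857600);
```

Note that this only confirms the file is present and plausibly sized; it says nothing about what is printed on the first page.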

get text from first page of pdf file with 'PDF to Text'

5 Replies

Marsha_R
Champion Level 3
3 years ago
You just picked the page as a way to cut down the text, right? There are other ways to get just a part of the text.

https://support.smartbear.com/testcomplete/docs/testing-with/working-with-external-data-sources/pdf.html

If you use the In Script Tests example, you could use substring to pick out the first 200 characters, or however many work for you.

If you use the Extract Section Contents example, you could decide which section(s) work for you and pick those.
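
A minimal sketch of the substring idea, assuming the text has already been extracted into a string (in TestComplete that string would come from `PDF.ConvertToText(path)`, so the conversion itself has to succeed first). The helper name `headContainsAll` and the sample text are hypothetical and only stand in for real document content.

```javascript
// Hypothetical helper: true if every expected string occurs within the
// first maxChars characters of the extracted text.
function headContainsAll(text, expected, maxChars) {
  var head = String(text).substring(0, maxChars);
  return expected.every(function (s) {
    return head.indexOf(s) !== -1;
  });
}

// Invented sample text standing in for the result of PDF.ConvertToText(path).
var sample = "Invoice 1001\nDate: 01/02/2023\n" + "x".repeat(500) + "Appendix";

var headOk = headContainsAll(sample, ["Invoice 1001", "01/02/2023"], 200);
var tailSeen = headContainsAll(sample, ["Appendix"], 200);
```

Here `headOk` is true because both strings sit in the first 200 characters, while `tailSeen` is false because "Appendix" only appears later in the text.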
- torus
  Frequent Contributor
  3 years ago
  I will give the section extraction a try. However, when I read the code, it seemed that both of these techniques first extract all text from the file (with 'PDF.ConvertToText(pathToPdf)'). Only after the entire PDF has been converted to text can you choose to look at a particular section. The PDF is too large, and TestComplete will fail if I try to read the entire file into a string.
  
      function GetDateValuesFromPDF() {
        // Get the path to the tested PDF file
        var path = "C:\\work\\sample.pdf";
        if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf")) {
          // Get the entire file contents
          var contents = PDF.ConvertToText(path);
          if (contents != "") {
            // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
            var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;
            // Post all the date values that match the specified pattern
            // to the test log
            var r = contents.match(regEx);
            if (r != null && r.length > 0) {
              for (var i = 0; i < r.length; i++)
                Log.Message(r[i]);
            }
          }
        }
      }
  - torus
    Frequent Contributor
    3 years ago
    Yes, I just verified that the code below fails at the line "LastResult = PDF.ConvertToText(path);"
    
        function TestExtractTextFromLargePdfFile() {
          var Var1, LastResult, LastResult1;
          Var1 = "";
          // Extracts plain text from a *.pdf document.
          var path = "C:\\temp\\LargeFile.pdf";
          LastResult = PDF.ConvertToText(path);
          Var1 = LastResult;
          // Writes the specified string to a text file.
          LastResult1 = aqFile.WriteToTextFile("C:\\Users\\chk\\Documents\\TestCompleteOutputFile.txt", Var1, aqFile.ctUTF8);
        }
    The error message is: "The OCR service failed. Request Entity Too large."
