
Get text from the first page of a PDF file with 'PDF to Text'



I need to verify content on the first page of a PDF document. I cannot simply extract text from the entire document: it is too large, and attempting to read all of the text ends in an error.


From the TC documentation on PDF to Text (here), I see I can extract text from a certain section of the PDF, but not from certain pages. Is there a way to extract text from specific pages, in particular from the first page of a PDF file?

Champion Level 3

You just picked the page as a way to cut down the text, right? There are other ways to get just a part of the text.


If you use the In Script Tests example, then you could use substring to pick out the first 200 characters or whatever works for you.


If you use the Extract Section Contents example, then you could decide which section(s) work for you and pick those.
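The substring suggestion can be sketched roughly as follows. This is only an illustration: `firstChars` is a hypothetical helper, and the sample string stands in for whatever PDF.ConvertToText would return. Note that this only trims the text after the conversion has succeeded; it does not reduce how much of the file gets converted.

```javascript
// Hypothetical helper (not part of TestComplete): keeps only the first
// maxChars characters of text that has already been extracted.
function firstChars(text, maxChars) {
  if (typeof text !== "string") {
    return "";
  }
  return text.substring(0, maxChars);
}

// Assumed sample text standing in for the result of PDF.ConvertToText(path).
var contents = "User Guide Version 2.1 Released 03/15/2021 Chapter 1 ...";
var head = firstChars(contents, 200);
```

The length of 200 is just the figure mentioned above; any cutoff that reliably covers the part you want to check would do.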

I will give the section extraction a try. However, when I was reading the code, it seemed like both of these techniques first extract all of the text from the file (with 'PDF.ConvertToText(path)'). Only after the entire PDF has been converted to text can you choose to look at a particular section. The PDF is too large, and TestComplete will fail if I try to read the entire file into a string.



function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf"))
  {
    // Get the entire file contents
    var contents = PDF.ConvertToText(path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;

      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents.match(regEx);
      if (r != null && r.length > 0)
      {
        for (var i = 0; i < r.length; i++)
        {
          Log.Message(r[i]);
        }
      }
    }
  }
}


Yes, I just verified that the code below fails at the line "LastResult = PDF.ConvertToText(path);".


function TestExtractTextFromLargePdfFile()
{
  var Var1, LastResult, LastResult1;
  Var1 = "";
  var path = "C:\\temp\\LargeFile.pdf";
  // Extracts plain text from a *.pdf document (this is the call that fails).
  LastResult = PDF.ConvertToText(path);
  Var1 = LastResult;
  // Writes the specified string to a text file.
  LastResult1 = aqFile.WriteToTextFile("C:\\Users\\chk\\Documents\\TestCompleteOutputFile.txt", Var1, aqFile.ctUTF8);
}

The error message is: "The OCR service failed. Request Entity Too large."

Champion Level 3

Since you're not looking at the whole .pdf file, would it work just as well to see if the expected file exists in a particular folder and is a particular size?

Well, I really need to get the version and the date from the first page to make sure we are pointing to the correct version of the PDF document. I was using OCR, which was working pretty well until (I'm pretty sure) our security team put some software on our machines that seems to be blocking OCR for *.pdf files viewed in a browser. I'll work more on trying to get the OCR to work again and report back. Thanks so much @Marsha_R for the guidance. I mainly wanted to know if there was a way to extract text from a single page of a PDF with the TC PDF tool, and it looks like probably not. Thanks again for the answer.
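For the version and date check itself, once the first-page text is available by whatever route (OCR or otherwise), the parsing step might look something like this. It is a sketch under assumptions: `parseVersionAndDate` and the sample text are made up for illustration, and the "Version x.y" wording is a guess at how the page is laid out. The date pattern is the mm/dd/yy or mm/dd/yyyy one from the sample earlier in the thread.

```javascript
// Hypothetical helper: pulls a version number and a date out of page text.
// Assumes the page contains something like "Version 3.2.1" and a date in
// mm/dd/yy or mm/dd/yyyy form; adjust both patterns to the real document.
function parseVersionAndDate(pageText) {
  var versionMatch = /version\s*:?\s*(\d+(?:\.\d+)*)/i.exec(pageText);
  var dateMatch = /\d{1,2}\/\d{1,2}\/\d{2,4}/.exec(pageText);
  return {
    version: versionMatch ? versionMatch[1] : null,
    date: dateMatch ? dateMatch[0] : null
  };
}

// Assumed sample text standing in for the recognized first page.
var firstPage = "Product Manual\nVersion 3.2.1\nIssued 04/07/2023";
var info = parseVersionAndDate(firstPage);
```

Returning null for a missing field, rather than throwing, lets the test decide whether a missing version or date should be logged as an error or a warning.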
