Forum Discussion

torus
Contributor
2 years ago

Get text from the first page of a PDF file with 'PDF to Text'

I need to verify content on the first page of a PDF document. I cannot simply extract text from the entire document: it is too large, and attempting to read all of the text ends in an error.

 

From the TestComplete (TC) documentation on PDF to Text (here), I see that I can extract text from a certain section of the PDF, but not from certain pages. Is there a way to extract text from specific pages of a PDF file, for example from just the first page?

  • You just picked the page as a way to cut down the text, right? There are other ways to get just a part of the text.

     

    https://support.smartbear.com/testcomplete/docs/testing-with/working-with-external-data-sources/pdf.html

     

    If you use the In Script Tests example, then you could use substring to pick out the first 200 characters, or however many work for you.

     

    If you use the Extract Section Contents example, then you could decide which section(s) work for you and pick those.
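
A minimal sketch of the substring idea, written as plain JavaScript. The helper names takeLeadingText and leadingTextContains are hypothetical (they are not part of TestComplete), and the sample string only stands in for whatever PDF.ConvertToText(path) would return. Note that this trims the text after conversion has already happened; it does not limit how much of the PDF gets converted.

```javascript
// Hypothetical helper (not a TestComplete API): keep only the leading part
// of text that has already been extracted from the PDF.
function takeLeadingText(contents, maxChars) {
  if (typeof contents !== "string" || maxChars <= 0) {
    return "";
  }
  return contents.substring(0, maxChars);
}

// Hypothetical helper: true only if every expected phrase occurs
// within the first maxChars characters of the extracted text.
function leadingTextContains(contents, maxChars, expectedPhrases) {
  var head = takeLeadingText(contents, maxChars);
  return expectedPhrases.every(function (phrase) {
    return head.indexOf(phrase) !== -1;
  });
}

// Stand-in for the string returned by PDF.ConvertToText(path) in TestComplete:
// two short lines followed by 1000 filler characters.
var sample = "Invoice 2041\nDate: 03/15/2021\n" + new Array(1001).join("x");
```

In a TestComplete script, the result of leadingTextContains(contents, 200, [...]) could then be reported with Log.Message or Log.Error. The 200-character cutoff is only the figure suggested above; the right value depends on how much text the first page holds.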

    • torus
      Contributor

      I will give the section extraction a try. However, when I read the code, it seemed that both of these techniques first extract all text from the file (with PDF.ConvertToText(path)). Only after the entire PDF has been converted to text can you choose to look at a particular section. The PDF is too large, and TestComplete fails if I try to read the entire file into a string.

       

       

      function GetDateValuesFromPDF()
      {
        // Get the path to the tested PDF file
        var path = "C:\\work\\sample.pdf";
        if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf"))
        {
          // Get the entire file contents
          var contents = PDF.ConvertToText(path);
          if (contents != "")
          {
            // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
            var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;
      
            // Post all the date values that match the specified pattern
            // to the test log
            var r = contents.match(regEx);
            if (r != null && r.length > 0)
            {
              for (var i = 0; i < r.length; i++)
                Log.Message(r[i]); // log each match, not only the first one
            }
          }
        }
      }

       

      • torus
        Contributor

        Yes, I just verified that the code below fails at the line "LastResult = PDF.ConvertToText(path);".

         

        function TestExtractTextFromLargePdfFile()
        {
          var Var1, LastResult, LastResult1;
          Var1 = "";
          //Extracts plain text from a *.pdf document.
          var path = "C:\\temp\\LargeFile.pdf";
          LastResult = PDF.ConvertToText(path);
          Var1 = LastResult;
          //Writes the specified string to a text file. 
          LastResult1 = aqFile.WriteToTextFile("C:\\Users\\chk\\Documents\\TestCompleteOutputFile.txt", Var1, aqFile.ctUTF8);
        }

        The error message is: "The OCR service failed. Request Entity Too large."