get text from first page of pdf file with 'PDF to Text'

Question

I need to verify content on the first page of a PDF document. I cannot simply extract text from the entire document: it is too large, and attempting to read all of the text ends in an error.

From the TC documentation on PDF to Text (here), I see I can extract text from a certain section of the PDF, but not from certain pages. Is there a way to extract text from specific pages of a PDF file, in particular from the first page?

marsha_r · Accepted Answer

Since you're not looking at the whole .pdf file, would it work just as well to see if the expected file exists in a particular folder and is a particular size?

https://support.smartbear.com/testcomplete/docs/reference/program-objects/aqfile/methods.html
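
A minimal sketch of that check, assuming the `Exists` and `GetSize` methods of `aqFile` behave as described on the page linked above. The function name `fileHasExpectedSize` and the `expectedSize` parameter are hypothetical; the file object is passed in as an argument only so the logic can be tried outside TestComplete with a stub.

```javascript
// Sketch: confirm a file exists and has the expected size in bytes.
// `fileApi` is meant to be TestComplete's aqFile object (Exists, GetSize).
// `expectedSize` is a value you would record yourself from a known-good file.
function fileHasExpectedSize(fileApi, path, expectedSize) {
  if (!fileApi.Exists(path)) {
    return false;
  }
  return Number(fileApi.GetSize(path)) === expectedSize;
}

// Intended use inside a TestComplete script:
//   if (!fileHasExpectedSize(aqFile, path, expectedSize))
//     Log.Error("Unexpected or missing PDF: " + path);
```

This only shows that the file is the one you expect by size; it says nothing about the text on the first page.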

marsha_r · Answer

You just picked the page as a way to cut down the text, right? There are other ways to get just a part of the text.

https://support.smartbear.com/testcomplete/docs/testing-with/working-with-external-data-sources/pdf.html

If you use the In Script Tests example, then you could use substring to pick out the first 200 characters or whatever works for you.

If you use the Extract Section Contents example, then you could decide which section(s) work for you and pick those.
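
The substring idea could look roughly like the sketch below. The helper name `firstChars` is hypothetical and not part of TestComplete. Note that the text still has to be extracted from the whole document before it can be trimmed, so this assumes the conversion itself succeeds.

```javascript
// Hypothetical helper (not part of TestComplete): keep only the first
// `limit` characters of text that has already been extracted.
function firstChars(contents, limit) {
  if (typeof contents !== "string" || limit <= 0) {
    return "";
  }
  return contents.substring(0, limit);
}

// Intended use inside a TestComplete script:
//   var head = firstChars(PDF.ConvertToText(path), 200);
//   Log.Message(head);
```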

torus · Answer

I will give the section extraction a try. However, when I was reading the code, it seemed like both of these techniques first extract all text from the file (with PDF.ConvertToText(pathToPdf)). Only after the entire PDF has been converted to text can you choose to look at a particular section. The PDF is too large, and TestComplete will fail if I try to read the entire file into a string.

function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file (backslashes must be escaped)
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf"))
  {
    // Get the entire file contents
    var contents = PDF.ConvertToText(path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;

      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents.match(regEx);
      if (r != null && r.length > 0)
      {
        for (var i = 0; i < r.length; i++)
          Log.Message(r[i]);
      }
    }
  }
}

torus · Answer

Yes, I just verified that the code below fails at the line "LastResult = PDF.ConvertToText(path);".

function TestExtractTextFromLargePdfFile()
{
  var Var1, LastResult, LastResult1;
  Var1 = "";
  // Extracts plain text from a *.pdf document.
  var path = "C:\\temp\\LargeFile.pdf";
  LastResult = PDF.ConvertToText(path);
  Var1 = LastResult;
  // Writes the specified string to a text file.
  LastResult1 = aqFile.WriteToTextFile("C:\\Users\\chk\\Documents\\TestCompleteOutputFile.txt", Var1, aqFile.ctUTF8);
}

The error message is: "The OCR service failed. Request Entity Too large."

torus · Answer

Well, I really need to get the version and the date from the first page to make sure we are pointing to the correct version of the PDF document. I was using OCR, which was working pretty well until (I'm pretty sure) our security team put some software on our machines that seems to be blocking OCR for *.pdf files viewed in a browser. I'll work more on trying to get the OCR to work again and report back. Thanks so much, Marsha_R, for the guidance. I mainly wanted to know if there was a way to extract text from a single page of a PDF with the TC PDF tool, and it looks like probably not. Thanks again for the answer.
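
Once the first-page text is available by whatever means (OCR or a trimmed extraction), picking out the version and date could be sketched as below. The function name `findVersionAndDate` is hypothetical, and the "Version x.y" label format is an assumption, since the thread does not show how the document prints its version. The date pattern is the mm/dd/yy or mm/dd/yyyy shape used earlier in the thread.

```javascript
// Sketch: pull a version string and the first date out of a chunk of text.
// Assumption: the version appears as "Version 1.2", "version: 1.2.3", or similar.
function findVersionAndDate(text) {
  var versionMatch = /version\s*[:#]?\s*(\d+(?:\.\d+)*)/i.exec(text);
  // Date shape: mm/dd/yy or mm/dd/yyyy
  var dateMatch = /\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/.exec(text);
  return {
    version: versionMatch ? versionMatch[1] : null,
    date: dateMatch ? dateMatch[0] : null
  };
}

// Intended use inside a TestComplete script:
//   var info = findVersionAndDate(firstPageText);
//   Log.Message("Version: " + info.version + ", date: " + info.date);
```

Either field comes back as null when nothing matches, so a test can fail with a clear message instead of comparing against an empty string.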
