get text from first page of pdf file with 'PDF to Text'
I need to verify content on the first page of a PDF document. I cannot simply extract text from the entire document: the document is too large, and attempting to read all of the text ends in an error.
From the TC documentation on PDF to Text, I see I can extract text from a certain section of the PDF, but not from certain pages. Is there a way to extract text from specific pages of a PDF file, in particular from the first page?
Labels: Desktop Testing, Test Creation
You just picked the page as a way to cut down the text, right? There are other ways to get just a part of the text.
If you use the In Script Tests example, then you could use substring to pick out the first 200 characters or whatever works for you.
If you use the Extract Section Contents example, then you could decide which section(s) work for you and pick those.
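A minimal sketch of the substring idea, assuming the text has already been extracted into a string (for example by PDF.ConvertToText, which is not called here). The helper names `headOfText` and `headContains` and the 200-character default are illustrative and not part of TestComplete.

```javascript
// Hypothetical helper: returns the first maxChars characters of
// text that has already been extracted from the PDF.
function headOfText(text, maxChars = 200) {
  if (typeof text !== "string") return "";
  return text.substring(0, maxChars);
}

// Hypothetical check: true if the expected string appears within
// the first maxChars characters of the extracted text.
function headContains(text, expected, maxChars = 200) {
  return headOfText(text, maxChars).includes(expected);
}
```

In a TestComplete script this might be called as `headContains(PDF.ConvertToText(path), "Version 2.1")`, where the expected string is a placeholder. Note that the whole document is still converted before the substring is taken.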
I will give the section extraction a try. However, when I read the code, it seemed that both of these techniques first extract all text from the file (with PDF.ConvertToText(pathToPdf)). Only after the entire PDF has been converted to text can you choose to look at a particular section. The PDF is too large, and TestComplete will fail if I try to read the entire file into a string.
function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf"))
  {
    // Get the entire file contents
    var contents = PDF.ConvertToText(path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;
      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents.match(regEx);
      if (r != null && r.length > 0)
      {
        // Log each match, not just the first one
        for (var i = 0; i < r.length; i++)
          Log.Message(r[i]);
      }
    }
  }
}
Yes, I just verified that the code below fails at the line "LastResult = PDF.ConvertToText(path);"
function TestExtractTextFromLargePdfFile()
{
  var Var1, LastResult, LastResult1;
  Var1 = "";
  // Extracts plain text from a *.pdf document.
  var path = "C:\\temp\\LargeFile.pdf";
  LastResult = PDF.ConvertToText(path);
  Var1 = LastResult;
  // Writes the specified string to a text file.
  LastResult1 = aqFile.WriteToTextFile("C:\\Users\\chk\\Documents\\TestCompleteOutputFile.txt", Var1, aqFile.ctUTF8);
}
The error message is: "The OCR service failed. Request Entity Too large."
Since you're not looking at the whole .pdf file, would it work just as well to see if the expected file exists in a particular folder and is a particular size?
https://support.smartbear.com/testcomplete/docs/reference/program-objects/aqfile/methods.html
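The existence-and-size check could be sketched as below. `Exists` and `GetSize` are aqFile methods from the reference linked above; the function name, the injected `fileApi` parameter and the expected size are assumptions for illustration. Passing the file object in as a parameter keeps the logic runnable outside TestComplete.

```javascript
// fileApi is assumed to expose Exists(path) and GetSize(path),
// as the aqFile object does in TestComplete.
// Returns true only if the file exists and has exactly the expected size.
function isExpectedFile(fileApi, path, expectedSizeInBytes) {
  if (!fileApi.Exists(path)) return false;
  // Number() normalizes whatever numeric type the host returns
  return Number(fileApi.GetSize(path)) === expectedSizeInBytes;
}
```

Inside TestComplete the call might look like `isExpectedFile(aqFile, "C:\\temp\\LargeFile.pdf", 1048576)`, where the byte count is a placeholder for the known size of the correct document.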
Well, I really need to get the version and the date from the first page to make sure we are pointing to the correct version of the PDF document. I was using OCR, which was working pretty well until (I'm pretty sure) our security team put some software on our machines that seems to be blocking OCR for *.pdf files viewed in a browser. I'll work more on trying to get the OCR to work again and report back. Thanks so much @Marsha_R for the guidance. I mainly wanted to know if there was a way to extract text from a single page of a PDF with the TC PDF tool, and it looks like probably not. Thanks again for the answer.
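Once first-page text is available from any source (OCR in this case), the version and date check could look like the sketch below. The date pattern is the mm/dd/yy or mm/dd/yyyy expression from the sample earlier in the thread. The `Version x.y` pattern and the function name are assumptions, since the thread does not show how the version appears on the page.

```javascript
// Hypothetical helper: pulls a version number and the first date
// out of text recognized on the first page.
function readVersionAndDate(pageText) {
  // Assumed layout: the word "Version", an optional colon, then dotted digits
  const versionMatch = /version\s*:?\s*(\d+(?:\.\d+)*)/i.exec(pageText);
  // Date pattern from the thread: mm/dd/yy or mm/dd/yyyy
  const dateMatch = /\d{1,2}\/\d{1,2}\/\d{2,4}/.exec(pageText);
  return {
    version: versionMatch ? versionMatch[1] : null,
    date: dateMatch ? dateMatch[0] : null
  };
}
```

The returned `version` and `date` can then be compared with the expected values; either field is `null` when nothing matching is found, which is worth logging as a failure rather than skipping.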
