
How to compare contents of PDF & MSWORD using TestComplete

kathir_43
Contributor

We are planning to compare the contents of a PDF and an MS Word document using TestComplete. I am able to read the contents of both the PDF and the Word document, and I need to compare them. The Word document acts as a template, and the PDF contains the actual values. Can someone share a logical way of performing this comparison? Please note that we are not using OCR in TestComplete.

mattb
Staff

Hi,

Without OCR this will be more difficult. I have example scripts that use OCR to compare PDFs and mask data as well.


This is just an idea:

We would probably need to convert the PDF into something else, like XML. Once the content is in XML or another file format, we could run a file comparison. I think it's easiest done in XML, since parsing that file is relatively easy.
https://support.smartbear.com/testcomplete/docs/testing-with/working-with-external-data-sources/xml....

 

I am able to extract the contents of the Word and PDF files to separate text files. The Word document is a kind of template that defines the outline of how the PDF should be generated. A sample is given below.

Word doc:

<firstname>,<LastName>

<ID>,<organisation>

<salary>,<place>

 

Dear <firstname>,

you are working in the department of <organisation> and we are really honored to have you here. Expecting many more successful years of service from you.

Thanks,

 

Actual PDF:

John,Kennedy

234,google

USD1245,CA

Dear John,

you are working in the department of google and we are really honored to have you here. Expecting many more successful years of service from you.

Thanks,

Can someone help with the comparison logic to validate that both the static and the dynamic content are generated as expected?

Hi,

What language are you using for scripting? 

JavaScript

Any suggestions?

Hi,

We have native methods to compare files, so I think that part is easy. The harder part will be masking the dynamic strings. What I have done in the past is remove the data that matches a certain pattern, resave the file, and then compare. An example in Python that masks dates is provided below:

import re

def ComparePDF(path1, path2):
    # Make sure both parameters are valid paths to existing PDF files
    if (path1 != "" and aqFile.Exists(path1) and aqFileSystem.GetFileExtension(path1) == "pdf" and
            path2 != "" and aqFile.Exists(path2) and aqFileSystem.GetFileExtension(path2) == "pdf"):
        # Get the text contents of the PDF files
        str1 = PDF.ConvertToText(path1)
        str2 = PDF.ConvertToText(path2)

        # Regular expression that matches date stamps such as 1/31/2020
        regEx = r"\d{1,2}/\d{1,2}/\d{4}"

        # Use re.sub to replace every date with a constant string
        str1 = re.sub(regEx, "<ignore_date>", str1)
        str2 = re.sub(regEx, "<ignore_date>", str2)

        # Log the masked text to show that the filtering worked for both PDFs
        Log.Message(str1)
        Log.Message(str2)

        # Compare the resulting contents
        if str1 == str2:
            Log.Message("The text contents of specified PDF files are the same")
        else:
            Log.Message("The text contents are different")
    else:
        # Report invalid input instead of silently doing nothing
        Log.Error("Both parameters must be paths to existing PDF files")
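
For the template case described earlier in the thread, one possible alternative to masking is to turn the Word template text into a regular expression and match the PDF text against it. The sketch below is only an illustration, not a TestComplete feature: the helper names (normalize, template_to_regex, match_template) are made up for this example, and it assumes that placeholders look like <name>, that dynamic values contain no commas or line breaks, and that both documents have already been extracted to plain strings (for example with PDF.ConvertToText for the PDF). It is written in Python to match the example above; the same idea should port to JavaScript.

```python
import re

# Assumption: template placeholders look like <firstname>, <ID>, <salary>.
PLACEHOLDER = re.compile(r"<([A-Za-z_]\w*)>")


def normalize(text):
    """Collapse whitespace inside each line and drop empty lines, so that
    layout differences between Word and PDF extraction are ignored."""
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def template_to_regex(template_text):
    """Static text is matched literally; each placeholder becomes a named
    group, and a repeated placeholder must repeat the same value."""
    parts = []
    seen = set()
    pos = 0
    for m in PLACEHOLDER.finditer(template_text):
        # Escape the static text that precedes this placeholder.
        parts.append(re.escape(template_text[pos:m.start()]))
        name = m.group(1)
        if name in seen:
            # Backreference: later occurrences must equal the first value.
            parts.append("(?P=%s)" % name)
        else:
            # First occurrence: any non-empty run without commas or newlines.
            parts.append("(?P<%s>[^,\\n]+?)" % name)
            seen.add(name)
        pos = m.end()
    # Escape the static text after the last placeholder.
    parts.append(re.escape(template_text[pos:]))
    return re.compile("".join(parts))


def match_template(template_text, actual_text):
    """Return a dict of placeholder values if the actual text fits the
    template, otherwise None."""
    pattern = template_to_regex(normalize(template_text))
    m = pattern.fullmatch(normalize(actual_text))
    return m.groupdict() if m else None
```

Called with the extracted Word text as template_text and the extracted PDF text as actual_text, match_template returns None when any static text differs or when a repeated placeholder such as <firstname> has inconsistent values. Otherwise it returns the captured dynamic values, which can then be checked against the expected data and logged with Log.Message.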

 
