Forum Discussion

leahey
Occasional Contributor
11 years ago

Get actual page source, including comments?

Greetings, 



What's the best way to access the entirety of a web page's source, including non-HTML content prior to the <html> tag?



We're embedding some comments in our page source that our testers need to access for conditional page validation. What's the best way to get at those comments?

Everything I've seen only loads the document source post-parse, which leaves out the comments, obviously. 



e.g.:



<!-- MeaningfulComment = True -->
<!DOCTYPE html>
<html>
  <head>
etc.






  • HKosova
    SmartBear Alumni (Retired)


    Hi Robert,



    According to this answer on StackOverflow, you can use document.firstChild to access the comment before DOCTYPE.
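
    For example, a minimal sketch of checking that first node from a script (assuming the browser actually exposes the leading comment as the document's first child; the function name and the Page("*") lookup are only placeholders):

    function CheckLeadingComment()
    {
      var doc = Sys.Browser("*").Page("*").contentDocument;

      // nodeType 8 is a comment node; nodeValue holds the comment text
      var firstNode = doc.firstChild;
      if (firstNode != null && firstNode.nodeType == 8)
        Log.Message("Leading comment: " + firstNode.nodeValue);
      else
        Log.Warning("The first node is not a comment in this browser");
    }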



    If you need the entire page source, you could use an XMLHTTP GET request:

    function Test()
    {
      var strURL = "http://www.example.com";

      // Request the raw page source over HTTP instead of reading the parsed DOM
      var objHTTP = new ActiveXObject("MSXML2.XMLHTTP");
      objHTTP.open("GET", strURL, false); // false = synchronous request
      objHTTP.send();

      // The synchronous request normally completes before this loop runs;
      // it is only a safety net in case the response is not ready yet
      while ((objHTTP.readyState != 4) && (objHTTP.readyState != 'complete'))
      {
        Delay(100);
      }

      // responseText holds the unparsed source, including anything before <html>
      Log.Message("See Additional Info", objHTTP.responseText);
    }


  • You can also try this:

    ' Example URL (placeholder) - point it at the page under test
    URL = "http://www.example.com"

    ' Fetch the raw page source over HTTP
    Set http = CreateObject("Microsoft.XmlHttp")
    http.open "GET", URL, False
    http.send ""

    Log.Message("Response Text is: " & http.responseText)

  • TanyaYatskovska
    SmartBear Alumni (Retired)

    Hi Robert,


     


    The following steps came to my mind:


    1. Obtaining html source of the page via:


    var html = browser.Page("*").contentDocument.documentElement.innerHTML


    2. Creating a regular expression to parse html to get only comments. Something like this:




    function GetComments()
    {
      var regEx, Matches;
      var InStr = "<!--Comment-->"; // sample input; in practice, pass in the html from step 1

      // Set a regular expression pattern that matches a whole <!-- ... --> block
      regEx = /<!--([\s\S]*?)-->/g;

      // Perform the search operation
      Matches = InStr.match(regEx);

      // Iterate through the Matches array and log each comment found
      if (Matches != null)
      {
        for (var i = 0; i < Matches.length; i++)
        {
          Log.Message(Matches[i]);
        }
      }
    }
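
    Putting the two steps together might look something like this (a sketch only; the Sys.Browser("*").Page("*") lookup is illustrative, and, as noted further down the thread, innerHTML only exposes comments inside the <html> element):

    function LogPageComments()
    {
      // Step 1: grab the parsed html of the page
      var html = Sys.Browser("*").Page("*").contentDocument.documentElement.innerHTML;

      // Step 2: pull out every <!-- ... --> block with the same pattern as above
      var matches = html.match(/<!--([\s\S]*?)-->/g);
      if (matches != null)
      {
        for (var i = 0; i < matches.length; i++)
          Log.Message(matches[i]);
      }
      else
        Log.Message("No comments found in the parsed html");
    }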




     


     


    Also, you can try using the EvaluateXPath method, which can parse the html based on an XPath expression. However, I cannot tell you for sure if it works with comments.
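
    If you want to experiment with that, something along these lines could be tried (a sketch only - whether EvaluateXPath handles comment() nodes at all is exactly the open question, and the array conversion and textContent access are assumptions):

    function TryXPathComments()
    {
      var page = Sys.Browser("*").Page("*");

      // "//comment()" selects comment nodes in standard XPath;
      // EvaluateXPath may or may not support it
      var result = page.EvaluateXPath("//comment()");
      if (result != null)
      {
        var nodes = (new VBArray(result)).toArray(); // EvaluateXPath returns a variant array in JScript
        for (var i = 0; i < nodes.length; i++)
          Log.Message(nodes[i].textContent);
      }
      else
        Log.Message("EvaluateXPath did not return any comment nodes");
    }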


     

  • leahey
    Occasional Contributor
    Hi Tanya, thanks for the reply.



    Unfortunately, browser.Page("*").contentDocument.documentElement.innerHTML returns only the page source inside the <html> tags; it does not include the comments that precede the <html> tag.



    Similarly, outerHTML only adds the <html> tags to what innerHTML returns.



    As a workaround, I'm actually creating a UI script to open the page source in the browser, and then do my validation against that result, but that's crazy. There has to be a more proper way to get to the full, pre-parse page source...
  • TanyaYatskovska
    SmartBear Alumni (Retired)

    Hi Robert,


     


    Last time, I missed the fact that the comments are located outside the html tag in your case.


    I don't see any other option except saving the web page as html or txt and reading the first lines of the file.


    Saving the page should be possible via Page("*").Keys("^S").
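
    A rough sketch of that idea (the saved-file path below is a placeholder, and completing the browser's Save As dialog after Ctrl+S is browser-specific and not shown):

    function CheckSavedSource()
    {
      // Trigger the browser's "Save page" dialog
      Sys.Browser("*").Page("*").Keys("^S");

      // ... handle the Save As dialog here, saving to the placeholder path below ...

      // Read the saved file and look for the comment that precedes <html>
      var source = aqFile.ReadWholeTextFile("C:\\Temp\\page.html", aqFile.ctUTF8);
      if (source.indexOf("MeaningfulComment = True") != -1)
        Log.Message("Found the leading comment in the saved source");
      else
        Log.Warning("Leading comment not found in the saved source");
    }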