
Having hard time understanding the relational structure of dom

shahin

I've written a macro to parse certain paragraphs inside a "p" tag, which in turn sits within an "article" container. If I only want to parse the paragraph, the code below does the job. However, I'm slightly confused about the relational structure of the DOM. As far as I knew (considering the HTML snippet below), if the "article" tag is the parent, it has 6 children: [div, h1, div, div, div, div], in that order. When I try to execute the commented-out portion of the code (which was written using that relational structure), it no longer works. What am I missing here? Thanks in advance.

The code:
Code:
Sub demo()

    Dim http As New InternetExplorer, html As HTMLDocument
    Dim r As Long, elem As Object

    With http
        .Visible = False
        .navigate "http://themakeupblogger.com/makeup/"
        Do While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Loop
        Set html = .document
    End With

'    Set elem = html.getElementsByTagName("article")(0).LastChild.PreviousSibling.getElementsByTagName("p")
'    [A1].Value = elem(0).innerText      '''It doesn't work
  
    Set elem = html.getElementsByTagName("article")(0).getElementsByTagName("div")(3).getElementsByTagName("p")
    [A1].Value = elem(0).innerText       '''It works

End Sub


Html elements (FYC):

Code:
<article id="entry-13600" class="entry  post-13600 post type-post status-publish format-standard has-post-thumbnail hentry category-makeup tag-memorial-day tag-summer-beauty">
  
    <div class="entry-meta entry-meta-top">
        <p class="entry-categories">
        <a href="http://themakeupblogger.com/makeup/" rel="category tag">Makeup</a>         </p>
    </div>
  
    <h1 class="entry-title">
        <a href="http://themakeupblogger.com/10-memorial-day-beauty-essentials-belong-beach-bag/">10 Memorial Day Beauty Essentials That Belong In Your Beach Bag</a>
    </h1>

    <div class="entry-meta entry-meta-bottom">
        <time class="entry-date" datetime="2017-05-25T10:56:16+00:00">May 25, 2017</time>
        <a href="http://themakeupblogger.com/10-memorial-day-beauty-essentials-belong-beach-bag/#comments" class="entry-comments-no">3 Comments</a>
    </div>
  
    <div class="entry-featured">
        <a href="http://themakeupblogger.com/10-memorial-day-beauty-essentials-belong-beach-bag/">
                    <img width="665" height="442" src="https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?fit=665%2C442" class="attachment-large size-large wp-post-image" alt="" srcset="https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?w=1150 1150w, https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?resize=400%2C266 400w, https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?resize=903%2C600 903w, https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?resize=665%2C442 665w, https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?resize=391%2C260 391w" sizes="(max-width: 665px) 100vw, 665px" data-attachment-id="13604" data-permalink="http://themakeupblogger.com/10-memorial-day-beauty-essentials-belong-beach-bag/beach/" data-orig-file="https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?fit=1150%2C764" data-orig-size="1150,764" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}" data-image-title="beach" data-image-description="" data-medium-file="https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?fit=400%2C266" data-large-file="https://i1.wp.com/themakeupblogger.com//nas/content/live/kimdo/images/2017/05/beach.jpg?fit=665%2C442">             </a>
    </div>
          
    <div class="entry-content">
        <p>Me Oh My Girl Halcyon beach days are here again! Even though the energy of summertime is about being wild and free and comfortable in your skin, it doesn’t necessarily mean you have to ditch makeup and hair altogether. But paring down and perfecting your seaside-beauty routine means less distractions from what really matters: soaking […]</p>
    </div>

    <div class="entry-utils group">
        <a href="http://themakeupblogger.com/10-memorial-day-beauty-essentials-belong-beach-bag/" class="read-more">Continue Reading</a>      
    </div>
  
</article>

Forgive me if I've referred to any of the elements by the wrong name.
 
Whitespace is treated as a node as well. If you run something like the code below, you will see that the first element in the "article" collection has the following child nodes.
Code:
Set elem = html.getElementsByTagName("article")(0).ChildNodes
For i = 0 To elem.Length - 1
    Debug.Print elem(i).nodeName
Next

Code:
#text
DIV
#text
H1
#text
DIV
#text
DIV
#text
DIV
#text
DIV
#text

So the LastChild that you are accessing is a #text node, and its PreviousSibling is the last DIV.

By using below code...
Code:
Set elem = html.getElementsByTagName("article")(0).LastChild.PreviousSibling
Debug.Print elem.innerHTML

You will see that there is no "p" tag inside it. Hence the error.
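To make the point concrete, here is a minimal sketch of one way to step over the whitespace nodes. PrevElement and PrintSummary are hypothetical names made up for this illustration, and the sketch assumes the "article" markup matches the snippet posted above, including the trailing whitespace after the last DIV.

```vba
' Hypothetical helper (not part of the original macro): walks backwards
' from a node and returns the nearest previous sibling that is an element.
Function PrevElement(ByVal node As Object) As Object
    Dim cur As Object
    Set cur = node.PreviousSibling
    Do While Not cur Is Nothing
        If cur.NodeType = 1 Then Exit Do    ' 1 = element node, 3 = text node
        Set cur = cur.PreviousSibling
    Loop
    Set PrevElement = cur                   ' Nothing if no element was found
End Function

' Hypothetical caller: pass in the document obtained from IE as in the macro above.
Sub PrintSummary(ByVal html As Object)
    Dim article As Object, target As Object
    Set article = html.getElementsByTagName("article")(0)

    ' LastChild is #text and its PreviousSibling is the last DIV ("entry-utils"),
    ' so one more element step back should be the DIV holding the paragraph.
    Set target = PrevElement(article.LastChild.PreviousSibling)

    If Not target Is Nothing Then
        With target.getElementsByTagName("p")
            If .Length Then Debug.Print .Item(0).innerText
        End With
    End If
End Sub
```

Another option, as far as I know, is the Children collection, which holds element nodes only, so article.Children(article.Children.Length - 2) should reach the same DIV without any sibling walking.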

I'd suggest you spend some time studying how nodes are accessed on W3Schools.
https://www.w3schools.com/jsref/prop_node_lastchild.asp
https://www.w3schools.com/jsref/prop_node_childnodes.asp
https://www.w3schools.com/jsref/prop_node_previoussibling.asp
 
The best explanation of the DOM so far. What I could not learn about the DOM in the last two years, I just got from you. I owe you big time, sir. Thanks a lot.
 
One last thing sir:

This one works:
Code:
Sub demo()

    Dim http As New InternetExplorer, html As HTMLDocument
    Dim r As Long, elem As Object

    With http
        .Visible = False
        .navigate "http://themakeupblogger.com/makeup/"
        Do While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Loop
        Set html = .document
    End With

    For Each elem In html.getElementsByTagName("article")
        With elem.getElementsByTagName("h1")
            If .Length Then r = r + 1: Cells(r, 1) = .Item(0).getElementsByTagName("a")(0).innerText
        End With
        With elem.getElementsByTagName("div")(3).getElementsByTagName("p")
            If .Length Then Cells(r, 2) = .Item(0).innerText
        End With
    Next elem

End Sub

But this one doesn't:
Code:
Sub demo()

    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim r As Long, elem As Object

    With http
        .Open "GET", "http://themakeupblogger.com/makeup/", False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each elem In html.getElementsByTagName("article")
        With elem.getElementsByTagName("h1")
            If .Length Then r = r + 1: Cells(r, 1) = .Item(0).getElementsByTagName("a")(0).innerText
        End With
        With elem.getElementsByTagName("div")(3).getElementsByTagName("p")
            If .Length Then Cells(r, 2) = .Item(0).innerText
        End With
    Next elem

End Sub

I really can't find any reason why my second script doesn't work when the first one does. If I don't include the "article" tag container, then it works. Could you point out why it is not working? Again, sorry to bother you, sir, and thanks a trillion for the help you always provide when I'm in need.
 
If you use the following in each of your procedures, you will see the difference between the two.
Code:
    For Each elem In html.getElementsByTagName("article")
        Debug.Print TypeName(elem)
    Next elem

One will return HTMLSemanticElement and the other HTMLUnknownElement. Hence the issue.

I don't know the exact cause, but IE is likely interpreting something before the HTML is exposed through .document.

If you output html.body.innerHTML from each procedure, you will see the difference in structure.

xmlhttp
[Screenshot: html.body.innerHTML output, upload_2017-11-28_15-4-53.png]

IE automation
[Screenshot: html.body.innerHTML output, upload_2017-11-28_15-5-24.png]
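Since the question mentions that the XMLHTTP version works once the "article" container is left out, here is a rough sketch of that idea. It walks every element in document order and matches on class names instead of nesting. The procedure name demoWithoutArticle is made up, and the class names "entry-title" and "entry-content" are taken from the posted snippet on the assumption that every post on the page uses them. It needs the same library references as the macros above.

```vba
Sub demoWithoutArticle()

    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim r As Long, elem As Object

    With http
        .Open "GET", "http://themakeupblogger.com/makeup/", False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Walk all elements in document order, so a title is always written
    ' before the summary that follows it, without relying on "article".
    For Each elem In html.all
        Select Case UCase$(elem.tagName)
            Case "H1"
                If elem.className = "entry-title" Then
                    r = r + 1
                    Cells(r, 1) = elem.innerText
                End If
            Case "DIV"
                If elem.className = "entry-content" And r > 0 Then
                    With elem.getElementsByTagName("p")
                        If .Length Then Cells(r, 2) = .Item(0).innerText
                    End With
                End If
        End Select
    Next elem

End Sub
```

This is only a workaround for the difference in structure. It does not explain why the two documents parse "article" differently.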
 
A few days back, when I came across some messy text while running a certain script, Narayan provided me with a very helpful piece of code to clean it. However, as you said, and according to the links you provided, some "text" within the "article" tag is being counted as a child. So, to shake off the whitespace and the disorganized pattern, I tried the code below, but it doesn't seem to work. I googled a lot to find out how to clean the source code but could not find anything.

Code:
Sub demo()

    Dim http As New InternetExplorer, html As HTMLDocument
    Dim elem As Object

    With http
        .Visible = False
        .navigate "http://themakeupblogger.com/makeup/"
        Do While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Loop
        Set html = Trim(WorksheetFunction.Clean(.document)) ''Tried with this
    End With
  
    Set elem = html.getElementsByTagName("article")(0).ChildNodes
    For i = 0 To elem.Length - 1
        Debug.Print elem(i).nodeName
    Next

End Sub
 
You don't need to clean it. Node traversal (the DOM structure) isn't particularly flexible, nor is it easy to use for scraping purposes. It's more useful when you are parsing raw data and pushing the information into a structured document.

For scraping, make use of string manipulation and other techniques, as they are often faster and more flexible.
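As a small illustration of the string-manipulation route, here is a minimal sketch that pulls out the text between two markers with InStr and Mid$. TextBetween is a hypothetical helper written for this example, and the sample string is made up. With real pages the markers would have to match the source exactly.

```vba
' Hypothetical helper: returns the text found between startTag and endTag,
' or an empty string if either marker is missing.
Function TextBetween(ByVal src As String, ByVal startTag As String, _
                     ByVal endTag As String, Optional ByVal fromPos As Long = 1) As String
    Dim p1 As Long, p2 As Long

    p1 = InStr(fromPos, src, startTag, vbTextCompare)
    If p1 = 0 Then Exit Function
    p1 = p1 + Len(startTag)                 ' move past the opening marker

    p2 = InStr(p1, src, endTag, vbTextCompare)
    If p2 = 0 Then Exit Function

    TextBetween = Mid$(src, p1, p2 - p1)
End Function

Sub demoTextBetween()
    Dim src As String
    src = "<div class=""entry-content""><p>Sample summary</p></div>"
    Debug.Print TextBetween(src, "<p>", "</p>")    ' prints: Sample summary
End Sub
```

The same function could be pointed at .responseText from the XMLHTTP request, with fromPos advanced after each match to collect every post on the page.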
 
Yes sir, I will for sure. You have been a great help to me today, the day before, and so on. The information you provided in this thread was clearly my best lesson. Thanks a lot, sir.
 