1. Welcome to Chandoo.org Forums. Short message for you

    Hi Guest,

    Thanks for joining Chandoo.org forums. We are here to make you awesome in Excel. Before you post your first question, please read this short introduction guide. When posting or responding to questions please remember our values at Chandoo.org are: Humility, Passion, Fun, Awesomeness, Simplicity, Sharing Remember that we have people here for whom English is not there first language and we need to allow for this in our dealings.

    Yours,
    Chandoo
  2. Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  3. When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Using PreviousSibling

Discussion in 'VBA Macros' started by YasserKhalil, Feb 15, 2018.

  1. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Hello everyone
    I am trying to scrape a web page and there will be an element like that
    PreviousSibling.png

    The green rectangle is the class name that I identified easily .. and I need to go back using previous sibling to refer to the blue rectangle part

    I used that part
    Code (vb):
    For Each eRow In .document.getElementById("PropertySummary_FactsTable").getElementsByTagName("tr")
        'Debug.Print eRow.innerText
     
        If InStr(eRow.innerText, "Garage") > 0 And Not (InStr(eRow.innerText, "Garage (spaces)") > 0) Then
            MsgBox eRow.getElementsByClassName("changed").PreviousSibling.innerhtml
            Stop
        End If
    Next
    and I tried some combinations but didn't work for me
    Any ideas please
  2. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    Have a read of link.
    https://chandoo.org/forum/threads/h...elational-structure-of-dom.36544/#post-219208

    Use the FreeFile() method to to export out innerHTML and inspect how it's structured.

    By the way, using DOM node structure to scrape data is very inefficient and pain in the rear. It's usually much easier to scrape using RegEx, string manipulation etc.

    DOM node structure is more useful when you have to pass info to web site using structured document (usually xml).
    YasserKhalil likes this.
  3. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thanks a lot for reply ..
    Can you suggest me a proper line so as to be able to refer to the previous tag name before the class "changed"

    This line works fine
    Code (vb):
    MsgBox eRow.getElementsByClassName("changed")(0).innerhtml
    but when using previoussibling it doesn't work
  4. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    It depends on source document structure.
    Using FreeFile method export the entire document.body.innerHTML to text file and upload.
    Without it, it's too much of guessing game.
  5. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thanks a lot
    Here's the HTML document

    Attached Files:

  6. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Any idea in this topic please
  7. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    Patience is a virtue. As you are aware, I'm in different timezone.

    At any rate, my bad. I should have said, export full .Document to text.
    Not just .body.innerHtml.
    YasserKhalil likes this.
  8. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thanks a lot for reply
    I just know this way .. How to export full document
    Should I replace body with nothing in that way
    ie.document
  9. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    I assume you are piloting IE? Then just right click on page, "View Source" and copy result to text editor (I recommend NotePad ++).
  10. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thanks a lot ..
    As a matter of curiosity how to export the full document using the codes ..?
    I was using these lines
    Code (vb):
    Dim f As Integer
            f = FreeFile()
            Open ThisWorkbook.Path & "\Sample.txt" For Output As #f
            Print #f, .document.body.innerHTML
            Close #f

    Attached Files:

  11. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    You are piloting IE so just use it's saveas.

    Code (vb):
            ie.ExecWB OLECMDID_SAVEAS, OLECMDEXECOPT_DONTPROMPTUSER
    Not sure if you can skip the prompt to save. May be just use sendkey.
    YasserKhalil likes this.
  12. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
  13. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    So in order to use PreviousSibling

    You need to first set ChildNodes collection of object.

    Ex:
    Code (vb):
    Set node1 = html.getElementsByTagName("Tbody")(0)
    Set mChildren = node1.ChildNodes
    So if you want to access previous sibling of node with innertext that contains "Garage"...

    Code (vb):
    Set node1 = html.getElementsByTagName("Tbody")(0)
    Set mChildren = node1.ChildNodes
    For i = 0 To mChildren.Length - 1
        If InStr(mChildren(i).innerText, "Garage") Then
            Debug.Print mChildren(i).PreviousSibling.innerText
        End If
    Next
     
    But as I have said before, this is really circuitous way of getting at the info you are after.
    YasserKhalil likes this.
  14. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thanks a lot ..
    I tried the previous lines but encountered an error at this line
    Code (vb):
    If InStr(mChildren(i).innerText, "Garage") Then
    * If there is a direct way it would be better for sure
  15. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    Well, it really depends on what's in the source document.

    In most cases you don't want to check innerText as that's dependent on how the site populates values.

    Using your sample text file. This is how I did it.
    Code (vb):
    Sub Demo()
    Dim html As New HTMLDocument
    Dim fPath As String: fPath = "C:\test\Sample.txt"
    Dim strContent As String
    Dim intFF As Integer: intFF = FreeFile()
    Dim mChildren As Object, node1 as Object
    Open fPath For Input As #intFF
    strContent = Input(LOF(intFF), intFF)
    Close #intFF

    html.body.innerHTML = strContent

    Set node1 = html.getElementsByTagName("Tbody")(0)
    Set mChildren = node1.ChildNodes
    For i = 0 To mChildren.Length - 1
        If InStr(mChildren(i).innerText, "Garage") Then
            Debug.Print mChildren(i).PreviousSibling.innerText
        End If
    Next

    End Sub
    YasserKhalil likes this.
  16. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thanks a lot Mr. Chihiro for your patience ..
    First I can't find any tbody in the text file (Sample.txt)

    Second when testing the code I got this string "Lot Dimensions7018 SF " ..
    Have a look at the snapshot please
    Changed.png
  17. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    Well... my code is just a sample of "HOW" to use nodes.

    There's more than 1 td class="changed"...

    As well...
    When you have <table> tag in the text and when it's loaded to html.documents. This process will create TBODY inside Table element.

    You can test this by setting node1 = html.getElementsByTagName("Table")(0)
    and printing innerHTML.

    upload_2018-2-16_11-37-14.png
  18. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    At any rate, this is part of the reason why I said node isn't the ideal method of scraping info.

    Code (vb):
    Sub Demo()
    Dim html As New HTMLDocument
    Dim fPath As String: fPath = "C:\test\Sample.txt"
    Dim strContent As String
    Dim intFF As Integer: intFF = FreeFile()
    Dim elem As Object
    Dim ar
    Open fPath For Input As #intFF
    strContent = Input(LOF(intFF), intFF)
    Close #intFF

    html.body.innerHTML = strContent

    Set mtbl = html.getElementsByTagName("Table")(0)
    i = 1
    ReDim ar(1 To mtbl.getElementsByTagName("tr").Length, 1 To 2)
    For Each trow In mtbl.getElementsByTagName("tr")
        ar(i, 1) = trow.ChildNodes(0).innerText
        ar(i, 2) = trow.ChildNodes(1).innerText
        i = i + 1
    Next

    [a1].Resize(UBound(ar), 2) = ar
    End Sub
    Then just use Match function or some other logic on array to find the value you need.
    YasserKhalil likes this.
  19. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thank you very much Mr. Chihro for this great code. I have learned a lot from the last code

    * As you have devsied the code you supposed two columns inside the table and that's true for this example .. The problem is that there are other pages which has the same table but with three columns not only two .. so I tended to depend on the class "changed" and grab the previous tag name td
    Hope the issue is clear now
  20. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    Give example of such (same format as sample.txt you uploaded).

    Likely this site uses last column as drop down selection column.

    So if you loop through all rows and check for max length of the row.
    Then subtract 1 from the max length. That should be your column dimension for the array (1 to row.Length, 1 to maxColumn.length - 1).

    You'd need to loop twice, but it's fast loop and won't impact performance much.
  21. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    Thank you for guiding me ..
    Here's another example and I will try to do it my self and back to you

    Here Garage equals to "-" and this is not the desired output

    Attached Files:

  22. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    Here.
    Code (vb):
    Sub Demo()
    Dim html As New HTMLDocument
    Dim fPath As String: fPath = "C:\test\Sample2.txt"
    Dim strContent As String
    Dim intFF As Integer: intFF = FreeFile()
    Dim ar
    Dim i As Long, maxCol As Long, j As Long
    maxCol = 0
    Open fPath For Input As #intFF
    strContent = Input(LOF(intFF), intFF)
    Close #intFF

    html.body.innerHTML = strContent

    Set mtbl = html.getElementsByTagName("Table")(0)
    For Each trow In mtbl.getElementsByTagName("tr")
        If trow.getElementsByTagName("td").Length > maxCol Then
            maxCol = trow.getElementsByTagName("td").Length
        End If
    Next
    i = 1
    ReDim ar(1 To mtbl.getElementsByTagName("tr").Length, 1 To maxCol - 1)
    For Each trow In mtbl.getElementsByTagName("tr")
        On Error Resume Next
        For j = 1 To maxCol - 1
            ar(i, j) = trow.ChildNodes(j - 1).innerText
        Next
        i = i + 1
    Next

    [a1].Resize(UBound(ar), maxCol - 1) = ar
    End Sub
    YasserKhalil likes this.
  23. YasserKhalil

    YasserKhalil Active Member

    Messages:
    989
    That's amazing my tutor. Thank you very very much for this valuable solutions
    Best and kind regards
  24. shahin

    shahin Active Member

    Messages:
    887
    Is it something which can be considered as well? I don't know the to-be output. However, took a guess. I have tried with the document attached in post 5.
    Code (vb):

    Sub LoadFile()
        Dim strContent As String, HTML As New HTMLDocument, post As Object

        Open "C:\Users\ar\Desktop\HTML Document.txt" For Binary As #1
        strContent = Space$(LOF(1))
        Get #1, , strContent
        Close #1
        HTML.body.innerHTML = strContent

        For Each post In HTML.getElementsByClassName("title")
            r = r + 1: Cells(r, 1) = post.innerText
            Cells(r, 2) = post.NextSibling.innerText
        Next post
    End Sub
     
    YasserKhalil likes this.
  25. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,820
    That won’t work on 3 column.
    Though you should be able to do same double loop logic

Share This Page