Using PreviousSibling

YasserKhalil · Feb 15, 2018

Hello everyone
I am trying to scrape a web page that contains an element like the one shown here.

The green rectangle marks the class name, which I identified easily. I need to go back from it, using the previous sibling, to reach the part in the blue rectangle.

This is the code I used:

Code:

For Each eRow In .document.getElementById("PropertySummary_FactsTable").getElementsByTagName("tr")
    'Debug.Print eRow.innerText
   
    If InStr(eRow.innerText, "Garage") > 0 And Not (InStr(eRow.innerText, "Garage (spaces)") > 0) Then
        MsgBox eRow.getElementsByClassName("changed").PreviousSibling.innerhtml
        Stop
    End If
Next

I tried some combinations, but none of them worked for me.
Any ideas, please?

Chihiro · Feb 15, 2018

Have a read of this link.
https://chandoo.org/forum/threads/h...elational-structure-of-dom.36544/#post-219208

Use FreeFile() with Open and Print to export the innerHTML to a text file and inspect how it's structured.

By the way, using DOM node structure to scrape data is very inefficient and a pain in the rear. It's usually much easier to scrape using RegEx, string manipulation, etc.

DOM node structure is more useful when you have to pass info to a web site using a structured document (usually XML).

YasserKhalil · Feb 15, 2018

Thanks a lot for the reply.
Can you suggest a proper line that refers to the tag just before the one with the class "changed"?

This line works fine

Code:

MsgBox eRow.getElementsByClassName("changed")(0).innerhtml

but when I add PreviousSibling it doesn't work.

Chihiro · Feb 15, 2018

It depends on the source document structure.
Using the FreeFile approach, export the entire document.body.innerHTML to a text file and upload it.
Without it, it's too much of a guessing game.

YasserKhalil · Feb 16, 2018

Thanks a lot
Here's the HTML document

YasserKhalil · Feb 16, 2018

Any ideas on this topic, please?

Chihiro · Feb 16, 2018

Patience is a virtue. As you are aware, I'm in a different time zone.

At any rate, my bad. I should have said, export full .Document to text.
Not just .body.innerHtml.

YasserKhalil · Feb 16, 2018

Thanks a lot for the reply.
This is the only way I know. How do I export the full document?
Should I just drop body, like this?
ie.document

Chihiro · Feb 16, 2018

I assume you are piloting IE? Then just right-click on the page, choose "View Source" and copy the result to a text editor (I recommend Notepad++).

YasserKhalil · Feb 16, 2018

Thanks a lot.
As a matter of curiosity, how do I export the full document using code?
I was using these lines:

Code:

Dim f As Integer
f = FreeFile()
Open ThisWorkbook.Path & "\Sample.txt" For Output As #f
Print #f, .document.body.innerHTML
Close #f

Chihiro · Feb 16, 2018

You are piloting IE, so just use its Save As command.

Code:

        ie.ExecWB OLECMDID_SAVEAS, OLECMDEXECOPT_DONTPROMPTUSER

Not sure if you can skip the prompt to save. Maybe just use SendKeys.
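
Another option that stays within the FreeFile approach is to print documentElement.outerHTML instead of body.innerHTML. This is only a minimal sketch: it assumes ie is an InternetExplorer object that has already finished loading the page, and both the procedure name ExportFullDocument and the file name FullPage.txt are placeholders. The output starts at the <html> tag, so a doctype line, if the page has one, is not included.

```vba
' Hypothetical helper: ie is assumed to be an already-loaded InternetExplorer object.
Sub ExportFullDocument(ByVal ie As Object)
    Dim f As Integer
    f = FreeFile()
    ' FullPage.txt is a placeholder file name
    Open ThisWorkbook.Path & "\FullPage.txt" For Output As #f
    ' documentElement is the <html> node, so outerHTML covers <head> and <body>
    Print #f, ie.document.documentElement.outerHTML
    Close #f
End Sub
```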

YasserKhalil · Feb 16, 2018

I was searching on that issue today, and this is the thread I found:
http://www.eileenslounge.com/viewtopic.php?f=30&t=29168

Chihiro · Feb 16, 2018

So, in order to use PreviousSibling, you first need to get the ChildNodes collection of the parent object.

Ex:

Code:

Set node1 = html.getElementsByTagName("Tbody")(0)
Set mChildren = node1.ChildNodes

So if you want to access previous sibling of node with innertext that contains "Garage"...

Code:

Set node1 = html.getElementsByTagName("Tbody")(0)
Set mChildren = node1.ChildNodes
For i = 0 To mChildren.Length - 1
    If InStr(mChildren(i).innerText, "Garage") Then
        Debug.Print mChildren(i).PreviousSibling.innerText
    End If
Next

But as I have said before, this is a really circuitous way of getting at the info you are after.
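
For the line asked about earlier, note that getElementsByClassName returns a collection, so an item has to be picked with (0) before PreviousSibling is available. The following is only a sketch: the helper name PrevElement and the sample markup are made up for illustration, and the loop that skips non-element nodes is an assumption about pages where text nodes sit between the cells.

```vba
' Hypothetical helper: returns the nearest preceding sibling that is an element,
' or Nothing if there is none.
Function PrevElement(ByVal node As Object) As Object
    Dim sib As Object
    Set sib = node.previousSibling
    ' nodeType 1 = element; anything else (text, comment) is skipped
    Do While Not sib Is Nothing
        If sib.nodeType = 1 Then Exit Do
        Set sib = sib.previousSibling
    Loop
    Set PrevElement = sib
End Function

Sub DemoPrevElement()
    ' Requires a reference to the Microsoft HTML Object Library
    Dim html As New HTMLDocument
    Dim cells As Object, prev As Object
    ' Made-up markup, not taken from the real page
    html.body.innerHTML = "<table><tr><td>Garage</td><td class=""changed"">2</td></tr></table>"
    Set cells = html.getElementsByClassName("changed")
    If cells.Length > 0 Then
        ' Index into the collection first, then step back one sibling
        Set prev = PrevElement(cells(0))
        If Not prev Is Nothing Then Debug.Print prev.innerText
    End If
End Sub
```

Checking for Nothing matters because the first cell in a row has no previous sibling.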

YasserKhalil · Feb 16, 2018

Thanks a lot.
I tried the previous lines but got an error at this line:

Code:

If InStr(mChildren(i).innerText, "Garage") Then

* If there is a more direct way, that would be better for sure.

Chihiro · Feb 16, 2018

Well, it really depends on what's in the source document.

In most cases you don't want to check innerText as that's dependent on how the site populates values.

Using your sample text file, this is how I did it.

Code:

Sub Demo()
    ' Requires a reference to the Microsoft HTML Object Library
    Dim html As New HTMLDocument
    Dim fPath As String: fPath = "C:\test\Sample.txt"
    Dim strContent As String
    Dim intFF As Integer: intFF = FreeFile()
    Dim mChildren As Object, node1 As Object
    Dim i As Long

    ' Read the saved HTML into a string
    Open fPath For Input As #intFF
    strContent = Input(LOF(intFF), intFF)
    Close #intFF

    html.body.innerHTML = strContent

    ' Walk the child nodes (rows) of the first TBODY
    Set node1 = html.getElementsByTagName("Tbody")(0)
    Set mChildren = node1.ChildNodes
    For i = 0 To mChildren.Length - 1
        If InStr(mChildren(i).innerText, "Garage") > 0 Then
            Debug.Print mChildren(i).PreviousSibling.innerText
        End If
    Next
End Sub

YasserKhalil · Feb 16, 2018

Thanks a lot, Mr. Chihiro, for your patience.
First, I can't find any tbody in the text file (Sample.txt).

Second, when testing the code I got this string: "Lot Dimensions7018 SF ".
Have a look at the snapshot, please.

Chihiro · Feb 16, 2018

Well... my code is just a sample of "HOW" to use nodes.

There's more than one td class="changed"...

As well...
When the text contains a <table> tag and is loaded into an HTMLDocument, the parser creates a TBODY inside the Table element, even if the source text has none.

You can test this by setting node1 = html.getElementsByTagName("Table")(0)
and printing its innerHTML.
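
A minimal sketch of that test, using a made-up one-row table rather than the uploaded sample (the procedure name ShowImplicitTbody is hypothetical):

```vba
Sub ShowImplicitTbody()
    ' Requires a reference to the Microsoft HTML Object Library
    Dim html As New HTMLDocument
    Dim node1 As Object
    ' Made-up markup with no <tbody> in the source text
    html.body.innerHTML = "<table><tr><td>Garage</td><td>2</td></tr></table>"
    Set node1 = html.getElementsByTagName("Table")(0)
    ' Inspect the Immediate window to see what the parser built inside the table
    Debug.Print node1.innerHTML
End Sub
```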

Chihiro · Feb 16, 2018

At any rate, this is part of the reason why I said node isn't the ideal method of scraping info.

Code:

Sub Demo()
    ' Requires a reference to the Microsoft HTML Object Library
    Dim html As New HTMLDocument
    Dim fPath As String: fPath = "C:\test\Sample.txt"
    Dim strContent As String
    Dim intFF As Integer: intFF = FreeFile()
    Dim mtbl As Object, trow As Object
    Dim ar
    Dim i As Long

    ' Read the saved HTML into a string
    Open fPath For Input As #intFF
    strContent = Input(LOF(intFF), intFF)
    Close #intFF

    html.body.innerHTML = strContent

    ' Copy the first two cells of every row into a 2-column array
    Set mtbl = html.getElementsByTagName("Table")(0)
    i = 1
    ReDim ar(1 To mtbl.getElementsByTagName("tr").Length, 1 To 2)
    For Each trow In mtbl.getElementsByTagName("tr")
        ar(i, 1) = trow.ChildNodes(0).innerText
        ar(i, 2) = trow.ChildNodes(1).innerText
        i = i + 1
    Next

    ' Write the array to the sheet starting at A1
    [a1].Resize(UBound(ar), 2) = ar
End Sub

Then just use the Match function or some other logic on the array to find the value you need.
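
As a rough sketch of the Match idea, assuming ar is the 1-based, two-column array built above with labels in column 1 and values in column 2 (the function name LookupValue is hypothetical):

```vba
' Hypothetical helper: looks up a label in column 1 of a 2-D, 1-based array
' and returns the matching value from column 2, or "" if the label is not found.
Function LookupValue(ByVal ar As Variant, ByVal label As String) As String
    Dim pos As Variant
    ' Index(ar, 0, 1) extracts the first column; the final 0 asks Match for an exact match
    pos = Application.Match(label, Application.Index(ar, 0, 1), 0)
    If IsError(pos) Then
        LookupValue = ""
    Else
        LookupValue = ar(pos, 2)
    End If
End Function
```

It could be called as LookupValue(ar, "Garage"). If the scraped labels carry trailing spaces, a wildcard such as "Garage*" may be needed, but that would also match "Garage (spaces)", so the label text should be checked first.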

YasserKhalil · Feb 16, 2018

Thank you very much, Mr. Chihiro, for this great code. I have learned a lot from the last example.

* The code you devised assumes two columns in the table, and that's true for this example. The problem is that other pages have the same table but with three columns, not two. That's why I tended to rely on the class "changed" and grab the td just before it.
I hope the issue is clear now.

Chihiro · Feb 16, 2018

Give an example of such a page (same format as the Sample.txt you uploaded).

Likely this site uses the last column as a drop-down selection column.

So loop through all rows and find the maximum number of cells in a row.
Then subtract 1 from that maximum. That should be your column dimension for the array (1 to row.Length, 1 to maxColumn.length - 1).

You'd need to loop twice, but it's a fast loop and won't impact performance much.

YasserKhalil · Feb 16, 2018

Thank you for guiding me.
Here's another example. I will try to do it myself and get back to you.

Here Garage comes out as "-", which is not the desired output.

Chihiro · Feb 16, 2018

Here.

Code:

Sub Demo()
    ' Requires a reference to the Microsoft HTML Object Library
    Dim html As New HTMLDocument
    Dim fPath As String: fPath = "C:\test\Sample2.txt"
    Dim strContent As String
    Dim intFF As Integer: intFF = FreeFile()
    Dim mtbl As Object, trow As Object
    Dim ar
    Dim i As Long, maxCol As Long, j As Long
    maxCol = 0

    ' Read the saved HTML into a string
    Open fPath For Input As #intFF
    strContent = Input(LOF(intFF), intFF)
    Close #intFF

    html.body.innerHTML = strContent

    ' First pass: find the largest number of td cells in any row
    Set mtbl = html.getElementsByTagName("Table")(0)
    For Each trow In mtbl.getElementsByTagName("tr")
        If trow.getElementsByTagName("td").Length > maxCol Then
            maxCol = trow.getElementsByTagName("td").Length
        End If
    Next

    ' Second pass: fill the array, leaving out the last column
    i = 1
    ReDim ar(1 To mtbl.getElementsByTagName("tr").Length, 1 To maxCol - 1)
    On Error Resume Next   ' rows with fewer cells leave the remaining entries empty
    For Each trow In mtbl.getElementsByTagName("tr")
        For j = 1 To maxCol - 1
            ar(i, j) = trow.ChildNodes(j - 1).innerText
        Next
        i = i + 1
    Next
    On Error GoTo 0

    ' Write the array to the sheet starting at A1
    [a1].Resize(UBound(ar), maxCol - 1) = ar
End Sub

YasserKhalil · Feb 16, 2018

That's amazing, my tutor. Thank you very, very much for these valuable solutions.
Best and kind regards

shahin · Feb 16, 2018

Could this be considered as well? I don't know the expected output, so I took a guess. I tried it with the document attached in post 5.

Code:

Sub LoadFile()
    Dim strContent As String, HTML As New HTMLDocument, post As Object, r As Long

    Open "C:\Users\ar\Desktop\HTML Document.txt" For Binary As #1
    strContent = Space$(LOF(1))
    Get #1, , strContent
    Close #1
    HTML.body.innerHTML = strContent

    For Each post In HTML.getElementsByClassName("title")
        r = r + 1: Cells(r, 1) = post.innerText
        Cells(r, 2) = post.NextSibling.innerText
    Next post
End Sub

Chihiro · Feb 16, 2018

That won’t work on the 3-column table.
Though you should be able to apply the same double-loop logic.
