• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Using PreviousSibling

YasserKhalil

Well-Known Member
Hello everyone
I am trying to scrape a web page and there will be an element like that
PreviousSibling.png

The green rectangle is the class name that I identified easily .. and I need to go back using previous sibling to refer to the blue rectangle part

I used that part
Code:
For Each eRow In .document.getElementById("PropertySummary_FactsTable").getElementsByTagName("tr")
    'Debug.Print eRow.innerText
   
    If InStr(eRow.innerText, "Garage") > 0 And Not (InStr(eRow.innerText, "Garage (spaces)") > 0) Then
        MsgBox eRow.getElementsByClassName("changed").PreviousSibling.innerhtml
        Stop
    End If
Next

and I tried some combinations but didn't work for me
Any ideas please
 
Have a read of link.
https://chandoo.org/forum/threads/h...elational-structure-of-dom.36544/#post-219208

Use the FreeFile() method to to export out innerHTML and inspect how it's structured.

By the way, using DOM node structure to scrape data is very inefficient and pain in the rear. It's usually much easier to scrape using RegEx, string manipulation etc.

DOM node structure is more useful when you have to pass info to web site using structured document (usually xml).
 
Thanks a lot for reply ..
Can you suggest me a proper line so as to be able to refer to the previous tag name before the class "changed"

This line works fine
Code:
MsgBox eRow.getElementsByClassName("changed")(0).innerhtml
but when using previoussibling it doesn't work
 
It depends on source document structure.
Using FreeFile method export the entire document.body.innerHTML to text file and upload.
Without it, it's too much of guessing game.
 
Patience is a virtue. As you are aware, I'm in different timezone.

At any rate, my bad. I should have said, export full .Document to text.
Not just .body.innerHtml.
 
Thanks a lot for reply
I just know this way .. How to export full document
Should I replace body with nothing in that way
ie.document
 
I assume you are piloting IE? Then just right click on page, "View Source" and copy result to text editor (I recommend NotePad ++).
 
Thanks a lot ..
As a matter of curiosity how to export the full document using the codes ..?
I was using these lines
Code:
Dim f As Integer
        f = FreeFile()
        Open ThisWorkbook.Path & "\Sample.txt" For Output As #f
        Print #f, .document.body.innerHTML
        Close #f
 

Attachments

  • Sample.txt
    131.7 KB · Views: 4
You are piloting IE so just use it's saveas.

Code:
        ie.ExecWB OLECMDID_SAVEAS, OLECMDEXECOPT_DONTPROMPTUSER

Not sure if you can skip the prompt to save. May be just use sendkey.
 
So in order to use PreviousSibling

You need to first set ChildNodes collection of object.

Ex:
Code:
Set node1 = html.getElementsByTagName("Tbody")(0)
Set mChildren = node1.ChildNodes

So if you want to access previous sibling of node with innertext that contains "Garage"...

Code:
Set node1 = html.getElementsByTagName("Tbody")(0)
Set mChildren = node1.ChildNodes
For i = 0 To mChildren.Length - 1
    If InStr(mChildren(i).innerText, "Garage") Then
        Debug.Print mChildren(i).PreviousSibling.innerText
    End If
Next

But as I have said before, this is really circuitous way of getting at the info you are after.
 
Thanks a lot ..
I tried the previous lines but encountered an error at this line
Code:
If InStr(mChildren(i).innerText, "Garage") Then

* If there is a direct way it would be better for sure
 
Well, it really depends on what's in the source document.

In most cases you don't want to check innerText as that's dependent on how the site populates values.

Using your sample text file. This is how I did it.
Code:
Sub Demo()
Dim html As New HTMLDocument
Dim fPath As String: fPath = "C:\test\Sample.txt"
Dim strContent As String
Dim intFF As Integer: intFF = FreeFile()
Dim mChildren As Object, node1 as Object
Open fPath For Input As #intFF
strContent = Input(LOF(intFF), intFF)
Close #intFF

html.body.innerHTML = strContent

Set node1 = html.getElementsByTagName("Tbody")(0)
Set mChildren = node1.ChildNodes
For i = 0 To mChildren.Length - 1
    If InStr(mChildren(i).innerText, "Garage") Then
        Debug.Print mChildren(i).PreviousSibling.innerText
    End If
Next

End Sub
 
Thanks a lot Mr. Chihiro for your patience ..
First I can't find any tbody in the text file (Sample.txt)

Second when testing the code I got this string "Lot Dimensions7018 SF " ..
Have a look at the snapshot please
Changed.png
 
Well... my code is just a sample of "HOW" to use nodes.

There's more than 1 td class="changed"...

As well...
When you have <table> tag in the text and when it's loaded to html.documents. This process will create TBODY inside Table element.

You can test this by setting node1 = html.getElementsByTagName("Table")(0)
and printing innerHTML.

upload_2018-2-16_11-37-14.png
 
At any rate, this is part of the reason why I said node isn't the ideal method of scraping info.

Code:
Sub Demo()
Dim html As New HTMLDocument
Dim fPath As String: fPath = "C:\test\Sample.txt"
Dim strContent As String
Dim intFF As Integer: intFF = FreeFile()
Dim elem As Object
Dim ar
Open fPath For Input As #intFF
strContent = Input(LOF(intFF), intFF)
Close #intFF

html.body.innerHTML = strContent

Set mtbl = html.getElementsByTagName("Table")(0)
i = 1
ReDim ar(1 To mtbl.getElementsByTagName("tr").Length, 1 To 2)
For Each trow In mtbl.getElementsByTagName("tr")
    ar(i, 1) = trow.ChildNodes(0).innerText
    ar(i, 2) = trow.ChildNodes(1).innerText
    i = i + 1
Next

[a1].Resize(UBound(ar), 2) = ar
End Sub

Then just use Match function or some other logic on array to find the value you need.
 
Thank you very much Mr. Chihro for this great code. I have learned a lot from the last code

* As you have devsied the code you supposed two columns inside the table and that's true for this example .. The problem is that there are other pages which has the same table but with three columns not only two .. so I tended to depend on the class "changed" and grab the previous tag name td
Hope the issue is clear now
 
Give example of such (same format as sample.txt you uploaded).

Likely this site uses last column as drop down selection column.

So if you loop through all rows and check for max length of the row.
Then subtract 1 from the max length. That should be your column dimension for the array (1 to row.Length, 1 to maxColumn.length - 1).

You'd need to loop twice, but it's fast loop and won't impact performance much.
 
Thank you for guiding me ..
Here's another example and I will try to do it my self and back to you

Here Garage equals to "-" and this is not the desired output
 

Attachments

  • Another Example.txt
    156.3 KB · Views: 3
Here.
Code:
Sub Demo()
Dim html As New HTMLDocument
Dim fPath As String: fPath = "C:\test\Sample2.txt"
Dim strContent As String
Dim intFF As Integer: intFF = FreeFile()
Dim ar
Dim i As Long, maxCol As Long, j As Long
maxCol = 0
Open fPath For Input As #intFF
strContent = Input(LOF(intFF), intFF)
Close #intFF

html.body.innerHTML = strContent

Set mtbl = html.getElementsByTagName("Table")(0)
For Each trow In mtbl.getElementsByTagName("tr")
    If trow.getElementsByTagName("td").Length > maxCol Then
        maxCol = trow.getElementsByTagName("td").Length
    End If
Next
i = 1
ReDim ar(1 To mtbl.getElementsByTagName("tr").Length, 1 To maxCol - 1)
For Each trow In mtbl.getElementsByTagName("tr")
    On Error Resume Next
    For j = 1 To maxCol - 1
        ar(i, j) = trow.ChildNodes(j - 1).innerText
    Next
    i = i + 1
Next

[a1].Resize(UBound(ar), maxCol - 1) = ar
End Sub
 
Is it something which can be considered as well? I don't know the to-be output. However, took a guess. I have tried with the document attached in post 5.
Code:
Sub LoadFile()
    Dim strContent As String, HTML As New HTMLDocument, post As Object

    Open "C:\Users\ar\Desktop\HTML Document.txt" For Binary As #1
    strContent = Space$(LOF(1))
    Get #1, , strContent
    Close #1
    HTML.body.innerHTML = strContent

    For Each post In HTML.getElementsByClassName("title")
        r = r + 1: Cells(r, 1) = post.innerText
        Cells(r, 2) = post.NextSibling.innerText
    Next post
End Sub
 
Back
Top