Scrape webpage fails when using QuerySelectorAll

YasserKhalil · Mar 31, 2020

Hello everyone
I am working on the following code and I am stuck at one point. I have commented that part

Code:

Sub MyTest()
    Const sURL As String = "https://www.gelbeseiten.de/Suche/Ambulante%20Pflegedienste/Bundesweit"
    Dim http As MSXML2.XMLHTTP60, html As HTMLDocument

    Set http = New MSXML2.XMLHTTP60
    Set html = New MSHTML.HTMLDocument
    'sURL = bURL & sASIN

    With http
        .Open "Get", sURL, False
        .send
        html.body.innerHTML = .responseText
        
       'ExportHTML .responseText
    End With

Dim post As Object

Set post = html.querySelectorAll(".mod-Treffer")

Dim i As Long, r As Long
Range("A1").Resize(1, 3).Value = Array("Title", "Phone", "Email")
r = 2
'Debug.Print html.getElementById("#transform_wrapper").getElementsByTagName("article")(0).innerHTML
For i = 0 To post.Length - 1
    Cells(r, 1).Value = post.Item(i).getElementsByTagName("h2")(0).innerText
    Cells(r, 2).Value = post.Item(i).getElementsByTagName("Address")(0).getElementsByTagName("p")(1).innerText
    
    'I am stuck with extracting the email
    
    Dim emailObj As Object
    If r = 3 Then Stop
    
    Dim aNodeList As Object, j As Long
    
    
    '---------------------------------------------------------
    'I AM STUCK AT THIS LINE
Set aNodeList = post.Item(i).querySelectorAll(".contains-icon-email")(0)
'---------------------------------------------------------

For j = 0 To aNodeList.Length - 1
    Debug.Print aNodeList.Item(j).innerText '< === as check
Next j
    
    
    
    'Set emailObj = post.Item(i).getElementsByClassName("contains-icon-email")
    'Set emailObj = post.Item(i).getElementsByTagName("a")(1)
'    If Not emailObj Is Nothing Then
'     If InStr(post.Item(i).getElementsByTagName("a")(1).href, "mailto:") Then
'        Cells(r, 3).Value = Split(Split(post.Item(i).getElementsByTagName("a")(1).href, "mailto:")(1), "?")(0)
'    End If
    
End If

   Set emailObj = Nothing: r = r + 1
Next i


    'Stop
    
        
    'GetMerchantURL = Replace(html.getElementById("merchant-info").getElementsByTagName("a")(0).href, "about:", "https://www.amazon.de")
    Set http = Nothing: Set html = Nothing
End Sub

The line that I am stuck at is

Code:

Set aNodeList = post.Item(i).querySelectorAll(".contains-icon-email")(0)

How to assign or refer to the email element ..?

Marc L · Mar 31, 2020

Hi,

for only the first element just use directly querySelector rather than querySelectorAll …

shahin · Apr 1, 2020

Try the following to get the aforesaid fields from that site:

Code:

Sub FetchContent()
    Const sURL$ = "https://www.gelbeseiten.de/Suche/Ambulante Pflegedienste/Bundesweit"
    Dim I&, R&, ws As Worksheet
    Dim HtmlDoc As HTMLDocument: Set HtmlDoc = New HTMLDocument
    Dim Http As XMLHTTP60: Set Http = New XMLHTTP60
    Dim Html As HTMLDocument: Set Html = New HTMLDocument
   
    With ThisWorkbook
        Set ws = .Worksheets("Sheet1")
    End With

    With Http
        .Open "Get", sURL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        Html.body.innerHTML = .responseText
    End With

    With Html.querySelectorAll("[data-teilnehmerid]")
        For I = 0 To .Length - 1
            HtmlDoc.body.innerHTML = .item(I).outerHTML
            R = R + 1: ws.Cells(R, 1) = HtmlDoc.querySelector("h2[data-wipe-name='Titel']").innerText
            ws.Cells(R, 2) = HtmlDoc.querySelector("p[class*='phoneNumber']").innerText
            On Error Resume Next
            ws.Cells(R, 3) = Split(Split(HtmlDoc.querySelector("a[href^='mailto:']").getAttribute("href"), "?")(0), "mailto:")(1)
            On Error GoTo 0
        Next I
    End With
End Sub

YasserKhalil · Apr 1, 2020

That's amazing my friend Shahin. Thank you very much for the gift

shahin · Apr 1, 2020

It's nice to find you in the loop after a long time.

YasserKhalil · Apr 1, 2020

I am here and there and everywhere

shahin · Apr 1, 2020

You might have already got rid of `On Error Resume Next` from the script. However, I thought to come up with the rectified portion anyway for future readers:

Code:

Dim oEmail As Object

With Html.querySelectorAll("[data-teilnehmerid]")
    For I = 0 To .Length - 1
        HtmlDoc.body.innerHTML = .item(I).outerHTML
        R = R + 1: ws.Cells(R, 1) = HtmlDoc.querySelector("h2[data-wipe-name='Titel']").innerText
        ws.Cells(R, 2) = HtmlDoc.querySelector("p[class*='phoneNumber']").innerText
        Set oEmail = HtmlDoc.querySelector("a[href^='mailto:']")
        If Not oEmail Is Nothing Then
            ws.Cells(R, 3) = Split(Split(oEmail.getAttribute("href"), "?")(0), "mailto:")(1)
        End If
    Next I
End With

YasserKhalil · Apr 1, 2020

Thanks a lot. As for testing purposes, I found the first one that you used the skip statement is faster than the newer one.

Scrape webpage fails when using QuerySelectorAll

YasserKhalil

Well-Known Member

Marc L

Excel Ninja

shahin

Active Member

YasserKhalil

Well-Known Member

shahin

Active Member

YasserKhalil

Well-Known Member

shahin

Active Member

YasserKhalil

Well-Known Member