How to get newly produced urls outside of a for loop?

shahin · Dec 8, 2017

I've written a macro using InternetExplorer to get all the movie URLs from a torrent site. When I execute the macro, it does fetch all the required URLs. However, I want to have those 20 URLs available outside the for loop. When I move the print of the variable "n_url" (containing the newly scraped links) outside the for loop, I get only the last URL. How can I print "n_url" outside the loop and still get all the URLs?

Code:

Sub torrent_info()
    Dim IE As New InternetExplorer, html As HTMLDocument, post As Object

    With IE
        .Visible = False
        .navigate "https://yts.am/browse-movies"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
    End With

    For Each post In html.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByTagName("a")
            If .Length Then n_url = .Item(0).href
        End With
'        Debug.Print n_url  ''it can fetch all the 20 urls from that page
    Next post
    Debug.Print n_url  ''if printed out, it will only fetch the last url
  
    IE.Quit
End Sub

YasserKhalil · Dec 8, 2017

Hello my friend
Try this line instead

Code:

If .Length Then n_url = n_url & IIf(n_url = "", "", vbCrLf) & .Item(0).href

shahin · Dec 8, 2017

Yes, it is working. You did a very good job. If anything is unclear, I'll get back to you. Thanks a lot.

YasserKhalil · Dec 8, 2017

You're welcome my friend. Glad I can offer some help
Regards

shahin · Dec 8, 2017

It seems I've found another way to deal with this problem. However, the only thing I need to manage is the first trailing space:

Code:

Sub torrent_info()
    Dim IE As New InternetExplorer, html As HTMLDocument, post As Object, n_url As String

    With IE
        .Visible = False
        .navigate "https://yts.am/browse-movies"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
    End With
    
    For Each post In html.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByTagName("a")
            If .Length Then n_url = n_url & vbNewLine & .Item(0).href
        End With
    Next post
    MsgBox n_url
  
    IE.Quit
End Sub

YasserKhalil · Dec 8, 2017

In post # 2 just replace vbCrLf with vbNewLine .. That's all

shahin · Dec 8, 2017

@YasserKhalil, the code you provided doesn't contain any flaw, I mean any trailing space, but the way I've written it produces a trailing space that I was hoping to find a workaround for. Look at the expression below.

Code:

If .Length Then n_url = n_url & vbNewLine & .Item(0).href

YasserKhalil · Dec 8, 2017

That's because on the first pass through the loop the string variable "n_url" is still empty, and you add vbNewLine to it before the first URL, so the string starts with an extra line break (the "trailing space" you see).
In mine I used IIf as a simple check: if "n_url" is empty, skip vbNewLine; if it is not empty, add vbNewLine.
Hope that helps and sorry for my English
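
The same idea can be tried without the browser. This is only a minimal sketch, assuming made-up sample links in place of the scraped hrefs; the names `join_links` and `demo_join_links` are illustrative and not from the thread:

```vba
' Builds one string from a list of links, adding the separator
' only when the result already holds something.
Function join_links(links As Variant) As String
    Dim link As Variant, n_url As String

    For Each link In links
        n_url = n_url & IIf(n_url = "", "", vbNewLine) & link
    Next link
    join_links = n_url
End Function

Sub demo_join_links()
    Dim n_url As String

    ' Hypothetical sample data standing in for .Item(0).href
    n_url = join_links(Array("https://example.com/a", "https://example.com/b"))

    Debug.Print n_url                       ' both links, one per line
    Debug.Print Left$(n_url, 5) = "https"   ' True: no leading line break
End Sub
```

Because the check happens before each append, the separator only ever lands between two links, never before the first one.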

shahin · Dec 8, 2017

I tried to create a scraper that combines "InternetExplorer" with "XMLHTTP" requests. I wanted to do this because I came across a few sites that rely on JavaScript only on their first page; on the rest of the pages, I mean if you go deeper, there is no barrier to pulling out data with "XMLHTTP" requests. I've already created one that works just fine. The test site does not actually depend on JavaScript, but this method should work when needed.

Code:

Sub torrent_info()
    Dim IE As New InternetExplorer, ihtml As HTMLDocument, elem As Object
    Dim http As New XMLHTTP60, html As New HTMLDocument, post As Object
    Dim n_url As String, link As Variant, item_vault As Variant

    With IE
        .Visible = True
        .navigate "https://yts.am/browse-movies"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set ihtml = .document
    End With

    For Each post In ihtml.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByTagName("a")
'            If .Length Then n_url = n_url & vbNewLine & .Item(0).href
           If .Length Then n_url = n_url & IIf(n_url = "", "", vbCrLf) & .Item(0).href
        End With
    Next post
    IE.Quit
   
    item_vault = Split(n_url, vbNewLine)

    For Each link In item_vault
        With http
            .Open "GET", link, False
            .send
            html.body.innerHTML = .responseText
        End With
        For Each elem In html.getElementsByClassName("hidden-xs")
            With elem.getElementsByTagName("h1")
                If .Length Then r = r + 1: Cells(r, 1) = .Item(0).innerText
            End With
        Next elem
    Next link
End Sub
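
A side note on why the IIf line matters once the string is fed to Split, as in the code above. This is a small sketch with hypothetical sample links (`count_parts` and `demo_split_leading_newline` are illustrative names): a string that starts with vbNewLine splits into one extra, empty first element, and that empty element would then be passed to `.Open "GET", link, False` as the URL.

```vba
' Counts the pieces Split produces for a vbNewLine-separated string.
Function count_parts(ByVal s As String) As Long
    count_parts = UBound(Split(s, vbNewLine)) + 1
End Function

Sub demo_split_leading_newline()
    Dim parts As Variant

    ' Built the "n_url & vbNewLine & href" way: starts with a line break
    parts = Split(vbNewLine & "https://example.com/a" & vbNewLine & "https://example.com/b", vbNewLine)
    Debug.Print UBound(parts) + 1   ' 3 elements
    Debug.Print Len(parts(0))       ' 0: the first element is an empty string

    ' Built with the IIf check: no leading line break
    Debug.Print count_parts("https://example.com/a" & vbNewLine & "https://example.com/b")   ' 2
End Sub
```

On Windows, vbNewLine and vbCrLf are the same two characters, which is why joining with one and splitting on the other still works there.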

shahin · Dec 8, 2017

I'm still hoping for a solution to get rid of that trailing space in my post # 7.

YasserKhalil · Dec 8, 2017

Have you thought of using TRIM worksheet function ?

shahin · Dec 8, 2017

If I only apply "Trim()", it doesn't change anything. What does that worksheet function look like?

YasserKhalil · Dec 8, 2017

Maybe:
Application.WorksheetFunction.Trim(n_url)

Chihiro · Dec 8, 2017

This will get rid of first vbNewLine.

Code:

Replace(n_url, vbNewLine, "", 1, 1)

FYI - It's not a space but vbNewLine... so Trim won't get rid of it. When I checked, I couldn't find any trailing/preceding space.
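
To see the difference in the Immediate window, here is a minimal sketch using a hypothetical sample string (`strip_first_newline` and `demo_strip_first_newline` are illustrative names, not part of the scraper):

```vba
' Removes only the first vbNewLine found in s.
Function strip_first_newline(ByVal s As String) As String
    ' Start searching at character 1 and replace at most 1 occurrence
    strip_first_newline = Replace(s, vbNewLine, "", 1, 1)
End Function

Sub demo_strip_first_newline()
    Dim n_url As String

    ' Sample string that starts with a line break, like the one the loop builds
    n_url = vbNewLine & "https://example.com/a" & vbNewLine & "https://example.com/b"

    ' Trim only removes leading/trailing spaces, so nothing changes here
    Debug.Print Len(Trim(n_url)) = Len(n_url)   ' True

    ' The leading line break is gone, the one between the links stays
    Debug.Print strip_first_newline(n_url)
End Sub
```

One caveat: if the string does not start with vbNewLine (for example when it was built with the IIf check), this Replace would remove the first separator between two links instead.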

shahin · Dec 8, 2017

But how can I use it in my scraper, sir? Thanks in advance.

Chihiro · Dec 8, 2017

... Really, you've got to understand your own code a bit better, as I've indicated in the past. Monitor the variable using the Locals window and the Immediate window, along with the Watch window, to see how it changes with each iteration.

You'll see that an extra vbNewLine is tacked on at the start. You can eliminate it with an If check (a standard If...Then...Else, or IIf), as YasserKhalil indicated. But since you don't want to do it that way...

Only one extra vbNewLine is present in the final string, so you'd do the Replace on the final string (i.e. outside the loop).

Code:

Sub torrent_info()
    Dim IE As New InternetExplorer, html As HTMLDocument, post As Object, n_url As String

    With IE
        .Visible = False
        .navigate "https://yts.am/browse-movies"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
    End With

    For Each post In html.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByTagName("a")
            If .Length Then n_url = n_url & vbNewLine & .Item(0).href
        End With
    Next post
    MsgBox Replace(n_url, vbNewLine, "", 1, 1)

    IE.Quit
End Sub

shahin · Dec 8, 2017

Thanks for the guidance, sir. At this point, the way YasserKhalil showed seems comparatively easier to follow. I thought the way I started would end up with less pain, but it turned out to be the opposite.

shahin · Feb 11, 2018

So, the best way is again the one sir Chihiro suggested the other day (using a dictionary) in another thread that is more or less similar to this issue:

Code:

Sub torrent_info()
    Dim IE As New InternetExplorer, ihtml As HTMLDocument, elem As Object
    Dim http As New XMLHTTP60, html As New HTMLDocument, post As Object
    Dim itemdict As Object, key As Variant

    Set itemdict = CreateObject("Scripting.Dictionary")
  
    With IE
        .Visible = False
        .navigate "https://yts.am/browse-movies"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set ihtml = .document
    End With

    For Each post In ihtml.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByTagName("a")
          If .Length Then itemdict(.Item(0).href) = 1
        End With
    Next post
    IE.Quit
  
    For Each key In itemdict.keys
        With http
            .Open "GET", key, False
            .send
            html.body.innerHTML = .responseText
        End With
        For Each elem In html.getElementsByClassName("hidden-xs")
            With elem.getElementsByTagName("h1")
                If .Length Then R = R + 1: Cells(R, 1) = .Item(0).innerText
            End With
        Next elem
    Next key
End Sub
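
For anyone who wants to try the dictionary part on its own, here is a minimal sketch with hypothetical sample links instead of scraped ones (`unique_links` and `demo_dictionary_links` are illustrative names). It assumes Windows, since "Scripting.Dictionary" comes from the Microsoft Scripting Runtime:

```vba
' Stores each link as a dictionary key, then joins the keys into one string.
Function unique_links(links As Variant) As String
    Dim itemdict As Object, link As Variant

    Set itemdict = CreateObject("Scripting.Dictionary")
    For Each link In links
        ' Assigning to an existing key just overwrites it, so repeats are dropped
        itemdict(CStr(link)) = 1
    Next link

    ' Keys is an array, so Join needs no separator check at all
    unique_links = Join(itemdict.Keys, vbNewLine)
End Function

Sub demo_dictionary_links()
    ' The second sample link is repeated on purpose
    Debug.Print unique_links(Array("https://example.com/a", _
                                   "https://example.com/b", _
                                   "https://example.com/b"))
End Sub
```

With the links held as keys there is no separator to manage inside the loop, and the whole list is still available after the loop ends, which is what the original question asked for.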
