• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How to shake off duplicate links while parsing web data?

shahin

Active Member
I've written some script in vba to parse the links leading to the next page from a torrent site. My script is able to scrape them. However, the issue I'm facing is that couple of duplicate links coming along in the result. If I use "Length" property I can handle it by harcoding some numbers but few websites have lots of pagination links and in that cases it is definitely tedious to handle duplicates with hardcoded numbers to the "Length". My question is whether there is any technique with which I can parse only the unique links?

I've tried with:

Code:
Sub TorrentData()
    Dim http As New XMLHTTP60, html As New HTMLDocument, post As Object

    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With
  
    For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(post, "page") > 0 Then
           x = x + 1: Cells(x, 1) = post.href
        End If
    Next post
End Sub

The one I do not wish to go with because of hardcoding number to the length property. It can handle the duplicates in this case, though!

Code:
Sub TorrentData()
    Dim http As New XMLHTTP60, html As New HTMLDocument

    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With
  
    With html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        For N = 0 To .Length - 3
            If InStr(.item(N).href, "page") > 0 Then
               x = x + 1: Cells(x, 1) = .item(N).href
            End If
        Next N
    End With
End Sub

Pictures with duplicate links:
 

Attachments

  • Untitled.jpg
    Untitled.jpg
    40.9 KB · Views: 9
Last edited:
Thanks sir, for your suggestion and the link. I never used dictionary in my script so I'm totally behind from it.
 
I've somehow written my script using dictionary and it is giving me the perfect result. However, i can't understand how should I arrange them so that it does look like a script other than mine which is definitely a messy one in this case.

Code:
Sub TorrentData()
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim dict As Object, key, val
   
    Set dict = CreateObject("Scripting.Dictionary")
   
    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With
   
    For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(post.href, "page") > 0 Then
            key = post.href
            If Not dict.Exists(key) Then
                dict.Add key, val
            End If
        End If
    Next post
   
    For Each key In dict.Keys
       x = x + 1: Cells(x, 1) = key
    Next key
End Sub
 
Try something like below.

Code:
Sub TorrentData()
    Dim http As New XMLHTTP60, html As New HTMLDocument
   
    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With
   
    With CreateObject("Scripting.Dictionary")
        For Each Post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
            If InStr(Post.href, "page") > 0 Then
                .Item(Post.href) = 1
            End If
        Next Post
        Cells(1, 1).Resize(.Count) = Application.Transpose(.Keys)
    End With
End Sub
 
Absolutely perfect!! However, if I want to use those links in another subroutine to make it even bigger, I got stuck seeing that (.count) property. How can I place a variable there to store the links? I meant, if I take out "Cells(1, 1).Resize(.Count)" and place there "links" as string variable, will that links variable be able to store those newly produced hrefs?
 
To be more specific, you can take here a look:
Code:
Sub TorrentData()
    Const baselink = "https://yts.ag"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim htm As New HTMLDocument, links As String, x As Long
    Dim dict As Object, key, val
  
    Set dict = CreateObject("Scripting.Dictionary")
    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With
  
   GetData x, html
  
    For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(post.href, "page") > 0 Then
            key = post.href
            If Not dict.Exists(key) Then
                dict.Add key, val
            End If
        End If
    Next post
    For Each key In dict.Keys
       links = baselink & Split(key, ":")(1)     'I'm talking about this line
  
    With http
        .Open "GET", links, False
        .send
        htm.body.innerHTML = .responseText
    End With
  
    GetData x, htm
  
    Next key
End Sub

Sub GetData(ByRef x As Long, ByRef htm As HTMLDocument)
    Dim post As HTMLHtmlElement
  
    For Each post In htm.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByClassName("browse-movie-title")
            If .Length Then x = x + 1: Cells(x, 1) = .item(0).innerText
        End With
    Next post
End Sub
 
Last edited:
Try...
Code:
Sub TorrentData()
    Const baselink = "https://yts.ag"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim htm As New HTMLDocument, x As Long
    Dim dic As Object, Key
   
    Set dic = CreateObject("Scripting.Dictionary")
    With http
        .Open "GET", "https://yts.ag/browse-movies", False
        .send
        html.body.innerHTML = .responseText
    End With
 
  GetData x, html
 
    For Each post In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(post.href, "page") > 0 Then
            dic.Item(post.href) = baselink & Split(post.href, ":")(1)
        End If
    Next post
    For Each Key In dic.Keys
        With http
            .Open "GET", dic.Item(Key), False
            .send
            htm.body.innerHTML = .responseText
        End With
        GetData x, htm
    Next Key
End Sub
Sub GetData(ByRef x As Long, ByRef htm As HTMLDocument)
    Dim post As HTMLHtmlElement
 
    For Each post In htm.getElementsByClassName("browse-movie-bottom")
        With post.getElementsByClassName("browse-movie-title")
            If .Length Then x = x + 1: Cells(x, 1) = .Item(0).innerText
        End With
    Next post
End Sub
 
Back
Top