• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Trouble Parsing Image and Price

shahin

Active Member
Hi there! I've made a parser which is pulling 3 categories excluding Image and price out of five categories. I could not find any "img" tag there. Moreover the page contains lots of data but only 25 are being scraped. The page is loading slowly to display the full content though! Here is the code I've written. Any help would be a great help to me. Thanks in advance.

Code:
Sub VariousWines()
Const URL = "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim topics As Object, posts As Object, topic As HTMLHtmlElement, post As HTMLHtmlElement
Dim i As Long, x As Long

x = 2

http.Open "GET", URL, False
http.send
html.body.innerHTML = http.responseText

Set topics = html.getElementsByClassName("card card-lg")

    For i = 0 To topics.Length - 1
    Set topic = topics(i)
        Cells(x, 1).Value = topic.getElementsByClassName("bold")(0).innerText
        Cells(x, 2).Value = topic.getElementsByClassName("text-block wine-card__region")(0).innerText
        Cells(x, 3).Value = topic.getElementsByClassName("text-inline-block light average__number")(0).innerText
        Cells(x, 4).Value = topic.getElementsByClassName("wine-price-value")(0).innerText
        On Error Resume Next
        Cells(x, 5).Value = topic.getElementsByClassName("wine-card__image")(0).getElementsByTagName("img")(0).src
        x = x + 1
    Next i
  
End Sub
 
Last edited:
No "img", "src" I could find in the element. FYC, html elements for the Image portion is given below:

Code:
<div class="wine-card__image-wrapper">
<a href="/wineries/tschida/wines/angerhof-eiswein-gruner-veltliner-2012">
<figure class="wine-card__image" style="background-image: url(//images.vivino.com/thumbs/qlER3oggQVKh1FZn7YGxZg_375x500.jpg)">
<div class="image-inner"></div>
</figure>
</a>
</div>
 
Hi Marc L, nice to find you in the loop. If I could understand you properly then you meant this. It is not working either.

Code:
Set topics = html.getElementsByClassName("card card-lg")

    For i = 0 To topics.Length - 1
    Set topic = topics(i)
        Cells(x, 1).Value = topic.getElementsByClassName("wine-price-value")(0).innerText
        Cells(x, 2).Value = topic.getElementsByClassName("wine-card__image-wrapper")(0).getElementsByTagName("img")(0).src
        x = x + 1
    Next i
 

So like it does not work on your side, you must change your strategy !

As per example search for a tag name …
 
Thanks Marc L for your kind reply. In case of image I have clearly messed up with my code but what about "The Price" portion? Have I not rightly written it?

Code:
Set topics = html.getElementsByClassName("wine-price-value")
For Each topic In topics
    Cells(x, 1).Value = topic.innerText
    x = x + 1
Next topic
 
FYC, Here is the html elements for the price:
Code:
<div class="text-inline-block header-large light wine-price average__number">
<span class="wine-price-prefix">$</span>
<span class="wine-price-value">29.98</span>
<span class="wine-price-suffix"></span>
</div>
 
For image... since URL is within style element in Class, easiest way is to get innerHTML of the wrapper and do string manipulation.
Ex. To extract entire string in Style="xxx"
Code:
Cells(x, 5).Value = Split(Split(topic.getElementsByClassName("wine-card__image-wrapper")(0).innerHTML, "style=""")(1), """>")

As for price. It's likely filled by script. When page is loaded in IE, values are updated via script. However, responseText will not fire script to fill those values. Thus they are empty.

Quick and dirty workaround is to load responseText to IE and read values from there.
 
Thanks sir Chihiro for the demo. Btw, I have got the basic portion from some NATHAN and edited the code for working nicely to parse image. Here it is:

Code:
Set topics = html.getElementsByClassName("wine-card__image-wrapper")
For Each topic In topics
 Cells(x, 1) = "http:" & Replace(Replace(topic.Children(0).Children(0).Style.backgroundImage, "url(", ""), ")", "")
 x = x + 1
Next topic
 
Another chunk of html elements out of which I can't sift through the Img and src to get the Image. Here is the portion:

Code:
<div class="swipe" style="visibility: visible;"><div class="swipe-wrap" style="width: 1500px;"><div data-index="0" style="width: 300px; left: 0px; transition-duration: 0ms; transform: translate(0px, 0px);"><img alt=""class="" src="https://images.craigslist.org/00l0l_hW1X6OeifVZ_300x300.jpg"></div><div data-index="1" style="width: 300px; left: -300px; transition-duration: 0ms;
transform: translate(300px, 0px);"></div><div data-index="2" style="width: 300px; left: -600px; transition-duration: 0ms; transform: translate(300px, 0px);"></div><div data-index="3" style="width: 300px; left: -900px; transition-duration: 0ms; transform: translate(300px, 0px);"></div>'<div data-index="4" style="width: 300px; left: -1200px; transition-duration: 0ms; transform: translate(-300px, 0px);"></div></div></div>
 
Ain't it possible to parse images using REGEX in this case? Code and html elements are pasted below:
Code:
Code:
Set topics = html.getElementsByClassName("result-image gallery")
For Each topic In topics
     Cells(x,1)= topic.getElementsByTagName("img")(0).src
     x = x +1
Next topic
Elements:
Code:
<a href="/mnh/atq/6033903864.html"class="result-image gallery" data-ids="1:00l0l_auIVAPKuweh"><img alt=""class="" src="https://images.craigslist.org/00l0l_auIVAPKuweh_300x300.jpg"><span class="result-price">$120</span></a>
 
Dear Marc L, you must be right with what you said but the thing is that I am not good at understanding your point oftentimes because of my incapability. However, in this case even if I use the basic code to get images from a page, it fetches only two "gif" files but not a single "jpg". I tried with:

PageLink: https://newyork.craigslist.org/search/ata

Code:
Set topics = html.getElementsByTagName("img")
For Each topic In topics
    Cells(x, 1) = topic.src
    x = x + 1
Next topic
 

I deleted my previous code thinking I was wrong 'cause of late binding
for objects but trying now early binding I get same result :
getElementsByClassName collection is always empty …

Post a complete code, it's far easier to try to follow you !
And indicating added references.

As a reminder : a valid code for a webpage often is not on another one !
At each new webpage, you must start from blank and study its html code …
 
@Marc L,
Hi!! Terribly sorry to say that I was not around my pc that is why I got late to respond. I might have mistaken choosing the right class. Anyways, here is the code :

Code:
Sub CraigImg()
Const URL = "https://newyork.craigslist.org/search/ata"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim topics As Object, topic As Object
Dim x As Long
x = 2
http.Open "GET", URL, False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("result-image gallery")
    For Each topic In topics
        Cells(x, 1) = topic.getElementsByTagName("img")(0).src
        x = x + 1
    Next topic
End Sub
 

So save responseText to a text file and check it over your code
- or just use InStr function or Like operator -
it will be easier to understand !

Once understood, well read Chihiro's post #8 …
 
It was hard to scrape "price" from "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4=" this site until I find selenium webdriver. It scrapes the "Image" and "Price" very smoothly. Here is the code:
Before running this code make sure "seleniumWrapper type library" is checked in the reference library.

Code:
Sub ParsingVivino()
Dim driver As SeleniumWrapper.WebDriver
Dim posts As Object, post As Object

Set driver = New SeleniumWrapper.WebDriver
driver.Start "Phantomjs", "https://www.vivino.com"
driver.get "/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
Set posts = driver.findElementsByClassName("card-lg")

For Each post In posts
    i = i + 1
    Cells(i, 1) = post.findElementByClassName("link-color-alt-grey").Text
    Cells(i, 2) = post.findElementByClassName("wine-price-value").Text
    Cells(i, 3) = Split(Split(post.findElementByClassName("wine-card__image").getAttribute("style"), "background-image: url(")(1), ");")(0)
Next post
Set driver = Nothing: Set posts = Nothing
End Sub
 
Last edited:
It also solves the issue that I was facing while parsing Images from craigslist. Here is the code:
Code:
Sub CraigslistImage()
Dim driver As SeleniumWrapper.WebDriver
Dim posts As Object, post As Object

Set driver = New SeleniumWrapper.WebDriver
driver.Start "Phantomjs", "https://newyork.craigslist.org/search"
driver.get "/ata"
Set posts = driver.findElementsByClassName("swipe-wrap")

For Each post In posts
    i = i + 1
    Cells(i, 1) = post.findElementByTagName("img").getAttribute("src")
Next post
Set driver = Nothing: Set posts = Nothing
End Sub
 
Back
Top