Trouble Parsing Image and Price

shahin · Mar 5, 2017

Hi there! I've written a parser that pulls three of the five fields I need; the image and the price are the two it misses. I could not find any "img" tag in the markup. Moreover, the page contains lots of data, but only 25 items are being scraped, although the page does load slowly before it displays the full content! Here is the code I've written. Any help would be greatly appreciated. Thanks in advance.

Code:

Sub VariousWines()
Const URL = "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim topics As Object, posts As Object, topic As HTMLHtmlElement, post As HTMLHtmlElement
Dim i As Long, x As Long

x = 2

http.Open "GET", URL, False
http.send
html.body.innerHTML = http.responseText

Set topics = html.getElementsByClassName("card card-lg")

    For i = 0 To topics.Length - 1
    Set topic = topics(i)
        Cells(x, 1).Value = topic.getElementsByClassName("bold")(0).innerText
        Cells(x, 2).Value = topic.getElementsByClassName("text-block wine-card__region")(0).innerText
        Cells(x, 3).Value = topic.getElementsByClassName("text-inline-block light average__number")(0).innerText
        Cells(x, 4).Value = topic.getElementsByClassName("wine-price-value")(0).innerText
        On Error Resume Next
        Cells(x, 5).Value = topic.getElementsByClassName("wine-card__image")(0).getElementsByTagName("img")(0).src
        x = x + 1
    Next i
  
End Sub

shahin · Mar 6, 2017

I could not find any "img" or "src" in the element. FYC, the HTML for the image portion is given below:

Code:

<div class="wine-card__image-wrapper">
<a href="/wineries/tschida/wines/angerhof-eiswein-gruner-veltliner-2012">
<figure class="wine-card__image" style="background-image: url(//images.vivino.com/thumbs/qlER3oggQVKh1FZn7YGxZg_375x500.jpg)">
<div class="image-inner"></div>
</figure>
</a>
</div>

Marc L · Mar 6, 2017

Hi !

So just mod your code to use the correct class name …

shahin · Mar 6, 2017

Hi Marc L, nice to find you in the loop. If I understood you properly, you meant this. It is not working either.

Code:

Set topics = html.getElementsByClassName("card card-lg")

    For i = 0 To topics.Length - 1
    Set topic = topics(i)
        Cells(x, 1).Value = topic.getElementsByClassName("wine-price-value")(0).innerText
        Cells(x, 2).Value = topic.getElementsByClassName("wine-card__image-wrapper")(0).getElementsByTagName("img")(0).src
        x = x + 1
    Next i

Marc L · Mar 6, 2017

So, as it does not work on your side, you must change your strategy!

For example, search by tag name …

shahin · Mar 6, 2017

Thanks Marc L for your kind reply. For the image I have clearly messed up my code, but what about the price portion? Have I not written it correctly?

Code:

Set topics = html.getElementsByClassName("wine-price-value")
For Each topic In topics
    Cells(x, 1).Value = topic.innerText
    x = x + 1
Next topic

shahin · Mar 6, 2017

FYC, here are the HTML elements for the price:

Code:

<div class="text-inline-block header-large light wine-price average__number">
<span class="wine-price-prefix">$</span>
<span class="wine-price-value">29.98</span>
<span class="wine-price-suffix"></span>
</div>

Chihiro · Mar 6, 2017

For the image: since the URL sits inside the style attribute of the element, the easiest way is to get the innerHTML of the wrapper and do string manipulation.
E.g., to extract the entire string in style="xxx":

Code:

Cells(x, 5).Value = Split(Split(topic.getElementsByClassName("wine-card__image-wrapper")(0).innerHTML, "style=""")(1), """>")(0)

As for the price: it is likely filled in by script. When the page is loaded in IE, the values are updated via script. However, responseText is only the raw HTML; no script runs on it to fill in those values, so they are empty.

A quick and dirty workaround is to load responseText into IE and read the values from there.
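
A minimal sketch of a closely related workaround, under stated assumptions: instead of pushing responseText into IE, it lets IE navigate to the page itself so the page's scripts run, then reads the prices from the live document. The procedure name ReadPricesViaIE, the five-second pause, and the output column are hypothetical choices; it also assumes Internet Explorer is installed and renders the page in a document mode that supports getElementsByClassName.

```vba
Sub ReadPricesViaIE()
    Const URL As String = "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
    Dim ie As Object, prices As Object
    Dim i As Long, x As Long

    ' Late binding, so no extra reference is needed.
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = False
    ie.navigate URL

    ' Wait until the browser reports the document as complete (READYSTATE_COMPLETE = 4).
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
    Loop

    ' Assumption: a fixed pause gives the page's script enough time to fill in the prices.
    Application.Wait Now + TimeValue("00:00:05")

    Set prices = ie.document.getElementsByClassName("wine-price-value")
    x = 2
    For i = 0 To prices.Length - 1
        Cells(x, 4).Value = prices.Item(i).innerText
        x = x + 1
    Next i

    ie.Quit
    Set ie = Nothing
End Sub
```

If the column stays empty, the pause is probably too short or the class name differs in the rendered page; checking the live document in the browser's developer tools is the quickest way to tell.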

Marc L · Mar 6, 2017

As always: observe how the webpage works before starting any code …

Chihiro · Mar 6, 2017

Yep. That's usually the key to scraping web sites, more so than VBA skill.

@shahin
You can see in the image how the price info is brought in.

shahin · Mar 6, 2017

Thanks sir Chihiro for the demo. Btw, I got the basic portion from someone called NATHAN and edited the code so that it parses the image nicely. Here it is:

Code:

Set topics = html.getElementsByClassName("wine-card__image-wrapper")
For Each topic In topics
 Cells(x, 1) = "http:" & Replace(Replace(topic.Children(0).Children(0).Style.backgroundImage, "url(", ""), ")", "")
 x = x + 1
Next topic
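
The chained Replace calls above assume the style value is exactly url(...) with no quotes. As a hedged sketch, the same step can be wrapped in a small helper that also tolerates quoted addresses and a missing url( part. The function name UrlFromCss is hypothetical, and prefixing "http:" to protocol-relative addresses simply mirrors the code above.

```vba
' Returns the address inside a CSS value such as url(//host/a.jpg) or url("//host/a.jpg").
' Returns an empty string when no complete url(...) part is found.
Function UrlFromCss(ByVal cssValue As String) As String
    Dim startPos As Long, endPos As Long, inner As String

    startPos = InStr(1, cssValue, "url(", vbTextCompare)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len("url(")

    endPos = InStr(startPos, cssValue, ")")
    If endPos = 0 Then Exit Function

    inner = Trim$(Mid$(cssValue, startPos, endPos - startPos))
    ' Drop optional single or double quotes around the address.
    inner = Replace(Replace(inner, """", ""), "'", "")

    ' Protocol-relative addresses (//host/...) get a scheme, as in the code above.
    If Left$(inner, 2) = "//" Then inner = "http:" & inner
    UrlFromCss = inner
End Function
```

Inside the loop it would be used as Cells(x, 1) = UrlFromCss(topic.Children(0).Children(0).Style.backgroundImage).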

shahin · Mar 6, 2017

Here is another chunk of HTML in which I can't work my way down to the "img" and "src" to get the image. Here is the portion:

Code:

<div class="swipe" style="visibility: visible;"><div class="swipe-wrap" style="width: 1500px;"><div data-index="0" style="width: 300px; left: 0px; transition-duration: 0ms; transform: translate(0px, 0px);"><img alt=""class="" src="https://images.craigslist.org/00l0l_hW1X6OeifVZ_300x300.jpg"></div><div data-index="1" style="width: 300px; left: -300px; transition-duration: 0ms;
transform: translate(300px, 0px);"></div><div data-index="2" style="width: 300px; left: -600px; transition-duration: 0ms; transform: translate(300px, 0px);"></div><div data-index="3" style="width: 300px; left: -900px; transition-duration: 0ms; transform: translate(300px, 0px);"></div><div data-index="4" style="width: 300px; left: -1200px; transition-duration: 0ms; transform: translate(-300px, 0px);"></div></div></div>

shahin · Mar 11, 2017

Isn't it possible to parse the images using REGEX in this case? The code and the HTML elements are pasted below:

Code:

Set topics = html.getElementsByClassName("result-image gallery")
For Each topic In topics
     Cells(x,1)= topic.getElementsByTagName("img")(0).src
     x = x +1
Next topic

Elements:

Code:

<a href="/mnh/atq/6033903864.html"class="result-image gallery" data-ids="1:00l0l_auIVAPKuweh"><img alt=""class="" src="https://images.craigslist.org/00l0l_auIVAPKuweh_300x300.jpg"><span class="result-price">$120</span></a>
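
On the REGEX question, a minimal sketch, assuming the late-bound VBScript.RegExp object is available (it ships with Windows). The function name ImgSources is hypothetical. Note that a regex only sees the text it is given, so it finds "img" tags only if they are actually present in that string.

```vba
' Collects the src value of every <img> tag found in a raw HTML string.
Function ImgSources(ByVal rawHtml As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Group 1 captures whatever sits between the quotes after src=.
    re.Pattern = "<img\b[^>]*?\bsrc\s*=\s*[""']([^""']*)[""']"

    For Each m In re.Execute(rawHtml)
        result.Add m.SubMatches(0)
    Next m

    Set ImgSources = result
End Function
```

Called on the element shown above, it returns a collection whose single item is the .jpg address; called on http.responseText, it returns only what the server really sent.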

shahin · Mar 11, 2017

Dear Marc L, you must be right in what you said, but I often fail to grasp your point; that is my own shortcoming. However, in this case, even the most basic code for getting images from a page fetches only two "gif" files and not a single "jpg". I tried with:

PageLink: https://newyork.craigslist.org/search/ata

Code:

Set topics = html.getElementsByTagName("img")
For Each topic In topics
    Cells(x, 1) = topic.src
    x = x + 1
Next topic

Marc L · Mar 11, 2017

I deleted my previous code, thinking I was wrong because of late binding for the objects, but trying early binding now I get the same result: the getElementsByClassName collection is always empty …

Post your complete code and indicate the references you added; it's far easier to follow you that way!

As a reminder: code that is valid for one webpage often is not for another! With each new webpage, you must start from scratch and study its HTML code …

shahin · Mar 11, 2017

@Marc L,
Hi!! Terribly sorry, I was not around my PC, which is why I am late to respond. I might have chosen the wrong class. Anyway, here is the code:

Code:

Sub CraigImg()
Const URL = "https://newyork.craigslist.org/search/ata"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument
Dim topics As Object, topic As Object
Dim x As Long
x = 2
http.Open "GET", URL, False
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementsByClassName("result-image gallery")
    For Each topic In topics
        Cells(x, 1) = topic.getElementsByTagName("img")(0).src
        x = x + 1
    Next topic
End Sub

Marc L · Mar 12, 2017

Does this code work on your side ?

shahin · Mar 12, 2017

Nope. Not at all.

shahin · Mar 12, 2017

Cross posted in: http://www.ozgrid.com/forum/showthread.php?t=203269

Marc L · Mar 12, 2017

So save responseText to a text file and check it against your code (or just use the InStr function or the Like operator); it will be easier to understand!

Once that is understood, reread Chihiro's post #8 …
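
A minimal sketch of that check, assuming late binding to MSXML2.XMLHTTP.6.0 so no reference is needed. The procedure name DumpResponse and the file name response.txt in the TEMP folder are hypothetical choices.

```vba
Sub DumpResponse()
    Const URL As String = "https://newyork.craigslist.org/search/ata"
    Dim http As Object, body As String, f As Integer

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", URL, False
    http.send
    body = http.responseText

    ' Save the raw response so it can be searched in a text editor.
    f = FreeFile
    Open Environ$("TEMP") & "\response.txt" For Output As #f
    Print #f, body
    Close #f

    ' Quick checks in the Immediate window without opening the file.
    Debug.Print "Contains an <img tag: "; InStr(1, body, "<img", vbTextCompare) > 0
    Debug.Print "Contains the class name: "; body Like "*result-image gallery*"
End Sub
```

If the class name is reported as present but the tag being looked for is not, the missing part is added later by script and cannot be read from responseText.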

shahin · Mar 20, 2017

Hi there!! Found the solution.
Craigslist renders its images from the data-ids attribute, so the "img" tags are not in the response of a plain HTTP request.
Stripping the prefix before the colon from a data-ids value and concatenating the rest with the image URL gives the image source.

Refined Image : https://images.craigslist.org/01717_aF0XytFCiTZ_300x300.jpg

Format for the image used in craigslist : https://images.craigslist.org/{data-id}_300x300.jpg
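
A minimal sketch of that concatenation, based on the sample element posted earlier (data-ids="1:00l0l_auIVAPKuweh"). The function name CraigslistImageUrls is hypothetical, and treating the attribute as a comma-separated list when a listing has several images is an assumption.

```vba
' Turns a data-ids value such as "1:00l0l_auIVAPKuweh" into thumbnail addresses.
Function CraigslistImageUrls(ByVal dataIds As String) As Collection
    Dim result As New Collection
    Dim parts() As String, imgId As String
    Dim i As Long, colonPos As Long

    If Len(dataIds) > 0 Then
        ' Assumption: several ids, if present, are separated by commas.
        parts = Split(dataIds, ",")
        For i = LBound(parts) To UBound(parts)
            imgId = Trim$(parts(i))
            ' Drop the prefix before the colon ("1:" in the sample).
            colonPos = InStr(imgId, ":")
            If colonPos > 0 Then imgId = Mid$(imgId, colonPos + 1)
            If Len(imgId) > 0 Then
                result.Add "https://images.craigslist.org/" & imgId & "_300x300.jpg"
            End If
        Next i
    End If

    Set CraigslistImageUrls = result
End Function
```

With the HTMLDocument approach used earlier in the thread, the input would come from topic.getAttribute("data-ids") on each "result-image gallery" element.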

Marc L · Mar 20, 2017

So now you can add images as hyperlinks on cells …

shahin · Mar 20, 2017

Yep, successfully.

shahin · Apr 2, 2017

It was hard to scrape the price from "https://www.vivino.com/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4=" until I found Selenium WebDriver. It scrapes both the image and the price very smoothly. Here is the code:
Before running this code, make sure "SeleniumWrapper Type Library" is checked under Tools > References in the VBA editor.

Code:

Sub ParsingVivino()
Dim driver As SeleniumWrapper.WebDriver
Dim posts As Object, post As Object
Dim i As Long   ' output row counter

Set driver = New SeleniumWrapper.WebDriver
driver.Start "Phantomjs", "https://www.vivino.com"
driver.get "/explore?e=eJzLLbI1UcvNzLM1NFDLTaywNTFQS660LS1WK7ZNTlQrS7YtKSpNVSsviY61NYRQRhDKGEKZQChzqJwJABUoGa4%3D"
Set posts = driver.findElementsByClassName("card-lg")

For Each post In posts
    i = i + 1
    ' Name and price as rendered by the browser, after the page's scripts have run
    Cells(i, 1) = post.findElementByClassName("link-color-alt-grey").Text
    Cells(i, 2) = post.findElementByClassName("wine-price-value").Text
    ' Image address: the text between "background-image: url(" and ");" in the inline style
    Cells(i, 3) = Split(Split(post.findElementByClassName("wine-card__image").getAttribute("style"), "background-image: url(")(1), ");")(0)
Next post
Set driver = Nothing: Set posts = Nothing
End Sub

shahin · Apr 2, 2017

It also solves the issue I was facing while parsing images from Craigslist. Here is the code:

Code:

Sub CraigslistImage()
Dim driver As SeleniumWrapper.WebDriver
Dim posts As Object, post As Object
Dim i As Long   ' output row counter

Set driver = New SeleniumWrapper.WebDriver
driver.Start "Phantomjs", "https://newyork.craigslist.org/search"
driver.get "/ata"
Set posts = driver.findElementsByClassName("swipe-wrap")

For Each post In posts
    i = i + 1
    ' The "img" tag exists here because the browser has already run the page's scripts
    Cells(i, 1) = post.findElementByTagName("img").getAttribute("src")
Next post
Set driver = Nothing: Set posts = Nothing
End Sub
