Web Data Scraping

Hi guys,
Could anyone help me resolve an issue? I have a VBA script for scraping data from the Caterpillar website, but the code is not working properly.
The site renders its content with JavaScript, so my code can't get the text from the page.
I have attached a sample file for your reference.

The fields that need to change are:

Cells(i, 4), Cells(i, 5), Cells(i, 6)

Example web links are in the Example sheet.


The output should be:
* Product Title Header
* Description
* Specification
* Compatibility Models


With regards,

Jeyachandran
 

Attachments

  • Data Scraper-Caterpillar XML.V1 - Copy.xlsm (78.6 KB)

The site is based on JSP, so instead of using MSXML2 you'll need to pilot IE11 and scrape the data from the rendered page.

You'll find many examples on this site and elsewhere.

Ex:
Code:
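' Requires references to "Microsoft Internet Controls" and
' "Microsoft HTML Object Library" (Tools > References in the VBA editor).
Sub ScrapePages()
    Dim wb As New InternetExplorer
    Dim doc As MSHTML.HTMLDocument
    Dim sURL As String
    Dim lastrow As Long
    Dim i As Long

    ' Last used row in column B, which holds the URLs
    lastrow = Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row

    For i = 2 To lastrow
        sURL = Sheet1.Cells(i, 2).Value
        wb.Navigate sURL
        ' Wait until the page has finished loading (READYSTATE_COMPLETE = 4)
        While wb.ReadyState <> 4 Or wb.Busy: DoEvents: Wend

        Set doc = wb.Document
        ' ... read the fields you need from doc here ...
    Next i

    wb.Quit ' close the IE instance when done
End Sub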

FYI: your querySelector is a bit off, and you really don't need to loop through each el in doc.getElementsByTagName("p").

Ex: to get the description...
Code:
Cells(i, 4) = doc.querySelector("[id*='product_shortdescription']").innerText

This should give you enough info to modify your code.
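
One thing to watch with the querySelector line above: querySelector returns Nothing when no element matches, so calling .innerText directly will raise an error on pages that lack that id. A minimal sketch of a guard follows. SafeInnerText is a hypothetical helper name, not something from the attached workbook, and it assumes doc is the loaded document from the loop.

```vba
' Hypothetical helper: returns the trimmed text of the first element
' matching cssSelector, or an empty string if nothing matches.
Function SafeInnerText(ByVal doc As Object, ByVal cssSelector As String) As String
    Dim el As Object
    Set el = doc.querySelector(cssSelector)
    If Not el Is Nothing Then SafeInnerText = Trim$(el.innerText)
End Function
```

Inside the loop it could be used as Cells(i, 4) = SafeInnerText(doc, "[id*='product_shortdescription']"), which leaves the cell blank instead of stopping the macro when a page has no match.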

Or, for the product title and description, you can just scrape the info from the head section of the page.
Ex:
Code:
Debug.Print doc.head.innerHTML
This will show you the meta names: "keywords" contains the product title and "description" contains the short description.
You'd then use string manipulation on the head's HTML string to extract the info you want.
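
As a rough sketch of that string manipulation, something like the following could pull one meta value out of the head HTML. GetMetaContent is a hypothetical helper name. It assumes each meta tag is written with name="..." before content="..." and that both use double quotes, so check the Debug.Print output first and adjust if the page differs.

```vba
' Hypothetical helper: returns the content="..." value of the meta tag
' whose name attribute equals metaName, or "" if it is not found.
' Assumes name="..." appears before content="..." inside the tag.
Function GetMetaContent(ByVal headHtml As String, ByVal metaName As String) As String
    Const MARK As String = "content="""
    Dim p As Long, q As Long

    ' Find name="<metaName>" (case-insensitive)
    p = InStr(1, headHtml, "name=""" & metaName & """", vbTextCompare)
    If p = 0 Then Exit Function

    ' Find the content attribute that follows it
    p = InStr(p, headHtml, MARK, vbTextCompare)
    If p = 0 Then Exit Function
    p = p + Len(MARK)

    ' Read up to the closing double quote
    q = InStr(p, headHtml, """")
    If q > p Then GetMetaContent = Mid$(headHtml, p, q - p)
End Function
```

In the loop this might be called as Cells(i, 4) = GetMetaContent(doc.head.innerHTML, "description").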

You'll need to study each page's structure and elements to identify which id (or other attribute) holds the information you are after.
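
To help with that inspection, one option is to dump every element id on the loaded page to the Immediate window. This is only a sketch: ListElementIds is a hypothetical helper name, and it assumes doc is the document obtained from wb.Document after the page has finished loading.

```vba
' Hypothetical helper: prints the tag name and id of every element
' on the page that has a non-empty id.
Sub ListElementIds(ByVal doc As Object)
    Dim el As Object
    For Each el In doc.all
        ' & "" guards against a Null id
        If Len(el.ID & "") > 0 Then Debug.Print el.tagName, el.ID
    Next el
End Sub
```

The printed ids can then be matched against what you see on the page to pick the selectors for the specification and compatibility sections.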
 