Can't scrape the names from a certain website

shahin

Hi there, I've come across a webpage that seems a bit tricky to scrape.
When I go to the address "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1",
it takes me to the page shown in the "Untitled" screenshot. After I click "Suchen", it loads a new layout
within the same tab and shows the page in the "Untitled1" screenshot, yet the site address is still "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1".
I would like to scrape the job names on that page, such as "Mitarbeiter für die Leerguttrennung (m/w)". Any help would be highly appreciated.
What I have written so far:

Code:
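Sub WebData()
    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Dim http As New MSXML2.XMLHTTP60
    Dim html As New HTMLDocument
    Dim source As Object, item As Object
    Dim x As Long

    ' Fetch the page and load the raw response into an HTML document.
    With http
        .Open "GET", "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Write the text of each matching element to column A.
    Set source = html.getElementsByClassName("ng-binding ng-scope")
    For Each item In source
        x = x + 1
        Cells(x, 1) = item.innerText
    Next item

    Set html = Nothing: Set source = Nothing
End Sub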
 

Attachments

  • Untitled.jpg (161 KB)
  • Untitled1.jpg (171.7 KB)
Thanks, sir Chihiro, for your kind reply. Actually, I had already noticed what you captured in the developer tools. This time I got stuck because I could not click the "SUCHEN" button. If I can somehow do that programmatically, the rest will not be a big deal. By the way, those two fields do not have to be filled in before clicking the "SUCHEN" button.
 
I tried hard to find the "name" attribute of the two blank input fields but could not. Perhaps they are hidden.
 
This is what I could capture at my end. I'm not very experienced with the developer tools.
 

Attachments

  • Image.jpg (119 KB)
First of all, you are only interested in "XHR" requests, so select the "XHR" filter instead of "All" in the developer tools' Network tab.

Then, once you are on the site, click "SUCHEN".
You will see the following request show up:
upload_2017-4-12_10-45-13.png
 
Sorry for the delayed response, sir. I was away from my PC. It seems you are very close to what I am after, because by following your instructions I tracked down that link and could see a bunch of links in a page with a JSON structure. Are those the links that may lead me to the destination I want to use? Thanks again, sir.
 
No. The names are contained within the JSON response itself,
under the "freieBezeichnung" key.

Code:
{
      "hashId" : "x6Qda81_iDdsfjxs_7CBm5elgxgW2Pue_5aoTE5Q07qaAiqjhdU3mCC0CEgfhRQt",
      "hauptberuf" : "Helfer/in - Ver- und Entsorgung",
      "freieBezeichnung" : "Mitarbeiter für die Leerguttrennung (m/w)",
      "arbeitsort" : {
        "plz" : "21073",
        "ort" : "Hamburg",
        "region" : "Hamburg",
        "land" : "Deutschland"
      },
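
As a rough illustration of reading that key from VBA, here is a minimal sketch that pulls every "freieBezeichnung" value out of a JSON string with a regular expression. ExtractJobNames and DemoExtractJobNames are hypothetical names, and the pattern assumes the values contain no escaped double quotes. A proper JSON parser would be more robust; this only shows the idea.

```vb
' Sketch only: collects every "freieBezeichnung" value found in a JSON string.
' ExtractJobNames is a hypothetical helper; the regex assumes that the values
' contain no escaped double quotes.
Function ExtractJobNames(ByVal jsonText As String) As Collection
    Dim re As Object, matches As Object, m As Object
    Dim names As New Collection

    Set re = CreateObject("VBScript.RegExp")   ' late bound, no reference needed
    re.Global = True
    re.Pattern = """freieBezeichnung""\s*:\s*""([^""]*)"""

    Set matches = re.Execute(jsonText)
    For Each m In matches
        names.Add m.SubMatches(0)              ' first capture group = the value
    Next m

    Set ExtractJobNames = names
End Function

Sub DemoExtractJobNames()
    Dim sample As String, nm As Variant, r As Long

    ' Fragment copied from the response shown above.
    sample = "{ ""hauptberuf"" : ""Helfer/in - Ver- und Entsorgung"", " & _
             """freieBezeichnung"" : ""Mitarbeiter für die Leerguttrennung (m/w)"" }"

    ' Write each name to column A of the active sheet.
    For Each nm In ExtractJobNames(sample)
        r = r + 1
        ActiveSheet.Cells(r, 1).Value = nm
    Next nm
End Sub
```

In real use you would pass the response text of the XHR request to ExtractJobNames instead of the hard-coded sample.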
 
Thanks for your response again. Actually, I asked about scraping the names so that I could do the rest myself. Basically, I want to scrape the documents on the page that opens when the link I provided above is clicked. I am now attaching pictures of the portion I'm interested in. Please ignore my linguistic mistakes.
 
You pointed me to a page that looked foreign to me. I have never worked with this "JScript" stuff.
 
One more thing to ask, sir. After getting the link I found in the request header, I checked it and realized that it doesn't contain all the links needed to reach the destination where I could scrape the data, because the page uses lazy loading. Is there any workaround? I mean, is there any way to get all the main links?
 
Really, you need to study how to read a website's source code.

In the first JSON response you can see the following info:
Line 41: "id" : "FCT.AKTUALITAET"
Line 44: "defaultValue" : 100,
Line 17: "id" : "FCT.ANGEBOTSART",
Line 23: "name" : "ARBEIT"
Line 5: "id" : "FCT.BEHINDERUNG",
Line 14: "name" : "AUS"
Line 3: "offset" : 12,
This is the information needed to construct the request URL for the second query.
Constant: https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs
2nd query: ?FCT.AKTUALITAET=100&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12

You can do the same for the next one (just change the offset value).
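
To make that concrete, here is a minimal sketch of building the URL for a given offset and requesting it from VBA. BuildJobsUrl, FetchJobsPage and DemoFetchFirstPages are hypothetical names. The host is taken from the address in the first post and the parameters from the list above. Whether the server accepts a plain GET without extra headers or cookies is an assumption I have not checked.

```vb
' Sketch only: builds the request URL for a given offset and fetches one page.
' All procedure names here are hypothetical.
Function BuildJobsUrl(ByVal offset As Long) As String
    Const BASE_URL As String = "https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"

    BuildJobsUrl = BASE_URL & _
        "?FCT.AKTUALITAET=100" & _
        "&FCT.ANGEBOTSART=ARBEIT" & _
        "&FCT.BEHINDERUNG=AUS" & _
        "&offset=" & offset
End Function

Function FetchJobsPage(ByVal offset As Long) As String
    Dim http As Object

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")   ' late bound
    http.Open "GET", BuildJobsUrl(offset), False    ' synchronous request
    http.send

    ' Returns an empty string unless the server answers 200 OK.
    If http.Status = 200 Then FetchJobsPage = http.responseText
End Function

Sub DemoFetchFirstPages()
    Dim offset As Long

    ' Offsets 12, 24, 36: assumes each response carries 12 records.
    For offset = 12 To 36 Step 12
        Debug.Print offset, Len(FetchJobsPage(offset))
    Next offset
End Sub
```

The demo only prints the length of each response to the Immediate window, so you can confirm the requests work before parsing anything.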

As for where it ends:
since you didn't specify any search parameters, there is a total of 844742 records.
With the offset advancing by 12 per request, this requires 70k+ requests.
It's not realistic to go through them all.
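
For reference, the request count can be worked out with a ceiling division. This is a small sketch with a hypothetical helper name, assuming each request returns 12 records (the offset step seen above).

```vb
' Sketch only: number of requests needed to page through all records.
' RequestsNeeded is a hypothetical helper; pageSize = 12 is an assumption
' based on the "offset" : 12 value in the first response.
Function RequestsNeeded(ByVal totalRecords As Long, ByVal pageSize As Long) As Long
    ' Integer division rounded up, so a partial last page still counts.
    RequestsNeeded = (totalRecords + pageSize - 1) \ pageSize
End Function

Sub DemoRequestsNeeded()
    Debug.Print RequestsNeeded(844742, 12)   ' prints 70396
End Sub
```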
 