Can't scrape the names from a certain website

shahin · Apr 12, 2017

Hi there, I've come across a webpage that I find a bit tricky to scrape.
When I go to the address "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1"
it takes me to the page shown in "untitled". After clicking "suchen" it loads a new layout
within the same tab and shows the page in "untitled1", yet the site address stays the same: "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1".
I would like to scrape the names on that page, such as "Mitarbeiter für die Leerguttrennung (m/w)". Any help would be highly appreciated.
What I have written so far:

Code:

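Sub WebData()
    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Dim http As New MSXML2.XMLHTTP60
    Dim html As New HTMLDocument, source As Object, item As Object
    Dim x As Long   ' output row counter

    ' Fetch the page synchronously and load the markup into an HTML document.
    With http
        .Open "GET", "https://jobboerse2.arbeitsagentur.de/jobsuche/?s=1", False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Write the text of every matching element to column A.
    Set source = html.getElementsByClassName("ng-binding ng-scope")
    For Each item In source
        x = x + 1
        Cells(x, 1) = item.innerText
    Next item

    Set html = Nothing: Set source = Nothing
End Sub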

Mitarbeiter für die Leerguttrennung

Chihiro · Apr 12, 2017

Same thing as your last question in the thread below.

http://chandoo.org/forum/threads/pr...om-a-webpage-stored-within-an-envelope.33863/

Use the browser's developer tools and see what requests are sent via XHR.

You'll find the JSON response that populates the container on the site easily enough.

shahin · Apr 12, 2017

Thanks sir Chihiro, for your kind reply. Actually, I had already noticed what you captured in the developer tools. This time I got stuck because I can't click that "SUCHEN" button. If I can somehow do that programmatically, the rest will not be a big deal. Btw, those two fields do not have to be filled in before clicking the "SUCHEN" button.

Chihiro · Apr 12, 2017

Yes I know. Check the Parameter field in the same line.

shahin · Apr 12, 2017

I tried hard to find the "name" parameter of the two blank input fields but could not. Perhaps they are hidden.

Chihiro · Apr 12, 2017

Just check Request URL.

shahin · Apr 12, 2017

At my end, this is what I could capture. I'm very weak at using the developer tools.

Chihiro · Apr 12, 2017

First of all, you are only interested in "XHR", so use that filter instead of "All".

Then, once you are on the site, click on "SUCHEN".
You will see the following show up.

shahin · Apr 12, 2017

Sorry sir for the delayed response. I was not at my PC. It seems you are very close to what I am after: following your instructions I tracked down that link and could see a bunch of links in a page with a JSON structure. Are those the links that may lead me to the destination I want to use? Thanks again sir.

shahin · Apr 12, 2017

Oh my god!! Those are the links to the page I was trying to reach, as in "https://jobboerse.arbeitsagentur.de...xZDg7DHuddNnLrNljCO5_p8gwX-RB3qpXk9UBcMnZ_jYw". Now, how can I scrape those links out of that page so that I can use them for what I need? Thanks sir.

Chihiro · Apr 12, 2017

No. The names are contained within the JSON response,
under the "freieBezeichnung" key.

Code:

{
      "hashId" : "x6Qda81_iDdsfjxs_7CBm5elgxgW2Pue_5aoTE5Q07qaAiqjhdU3mCC0CEgfhRQt",
      "hauptberuf" : "Helfer/in - Ver- und Entsorgung",
      "freieBezeichnung" : "Mitarbeiter für die Leerguttrennung (m/w)",
      "arbeitsort" : {
        "plz" : "21073",
        "ort" : "Hamburg",
        "region" : "Hamburg",
        "land" : "Deutschland"
      },
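
To get at this JSON from VBA, the request can target the XHR endpoint instead of the HTML page. What follows is only a minimal sketch: it assumes the endpoint quoted elsewhere in this thread (".../jobsuche/pc/v1/jobs") answers a plain GET without extra headers or cookies, which may not hold. GetJsonText and ShowFirstJsonChars are hypothetical helper names, not something from the site or the thread.

```vba
Option Explicit

' Sketch only: returns the raw response text, or "" if the status is not 200.
' Assumption: the endpoint accepts a plain GET with no cookies or tokens.
Function GetJsonText(ByVal url As String) As String
    Dim http As Object
    ' Late binding, so no library reference has to be set.
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    With http
        .Open "GET", url, False                      ' synchronous request
        .setRequestHeader "Accept", "application/json"
        .send
        If .Status = 200 Then GetJsonText = .responseText
    End With
End Function

Sub ShowFirstJsonChars()
    Dim json As String
    json = GetJsonText("https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs")
    ' Inspect the start of the response in the Immediate window (Ctrl+G).
    Debug.Print Left$(json, 500)
End Sub
```

If the Immediate window stays empty, the request probably needs something the browser sends and this sketch does not; comparing the request headers in the XHR tab is the next thing to check.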

shahin · Apr 12, 2017

Thanks for your response again. Actually, I asked about scraping the names so that I could do the rest myself. Basically, I want to scrape the content of the page that opens when the link I provided above is clicked. I'm now attaching pictures of the portion I'm interested in. Please ignore my linguistic mistakes.

shahin · Apr 12, 2017

Chihiro · Apr 12, 2017

In either case, just follow the method Marc L pointed out for parsing JSON (using ScriptControl).

http://chandoo.org/forum/threads/string-manipulation-expecting-a-better-solution.33528/#post-199228

http://chandoo.org/forum/threads/list-all-values-from-giant-string.29944/#post-178884

shahin · Apr 12, 2017

I can't find any way to scrape the links from this page: "https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs". If you give me a hint I might be able to. Terribly sorry for my persistent requests.

shahin · Apr 12, 2017

You pointed me to a page that looked foreign to me. I have never worked with this "JScript" stuff.

Chihiro · Apr 12, 2017

Hmm? The first linked post is in direct response to one of your questions.

Otherwise, you can just use VBA-JSON, which I mentioned in the link I gave you in post #2.

Or use RegExp, as jindon demonstrated in the 2nd link in post #14.

NOTE: ScriptControl can, in some cases, expose your system to malicious JS code. See details in the link.
http://stackoverflow.com/questions/6627652/parsing-json-in-excel-vba/30494373#30494373
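
For the RegExp route, a rough sketch of pulling every "freieBezeichnung" value out of the response text could look like the code below. This is not jindon's code from the linked post, and it is pattern matching on text, not real JSON parsing: escape sequences inside a value are returned as-is. ExtractNames and DemoExtractNames are hypothetical names, and the sample string is just a shortened piece of the JSON shown earlier.

```vba
Option Explicit

' Returns every "freieBezeichnung" value found in jsonText as a Collection.
' Escaped characters inside a value (e.g. \") are kept verbatim, not decoded.
Function ExtractNames(ByVal jsonText As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")   ' late binding, no reference needed
    re.Global = True
    ' key, optional whitespace, colon, optional whitespace, quoted value
    re.Pattern = """freieBezeichnung""\s*:\s*""((?:[^""\\]|\\.)*)"""

    For Each m In re.Execute(jsonText)
        result.Add m.SubMatches(0)
    Next m
    Set ExtractNames = result
End Function

Sub DemoExtractNames()
    Dim sample As String, jobNames As Collection, i As Long
    ' ChrW(252) is "ü"; built this way so the code page of the editor does not matter.
    sample = "{""hauptberuf"" : ""Helfer/in - Ver- und Entsorgung""," & _
             """freieBezeichnung"" : ""Mitarbeiter f" & ChrW(252) & _
             "r die Leerguttrennung (m/w)""}"
    Set jobNames = ExtractNames(sample)
    For i = 1 To jobNames.Count
        Cells(i, 1).Value = jobNames(i)   ' one name per row in column A
    Next i
End Sub
```

A proper parser such as VBA-JSON is the sturdier choice once more than one field is needed; the pattern above is only convenient when a single key is of interest.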

shahin · Apr 12, 2017

One more thing to ask you sir. After getting the link from the request header, I checked it and realized that this one link doesn't contain all the links that lead to the pages I want to scrape data from, because the page uses lazy loading. Any workaround? I mean, is there any way to get all the main links?

Chihiro · Apr 12, 2017

Not sure what you mean.

shahin · Apr 12, 2017

I meant if one link is "https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs" then
another is "https://jobboerse2.arbeitsagentur.d...CT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS". So, there might be a few others. That is what I meant.

shahin · Apr 12, 2017

Actually, the links increment like this, but I don't know where they end, i.e. what the last link is.
"https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"
"https://jobboerse2.arbeitsagentur.d...00&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=24"
"https://jobboerse2.arbeitsagentur.d...EBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=36"

Chihiro · Apr 13, 2017

Really, you need to study how to read a web site's source code.

You can see the following info in the 1st JSON response:
Line41: "id" : "FCT.AKTUALITAET"
Line44: "defaultValue" : 100,
Line17: "id" : "FCT.ANGEBOTSART",
Line23: "name" : "ARBEIT"
Line 5: "id" : "FCT.BEHINDERUNG",
Line14: "name" : "AUS"
Line 3: "offset" : 12,
This is the info needed to construct the Request URL for the second query.
Constant: https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs
2nd query: ?FCT.AKTUALITAET=100&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&offset=12

You can do the same for the next one (just change the offset value).

As for where it ends...
Since you didn't specify any parameters, there is a total of 844742 records.
At 12 records per request, that requires 70k+ requests.
It's not realistic to go through them all.
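
The URL construction and the request count can be sketched in VBA as follows. The page size of 12 is an assumption read off the offsets quoted in this thread (12, 24, 36), the parameter spelling follows the URLs shahin posted, and BuildJobsUrl, RequestsNeeded and DemoPaging are hypothetical helper names.

```vba
Option Explicit

Const BASE_URL As String = "https://jobboerse2.arbeitsagentur.de/jobsuche/pc/v1/jobs"
Const PAGE_SIZE As Long = 12   ' assumption: offsets in the thread step by 12

' Builds the request URL for a given offset from the parameters in the 1st response.
Function BuildJobsUrl(ByVal pageOffset As Long) As String
    BuildJobsUrl = BASE_URL & _
        "?FCT.AKTUALITAET=100" & _
        "&FCT.ANGEBOTSART=ARBEIT" & _
        "&FCT.BEHINDERUNG=AUS" & _
        "&offset=" & CStr(pageOffset)
End Function

' Number of requests needed to page through totalRecords (ceiling division).
Function RequestsNeeded(ByVal totalRecords As Long) As Long
    RequestsNeeded = (totalRecords + PAGE_SIZE - 1) \ PAGE_SIZE
End Function

Sub DemoPaging()
    Dim i As Long
    For i = 1 To 3
        Debug.Print BuildJobsUrl(i * PAGE_SIZE)   ' offsets 12, 24, 36
    Next i
    Debug.Print RequestsNeeded(844742)            ' prints 70396
End Sub
```

With a count that large, narrowing the search with parameters first and then looping over a bounded number of offsets is the more practical route.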
