
Web Data Pulling - Need help in improving code.

shrivallabha

Excel Ninja
Hello All,

I have been working on a code for my friend to pull information from a financial website. Putting the code together was not an issue.

It is a two-phased approach:
1. It pulls portfolio information from listed mutual funds.
2. Then it scrapes individual share prices from the same site.

Site is: http://www.moneycontrol.com/

Question: The code works but it is slow. Painfully slow for phase 2, as there are hundreds of shares. Currently I am getting the share URLs in the phase 1 pull and then scraping them using the same MSXML2.XMLHTTP60 object.

I do have a few ideas of my own to improve efficiency:
- Adding a Dictionary to minimize scraping for shares by de-duplicating.
- Pulling a complete dump and using a mapping table to map to the share values I get.

However, I'd like to know from the experts whether the current code can be improved upon by using some other objects or methods.

Attachment:
In sheet 1 "MF-INDEX", you need to put Mutual Fund Name. I have kept one.
In sheet 2 output is taken from above mentioned site.

Thank you very much in advance for your ideas and inputs! If any clarification is required then let me know.
 

Attachments

  • Pull_Info.xlsm
    27.9 KB · Views: 18
What happens when you remove the portion below from your script and run it again? Doesn't it fetch the same accurate results while performing a lot faster?
Code:
#If VBA7 Then
    Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As LongPtr) 'For 64 Bit Systems
#Else
    Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long) 'For 32 Bit Systems
#End If  

Do While objHTTP.readyState <> 4 ''this three-line wait loop appears in two places in your script: just drop both
   Sleep 1000
Loop

As far as my knowledge goes, that portion of code makes no difference when it is bound to "Msxml2.xmlhttp" instead of "IE".
 
At a glance, you are writing results back to the sheet on each loop iteration.

I'd probably build an array in code and write it back to the sheet in one shot.

Also, you are scraping using DOM elements. Try string manipulation; it may end up being faster than the .getElementByXXX methods.

Edit: Oh wait, I just noticed the second .send within the loop. It looks like the bottleneck is due to a VBA limitation, since the code can only send one request at a time.

Though I'm not too familiar with it, Python can accommodate faster scraping via multiprocessing.

Here's a link on the subject.
https://medium.com/python-pandemoni...scraper-by-using-multiprocessing-f2f4ef838686
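For what it's worth, a minimal sketch of that idea (hypothetical share URLs and a bare fetch helper, not the workbook's actual code), using multiprocessing.Pool to issue several requests in parallel:
Code:
# Minimal multiprocessing sketch: fetch several pages in parallel worker processes.
from multiprocessing import Pool

import requests

def fetch_page(url):
    """Download one share page and return its raw HTML (price parsing omitted here)."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    return url, r.text

if __name__ == '__main__':
    # Hypothetical list of share URLs collected in the phase 1 pull
    urls = ['http://www.moneycontrol.com/share-a',
            'http://www.moneycontrol.com/share-b']
    with Pool(processes=4) as pool:              # 4 workers send requests concurrently
        results = pool.map(fetch_page, urls)
    for url, html in results:
        print(url, len(html))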
 
Hi!

I do not have much time, but two points:

• Tools exist to use Python in Excel …

• If using requests is possible with your site,
you can multiprocess the requests by calling from VBA the same VBScript
for each share, as already shown in this thread!
It is not necessary to create the VBScript file via VBA; just call it,
as it can be created manually …
The second part, which counts down the requests, is not necessary either;
it was just for the demonstration …
 
Thank you guys for your inputs. I will try to work on some of them during this week and let you know.

Of late, I have joined Stack Overflow and have answered a few questions on Excel + VBA. But I have seen many posts on Python dealing with Excel. I don't do Python scripting, but let me see.
 
If using xlwings, I'd recommend Anaconda as the interpreter.
It has all the modules needed by xlwings, and it is much easier to set up a development environment with it than with most other interpreters.
 
I've been playing with Python a bit (Anaconda, Python 3.6.3). I was successful in returning the result to Excel using xlwings.

Though I'm having an issue with multiprocessing, so speed can probably still improve.

Current code below.
Code:
# -*- coding: utf-8 -*-

import re
import requests
import xlwings as xw
from bs4 import BeautifulSoup
from time import sleep

def getlisting(url):
    headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        table = soup.find("table", {"class" : "tblporhd"})
        refs = table.find_all("a", {"class" : ""})
        links = ['http://www.moneycontrol.com' + link.get('href').replace(' ','') for link in refs]
        return links

def parseinfo(surl):
    headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    sr = requests.get(surl, headers=headers, timeout=10)
#    sleep(2)
    if sr.status_code == 200:
        print('Processing..' + surl)
        shtml = sr.text
        ssoup = BeautifulSoup(shtml, 'lxml')
        node = ssoup.find("span", id=re.compile('Nse_Prc_tick'))
        if node is None:
            # No NSE tick on the page; fall back to the BSE price
            tick = ssoup.find("span", id=re.compile('Bse_Prc_tick')).text + ' (Bse)'
        else:
            tick = node.text
        return tick

mylinks = getlisting('http://www.moneycontrol.com/india/mutualfunds/mfinfo/portfolio_holdings/MMO029')

mytick = [parseinfo(link) for link in mylinks]
xw.Range('A1').options(transpose=True).value = mytick
""" Code piece to export soup to text file """
#with open('C:\\test\\out2.txt', "wb") as out:
#    out.write(mysoup.encode("utf-8"))

""" Function to get data from table including header """
#def parse_table(table):
#    return [
#        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
#          for row in table.find_all('tr')
#    ]

I may try out another method (asyncio) when I find time.
 
Bah, aiohttp, which I was using in conjunction with asyncio, has too many bugs/issues at the moment.

For some odd reason, it works in most cases, but fails if the URL list that I scrape has a certain length (e.g. 8 URLs fail, 9 work, etc.). And if one URL fails, the entire asyncio process fails. :(

Oh well, maybe some other method.
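One thing that keeps a single failed URL from aborting the whole batch is passing return_exceptions=True to asyncio.gather; a minimal aiohttp sketch along those lines (hypothetical URL list, assuming aiohttp is installed):
Code:
# Minimal aiohttp sketch; return_exceptions=True stops one bad URL from killing the run.
import asyncio
import aiohttp

URLS = ['http://www.moneycontrol.com/',
        'http://www.moneycontrol.com/india/']    # hypothetical list

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, u) for u in URLS]
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.get_event_loop().run_until_complete(main())
for item in results:
    # Failed URLs come back as exception objects instead of (url, html) tuples
    print(item if isinstance(item, Exception) else (item[0], len(item[1])))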
 
@sir Chihiro, have you tried grequests and openpyxl? grequests is for asynchronous calls and openpyxl is for writing the collected items to an Excel file. I found both of them easy to play with. FYI, I have read in a few places that asyncio doesn't serve the purpose the way it should because it allegedly sends requests in a blocking manner. However, the best option is always Scrapy when it comes to sending requests asynchronously.
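A rough sketch of that combination (hypothetical URL list; grequests.map sends the requests concurrently via gevent, and openpyxl writes whatever comes back to a workbook):
Code:
# Rough grequests + openpyxl sketch (hypothetical URLs; price parsing omitted).
import grequests
from openpyxl import Workbook

urls = ['http://www.moneycontrol.com/share-a',
        'http://www.moneycontrol.com/share-b']

reqs = (grequests.get(u, timeout=10) for u in urls)
responses = grequests.map(reqs, size=5)          # up to 5 requests in flight at once

wb = Workbook()
ws = wb.active
ws.append(['URL', 'Status', 'Bytes'])
for u, r in zip(urls, responses):
    ws.append([u, r.status_code if r is not None else 'failed',
               len(r.text) if r is not None else 0])
wb.save('scrape_results.xlsx')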
 
Hi Shrivallabha,

I doubt that anything can improve the execution time of this application substantially.

1. You are retrieving information on the portfolio of a mutual fund; this is the one outer loop, which takes some time. It is, however, not the real issue, since that web request happens only once.

2. The really time-consuming code is the inner loop, where for each share in the portfolio the market price is retrieved using a web request.

My timing for your original code, as it is, was 52 seconds.

The code itself can be improved a little with the following:

1. Eliminating both Sleep 1000 lines

2. Using Application.ScreenUpdating = False

3. Shifting the code for outputting columns 4 through 8 outside the inner loop.

Doing all of the above reduced the timing to 41 seconds.

For 26 web requests, this seems a reasonable amount of time.

The only substantial improvement will come if we can eliminate web requests, which, as you have said, can be done by storing each share's market price in a dictionary, so that as you progress through the funds, the share prices are retrieved more often from the dictionary than from the web.
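A minimal sketch of that caching idea (shown here in Python for brevity; in the VBA version a Scripting.Dictionary plays the same role, and the fetch helper below is hypothetical):
Code:
# Dictionary cache: each ticker URL is scraped at most once across all funds.
import requests

price_cache = {}

def fetch_price(url):
    """Hypothetical scrape of one ticker page; parsing the price out is omitted."""
    return requests.get(url, timeout=10).text

def get_price(url):
    """Serve repeat tickers from the cache so only new tickers cause a web request."""
    if url not in price_cache:
        price_cache[url] = fetch_price(url)
    return price_cache[url]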

Narayan
 
pull information from a financial website.
Another point: using the XMLHTTP ActiveX makes the request faster (instant)
for the second call of the same ticker (in case of duplicates), as it works from cache, but it can be a mess for a financial site if the result needs accuracy,
for example when a value changes between requests …
For accuracy, the WinHttp ActiveX is appropriate, as it does not work from cache.
That is why, for financial scraping, it is clearly worth
adding a timestamp to each value in the result …
 
FYI, I have read in a few places that asyncio doesn't serve the purpose the way it should because it allegedly sends requests in a blocking manner.

The above is true if the process you are running doesn't support asyncio (and when you are not using .run_in_executor). You can test it by running a simple script that reads the responses of multiple sites.

I'm currently playing with .run_in_executor along with urllib.request to get a stable result. Struggling a bit to wrap my head around recursive co-routines.
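A bare-bones sketch of that pattern (hypothetical URL list): run_in_executor pushes the blocking urllib.request calls onto a thread pool so the event loop can overlap them.
Code:
# run_in_executor sketch: blocking urllib.request calls run in a thread pool.
import asyncio
from urllib.request import urlopen

URLS = ['http://www.moneycontrol.com/',
        'http://www.moneycontrol.com/india/']    # hypothetical list

def fetch(url):
    """Blocking download; the executor keeps it off the event loop thread."""
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

async def main(loop):
    futures = [loop.run_in_executor(None, fetch, u) for u in URLS]   # None = default executor
    return await asyncio.gather(*futures)

loop = asyncio.get_event_loop()
for url, body in loop.run_until_complete(main(loop)):
    print(url, len(body))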
 
In sheet 1 "MF-INDEX", you need to put Mutual Fund Name. I have kept one.
In sheet 2 output is taken from above mentioned site.
Could you attach a 'before' workbook with at least 30 codes
(many more if possible, and with duplicate individual shares) and
a result workbook with the data filled in for the first two codes …
 
Thank you so much, everyone, for your support and suggestions. I have learned many things in this thread, including some web scraping basics. Thank you @shahin @Chihiro @Marc L @NARAYANK991 for contributing!

I couldn't learn Python this week, so that option remained out. Instead, I showed it to the person who wanted it, and he was okay with the time being consumed, as doing it manually would take him longer. In fact, much longer.

Accuracy between requests is not paramount here as the user intends to run it once in a quarter!

I have done some work and optimized the code as much as possible with inputs here and a related question on Stack Overflow.

Attached herewith is the final worked out version.
 

Attachments

  • PullDataFromWeb.xlsm
    46 KB · Views: 19

Interesting note: when I ran the VBA in PullDataFromWeb, it seemed to get faster when running it a second or third time consecutively.

Question: How much trouble would it be to pull only the current prices of gold and silver from the same website and put the results in cells G2 (gold) and H2 (silver)?
 
Did you use a different method in your last version of PullDataFromWeb, as I do not see any VBA macro when using ALT + F8?
 