Web Data Pulling - Need help improving code.

Discussion in 'VBA Macros' started by shrivallabha, Jan 5, 2018.

  1. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,804
    Hello All,

    I have been working on code for a friend to pull financial information from a website. Putting the code together was not an issue.

    It is a two-phase approach:
    1. It pulls portfolio information from listed mutual funds.
    2. Then it scrapes individual share prices from the same site.

    Site is: http://www.moneycontrol.com/

    Question: The code works, but it is slow. Painfully slow for phase 2, as there are hundreds of shares. Currently I am getting the share URLs in the phase 1 pull and then scraping them using the same MSXML2.XMLHTTP60 object.

    I do have a few ideas of my own to improve efficiency:
    - Adding a Dictionary to minimize scraping of shares by de-duplicating (see the sketch below).
    - Pulling a complete dump and using a mapping table to map to the share values I get.
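
    A minimal sketch of the first idea, caching results in a Scripting.Dictionary so each ticker is fetched only once (the tickers and the GetPrice routine are illustrative placeholders, not the actual code):
    Code (vb):

    Sub DedupWithDictionary()
        ' Fetch each ticker once; duplicates are served from the Dictionary.
        Dim dict As Object, tickers As Variant, t As Variant
        Set dict = CreateObject("Scripting.Dictionary")
        tickers = Array("TCS", "INFY", "TCS")   ' illustrative list with a duplicate
        For Each t In tickers
            If Not dict.Exists(t) Then dict(t) = GetPrice(CStr(t)) ' web request only on a cache miss
            Debug.Print t, dict(t)
        Next t
    End Sub

    Function GetPrice(ticker As String) As Double
        ' Placeholder for the actual MSXML2.XMLHTTP60 scrape of the share price.
        GetPrice = 0
    End Function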

    However, I'd like to know from the experts whether the current code can be improved upon by using some other objects or methods.

    Attachment:
    In sheet 1, "MF-INDEX", you need to put the Mutual Fund name. I have kept one.
    In sheet 2, the output is taken from the above-mentioned site.

    Thank you very much in advance for your ideas and inputs! If any clarification is required then let me know.

    Attached Files:

    NARAYANK991 and ThrottleWorks like this.
  2. shahin

    shahin Active Member

    Messages:
    657
    What happens when you kick out the below portion from your script and run it again? Doesn't it fetch accurate results while performing a lot faster?
    Code (vb):

    #If VBA7 Then
        Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As LongPtr) ' For 64-bit systems
    #Else
        Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long) ' For 32-bit systems
    #End If

    Do While objHTTP.readyState <> 4 ' This three-line loop appears in two places in your script: just shake it off
        Sleep 1000
    Loop
     
    As far as my knowledge goes, that portion of code brings about no change when it is bound to "Msxml2.xmlhttp" instead of "IE".
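
    For reference, a minimal sketch of a synchronous request with that object (URL illustrative): passing False for the async parameter makes the call block until the response is ready, so no readyState/Sleep loop is needed.
    Code (vb):

    Sub FetchSync()
        Dim objHTTP As Object
        Set objHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
        objHTTP.Open "GET", "http://www.moneycontrol.com/", False ' False = synchronous
        objHTTP.send
        Debug.Print Left$(objHTTP.responseText, 100) ' response is already complete here
    End Sub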
    ThrottleWorks and shrivallabha like this.
  3. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,804
    Thank you for looking into it, Shahin. You are correct, and I have thrown it out. However, the code's overall time is not much affected.
    ThrottleWorks likes this.
  4. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,062
    At a glance, you are writing results back to the sheet on each loop iteration.

    I'd probably build an array in code and write it back to the sheet in one shot.
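
    A minimal sketch of that pattern (sheet name, sizes and values are illustrative):
    Code (vb):

    Sub WriteInOneShot()
        ' Accumulate results in an array; write to the sheet once at the end.
        Dim results() As Variant, i As Long, n As Long
        n = 100 ' illustrative number of shares
        ReDim results(1 To n, 1 To 2)
        For i = 1 To n
            results(i, 1) = "Ticker" & i ' placeholder for a scraped name
            results(i, 2) = i            ' placeholder for a scraped price
        Next i
        Worksheets("Sheet2").Range("A1").Resize(n, 2).Value = results ' single write
    End Sub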

    Also, you are scraping using DOM elements. Try string manipulation; it may end up being faster than the .getElementByXXX methods.

    Edit: Oh wait, I just noticed the second send within the loop. It looks like the bottleneck is due to a VBA limitation, since the code can only send one request at a time.

    Though I'm not too familiar with it, Python can accommodate faster scraping via multithreaded processing.

    Here's a link on the subject.
    https://medium.com/python-pandemoni...scraper-by-using-multiprocessing-f2f4ef838686
    Last edited: Jan 5, 2018
    ThrottleWorks likes this.
  5. Marc L

    Marc L Excel Ninja

    Messages:
    3,340
    Hi!

    I do not have much time, but two points:

    • Tools exist to use Python in Excel …

    • If using requests is possible with your site,
    you can multiprocess the requests by calling the same VBScript
    from VBA for each share, as already shown in this thread!
    It is not necessary to create the VBScript file via VBA, just to call it,
    as it can be created manually …
    The second part, counting down the requests, is also unnecessary;
    it was just for the demonstration …
    ThrottleWorks and shrivallabha like this.
  6. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,062
    ThrottleWorks likes this.
  7. Marc L

    Marc L Excel Ninja

    Messages:
    3,340

    Years ago I heard about Python within Excel, but nothing came of it.

    Another tool: PyXLL …
    ThrottleWorks likes this.
  8. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,804
    Thank you guys for your inputs. I will try to work on some of them during this week and let you know.

    Of late, I have joined Stack Overflow and have answered a few questions on Excel + VBA. I have seen many posts there on Python dealing with Excel. I don't do Python scripting, but let me see.
    ThrottleWorks likes this.
  9. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,062
    If using xlwings, I'd recommend Anaconda as the Python distribution.
    It has all the modules needed by xlwings, and it is much easier to set up a development environment with than most other distributions.
    ThrottleWorks likes this.
  10. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,062
    I've been playing with Python a bit (Anaconda, Python 3.6.3). I was successful in returning the result to Excel using xlwings.

    Though I'm having issues with multiprocessing, so speed can probably still improve.

    Current code below.
    Code (python):
    # -*- coding: utf-8 -*-

    import re
    import requests
    import xlwings as xw
    from bs4 import BeautifulSoup
    from time import sleep

    HEADERS = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

    def getlisting(url):
        r = requests.get(url, headers=HEADERS, timeout=10)
        if r.status_code == 200:
            soup = BeautifulSoup(r.text, 'lxml')
            table = soup.find("table", {"class": "tblporhd"})
            refs = table.find_all("a", {"class": ""})
            links = ['http://www.moneycontrol.com' + link.get('href').replace(' ', '') for link in refs]
            return links

    def parseinfo(surl):
        sr = requests.get(surl, headers=HEADERS, timeout=10)
    #    sleep(2)
        if sr.status_code == 200:
            print('Processing..' + surl)
            ssoup = BeautifulSoup(sr.text, 'lxml')
            node = ssoup.find("span", id=re.compile('Nse_Prc_tick'))
            if node is None:
                # Fall back to the BSE price when no NSE tick is present.
                return ssoup.find("span", id=re.compile('Bse_Prc_tick')).text + ' (Bse)'
            return node.text

    mylinks = getlisting('http://www.moneycontrol.com/india/mutualfunds/mfinfo/portfolio_holdings/MMO029')

    mytick = [parseinfo(link) for link in mylinks]
    xw.Range('A1').options(transpose=True).value = mytick

    """ Code piece to export soup to text file """
    #with open('C:\\test\\out2.txt', "wb") as out:
    #    out.write(mysoup.encode("utf-8"))

    """ Function to get data from table including header """
    #def parse_table(table):
    #    return [
    #        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
    #        for row in table.find_all('tr')
    #    ]
    I may try out another method (asyncio) when I find time.
    ThrottleWorks and shrivallabha like this.
  11. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,062
    Bah, the aiohttp that I was using in conjunction with asyncio has too many bugs/issues at the moment.

    For some odd reason it works in most cases, but fails when the URL list I scrape is a certain length (e.g. 8 fails, 9 works, etc.). And if one URL fails, the entire asyncio process fails. :(

    Oh well, maybe some other method.
    ThrottleWorks likes this.
  12. shahin

    shahin Active Member

    Messages:
    657
    @sir Chihiro, have you tried grequests and openpyxl? grequests is for asynchronous calls, and openpyxl is for writing the collected items to an Excel file. I found both of them easy to play with. FYI, I have read in a few places that asyncio doesn't serve the purpose the way it should because it allegedly sends requests in a blocking manner. However, the best option is always Scrapy when it comes to sending requests asynchronously.
  13. Marc L

    Marc L Excel Ninja

    Messages:
    3,340
    Create the request & document objects only once, at the beginning of the procedure,
    so before any loop …
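
    A minimal sketch of that pattern (URL and loop count are illustrative):
    Code (vb):

    Sub HoistObjects()
        ' Create the request and document objects once; reuse them in the loop.
        Dim objHTTP As Object, objDoc As Object, i As Long
        Set objHTTP = CreateObject("MSXML2.XMLHTTP.6.0")
        Set objDoc = CreateObject("htmlfile")
        For i = 1 To 10
            objHTTP.Open "GET", "http://www.moneycontrol.com/", False
            objHTTP.send
            objDoc.body.innerHTML = objHTTP.responseText ' reuse, do not recreate
            ' ... parse objDoc here ...
        Next i
    End Sub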
  14. NARAYANK991

    NARAYANK991 Excel Ninja

    Messages:
    16,002
    Hi Shrivallabha ,

    I doubt that anything can improve the execution time of this application substantially.

    1. You are retrieving information on the portfolio of a mutual fund; this is the one outer loop, which takes some time; it is, however, not the real issue, since the web request happens only once.

    2. The real time-consuming code is the inner loop, where the market price is retrieved using a web request for each share in the portfolio.

    My timing for your original code, as it is, was 52 seconds.

    The code itself can be improved a little with the following :

    1. Eliminating both Sleep 1000 lines

    2. Using Application.ScreenUpdating = False

    3. Shifting the code for outputting columns 4 through 8 outside the inner loop.

    Doing all of the above reduced the timing to 41 seconds.

    For 26 web requests, this seems a reasonable amount of time.

    The only substantial improvement will come if we can eliminate web requests, which, as you have said, can be done by storing each share's market price in a dictionary, so that as you progress through the funds, the share prices are retrieved more often from the dictionary than from the web.

    Narayan
  15. Marc L

    Marc L Excel Ninja

    Messages:
    3,340
    Another point: using the XMLHTTP ActiveX makes a request faster (instant)
    for the second call of the same ticker (in case of duplicates), as it works from the cache. But that can be a mess for a financial site if the result needs accuracy,
    for example when a value changes between requests …
    For accuracy, the WinHttp ActiveX is appropriate, as it does not work from the cache.
    That is also why, for financial scraping, it is obviously worth
    adding a timestamp to each value in the result …
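
    A minimal sketch of that approach (URL illustrative), fetching through WinHttp and stamping each value with the fetch time:
    Code (vb):

    Sub FetchFreshWithStamp()
        ' WinHttp does not serve responses from the cache, so each price is fresh.
        Dim req As Object
        Set req = CreateObject("WinHttp.WinHttpRequest.5.1")
        req.Open "GET", "http://www.moneycontrol.com/", False
        req.Send
        Debug.Print Now, Left$(req.ResponseText, 80) ' timestamp alongside the value
    End Sub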
  16. Chihiro

    Chihiro Excel Ninja

    Messages:
    4,062
    The above is true if the process you are running doesn't support asyncio (and when you are not using .run_in_executor). You can test it by running a simple script to read the responses of multiple sites.

    I'm currently playing with .run_in_executor along with urllib.request to get a stable result. Struggling a bit to wrap my head around recursive co-routines.
  17. Marc L

    Marc L Excel Ninja

    Messages:
    3,340
    Could you attach a "before" workbook with at least 30 codes
    (many more if possible, and with duplicated individual shares) and
    a result workbook with data filled in for the first two codes …
  18. shrivallabha

    shrivallabha Excel Ninja

    Messages:
    1,804
    Thank you so much, everyone, for your support and suggestions. I have learned many things in this thread, including some web-scraping basics. Thank you @shahin @Chihiro @Marc L @NARAYANK991 for contributing!

    I couldn't learn Python this week, so that option remained out. Instead, I showed it to the person who wanted it, and he was okay with the time being consumed, as doing it manually would take him longer. In fact, much longer.

    Accuracy between requests is not paramount here, as the user intends to run it once a quarter!

    I have done some work and optimized the code as much as possible with the inputs here and a related question on Stack Overflow.

    Attached herewith is the final worked out version.

    Attached Files:
