• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Excel for Office 365 - Unexplained file size difference

sccowl

New Member
I posted a similar thread a couple of months ago, but am still struggling to get to the bottom of this issue, so trying again to see if anyone can shine a light on what is happening. I was previously called out for cross posting this topic, but as with then, this is the only place I am seeking help.

To summarise, I built financial model which was a relatively large 19.1Mb in size. It is stored on a different Office 365 sharepoint domain to the one I built it on. With no material addition of new data, the working version increased to over 30Mb. It has remained relatively stable after the initial increase.

I am at a loss to understand why the file has grown so much bigger and what I can do to fix and prevent it happening again.

Using the zip file method to review XML sheet size, showed large increases across most sheets, some over 100%, but many of these on sheets that had no change in data.

Frustratingly, the one which does include updates for current months financials and therefore additional data (although changing from 0 to a new values rather than blank so now new cells used) is about the only sheet that hasn't changed in size.

I stripped back a copy of each model to test the differences, and have got it back to just 1 template sheet, which was hidden and therefore hasn't been used. My original version of this sheet is 817k but this has increased to 1,822k (the saved xlsm excel file size).

I've checked both versions, and as expected, can't see any visual differences. Both contain 36,786 cells that contain some data, as per cell count displayed.

Both me and the users are using Excel 2016 via Office 365. The files are stored on sharepoint servers.

Any idea what could have caused two apparently identical files to be so different in size over such a short period of time, and is there anything I can do to prevent it happening?

Thanks
 
Few things contribute to size increase.

1. Formatting info. Each different formatting applied will create it's own node in xml and will contribute to file size increase. In general, I keep formatting to minimum and only at last stage of report.

2. Pivot Cache. Depending on how your pivots are generated, it may create duplicate cache, each connecting to different pivot. This will cause bloat in size. Typically, I make sure there's only single cache for all my pivots by using single data model as source (loaded via PQ).

There are other factors that would impact (such as Names stored, etc) but above 2 are usually the main culprit I've encountered in file size bloat.

It would help, if you can upload the stripped down version (both before and after bloat) to some cloud service and share it with us. We can then use tools to analyze the difference between the two (such as using PowerQuery to read xml content of each file).
 
Thank very much Chihiro,

1. There is formatting within the file including some conditional formatting and groups, but they should be identical in both.
2. There are no pivots in the model - everything is consolidated using SUMIFS to specific locations which change dynamically based on current period

There is extensive use of named ranges, but as per formatting, they are identical in both.

The users are just adding the latest monthly updates and potentially manipulating some forecasting assumptions, but there is (or should be) little new data introduced.

I have stripped back to just one sheet to uploaded to google drive.

https://drive.google.com/open?id=1-PZwabW-VvEjRbfFBtCarkGRXl2xA0lH

I've also taken the data out, so there are lots of #REF errors, but the file size differences remain.

The original version of the sheet is 768k and the updated version (ie. The one used by other users) is 1,765k - more than twice the size.

I have also uploaded the results of my simple zip XML test showing each sheet size from both models.

Thanks again if you are able to take a look and spot anything that is causing the difference.

UPDATE - I forgot to add, there are a few simple VBA routines, so saved as XLSM. I can strip this out as well if needed.
 
Few findings:
1. sheet1.xml size difference is about 8,000 KB.
upload_2019-1-23_14-17-33.png

2. Dimension of Original is A1:LN2083 but Updated is A1:LM2101

3. There are extra 6 names found in Updated workbook not present in Original. 403 vs 397.

4. Though difference in size would be minuscule... Updated has more code line than Original (about 1KB size diff).

5. CF ranges are different between the two. Also additional CF found in Updated.

So... this leads me to believe that there's more than just upload to SharePoint going on here.
 
A killer might be "last cell", can be found via Find Special or a shortcut I can't remember for the moment (CTRL + END or CTRL + arrow right?).
Delete "empty" columns and ditto rows. I've seen large workbooks shrink to a ridiculous small size by doing this.
 
Thanks very much Chiro and Guido,

Really interesting, although confusing. Its more plausible that the change is in Excel rather than differences in sharepoint sites. There is no reason for the users to expand the dimension on the sheet. I could just about believe they may have inserted some lines, which could also explain the changes in CF, although they told me they haven't done anything and there was no reason to. They have no understanding of named ranges so doubt they have added anything. However, it gives me a good starting point to investigate from.

Are there any resources that you can point me to explaining how to use PowerQuery to do the analysis you've done. It would be good to understand more about analysing XML.

Thanks again.
 
Note: Names are not limited to named ranges. But also includes tables and other defined areas (use code below to check count of names in workbooks).
Code:
Debug.Print ThisWorkbook.Names.Count

Will see if I have time later to write up tutorial on how to access XML structure in Excel workbook through PowerQuery.
 
Ok here's basics of how to read XML from Excel file. Note that this works for only files after Excel 2007, I believe.

You can read following types: .xlsx, .xlsm, .xlsb
Note that some values will be different for .xlsb as some xml elements are not parsed for binary file.

Concept:
Excel is structured using Office Open XML file format. Using ECMA-376 standard.
You can read about it in links below (for the 2nd link, there's newer SDK 2.9 out, but documentation is still accurate)
http://www.ecma-international.org/publications/standards/Ecma-376.htm
https://docs.microsoft.com/en-us/office/open-xml/open-xml-sdk

So, Excel file is basically collection of XML files compressed using Zip. Then in order to read XML content of a workbook, you'll need to unzip Excel workbook. You can either do this manually, or using PowerQuery custom function.

See link for function read zipped file in PowerQuery (PQ).
http://sql10.blogspot.com/2016/06/reading-zip-files-in-powerquery-m.html

Steps:
1. Go into PQ editor by using Data->GetData->From Other Sources->Blank Query
2. Add the Unzip function from link and name it "fnUnzip"
3. Add another blank query and paste in following using advanced editor. replacing file path. This will net you list of xml files like below.
Code:
let
    Source = File.Contents("C:\Test\Book1.xlsm"),
    Files = fnUnzip(Source)
in
    Files

upload_2019-1-24_14-3-20.png

4. Click on binary content of your choice. It will auto generate 2 line table (header & body). Delete the last step and go to "Content" step. Right click on the image and choose "XML".
upload_2019-1-24_14-6-3.png

5. Navigate and expand (or build custom function to read specific content of XML).
upload_2019-1-24_14-7-9.png

NOTE: You can refer to link below, where I used similar steps to read comments from closed workbook.
https://www.excelforum.com/excel-pr...ding-the-xml-of-a-workbook-2.html#post5037804
 
Back
Top