fbpx

How can you analyze 1mn+ rows data – Excel Interview Question – 02

Share

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn
How-to handle more than million rows in Excel

As part of our Excel Interview Questions series, today let’s look at another interesting challenge. How-to handle more than million rows in Excel?

You may know that Excel has a physical limit of 1 million rows (well, its 1,048,576 rows). But that doesn’t mean you can’t analyze more than a million rows in Excel.

The trick is to use Data Model.

Excel data model can hold any amount of data

Introduced in Excel 2013, Excel Data Model allows you to store and analyze data without having to look at it all the time. Think of Data Model as a black box where you can store data and Excel can quickly provide answers to you.

Because Data Model is held in your computer memory rather than spreadsheet cells, it doesn’t have one million row limitation. You can store any volume of data in the model. The speed and performance of this just depends on your computer processor and memory.

How-to load large data sets in to Model?

Let’s say you have a large data-set that you want to load in to Excel.

If you don’t have something handy, here is a list of 18 million random numbers, split into 6 columns, 3 million rows.

Step 1 – Connect to your data thru Power Query

Go to Data ribbon and click on “Get Data”. Point to the source where your data is (CSV file / SQL Query / SSAS Cube etc.)

Get data > Get & Transform Data options

Step 2 – Load data to Data Model

In Power Query Editor, do any transformations if needed. Once you are ready to load, click on “Close & Load To..” button.

Close & Load to... options in Power Query

Tell Power Query that you want to make a connection, but load data to model.

Load data to Data Model in Excel

Now, your data model is buzzing with more than million cells.

Step 3 – Analyze the data with Pivot Tables

Go and insert a pivot table (Insert > Pivot Table)

Excel automatically picks Workbook Data Model. You can now see all the fields in your data and analyze by calculating totals / averages etc.

You can also build measures (thru Power Pivot, another powerful feature of Excel) too.

How to view & manage the data model

Once you have a data model setup, you can use,

  • Data > Queries & Connections: to view and adjust connection settings
  • Relationships: to set up and manage relationships between multiple tables in your data model
  • Manage Data Model: to manage the data model using Power Pivot
How to manage Data model in Excel - various options

Alternative answer – Can I not use Excel…

Of course, Excel is not built for analyzing such large volumes of data. So, if possible, you should try to analyze such data with tools like Power BI [What is Power BI?] This gives you more flexibility, processing power and options.

Watch the answer & demo of Excel Data Model

I made a video explaining the interview question, answer and a quick demo of Excel data model with 2 million rows. Check it out below or on my YouTube Channel.

Resources to learn about Excel Data Model

How do you analyze large volumes of data in Excel?

What about you? Do you use the data model option to analyze large volumes of data? What other methods do you rely on? Please post your tips & ideas in the comments section.

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn

Share this tip with your colleagues

Excel and Power BI tips - Chandoo.org Newsletter

Get FREE Excel + Power BI Tips

Simple, fun and useful emails, once per week.

Learn & be awesome.

Welcome to Chandoo.org

Thank you so much for visiting. My aim is to make you awesome in Excel & Power BI. I do this by sharing videos, tips, examples and downloads on this website. There are more than 1,000 pages with all things Excel, Power BI, Dashboards & VBA here. Go ahead and spend few minutes to be AWESOME.

Read my storyFREE Excel tips book

Excel School made me great at work.
5/5

– Brenda

Excel formula list - 100+ examples and howto guide for you

From simple to complex, there is a formula for every occasion. Check out the list now.

Calendars, invoices, trackers and much more. All free, fun and fantastic.

Advanced Pivot Table tricks

Power Query, Data model, DAX, Filters, Slicers, Conditional formats and beautiful charts. It's all here.

Still on fence about Power BI? In this getting started guide, learn what is Power BI, how to get it and how to create your first report from scratch.

12 Responses to “How can you analyze 1mn+ rows data – Excel Interview Question – 02”

  1. Ujjwal Kumar says:

    Thanks for this tutorial

  2. Thanks for your time to give us!

  3. Leon Nguyen says:

    How can i make Vlookup work for 1 million rows and not freeze up my computer? I have 2 sets of data full of customer account numbers and i want to do Vlookup to match them up. Thank you.

    • blahblahblacksheep says:

      At that point, I would just import both spreadsheets into an Access database, and run a query that connects them based on a common field/ID (eg: account_ID). Excel cells are all variants, so they take up a lot of memory. But, in Access, you can specify the data type of each field (text, int, decimal, etc). Then identify primary keys on each table (eg: account_ID). Just create a quick Access database, import the spreadsheets, make a query, connect them, and drag-n-drop the fields you want for the query output. Excel is good for pivot tables and such on pre-aggregated data. When you're trying to do massive data merges (like VLOOKUP on million rows), it's time to shift the data into a data-oriented solution which can spit out an aggregate data query which you can then export to excel and do the rest of your analysis on. You might be able to leverage the data model he suggested. Put both spreadsheets into the Data Model as data sources, then see if you can open up a drag-n-drop data model screen to hook them together via Primary & Foreign key values (again, Account_ID or such). Another way is to run the VLOOKUP, wait for it to finish, then copy the column that had values filled in and paste-as values. If you have to do multiple columns of VLOOKUP, just do them one-at-a-time, and then copy/paste-as values. This means your spreadsheet won't be agile (IE: won't update if you change the source data of the VLOOKUP). But, if you're just trying to merge data sets quick-n-dirty style, it can be done. Trying to run multiple VLOOKUPS on a million record data set at the same time will kill your computer. So, just do them one-at-a-time and paste-as values.

  4. Myles says:

    "well, its 1,048,576 rows"
    it's

  5. Steve says:

    The international standard abbreviation for Million is capital M. NOT mn. You shouldn't make up your own terms. It takes away from any sense we have that you know what you are talking about. BTW, lower case m is milli or 1/1000. Lower case n is nano or 1/10^-9, or one billionth.

    • Chandoo says:

      Thanks for the detail Steve.

    • Andrew says:

      1mn is million in some financial cases and some countries
      1MM is million in some industries
      1kk is also million in some countries.

      That you dont know about this could take away any sense anyone has that you know what you are talking about, or leave the impression that you are looking to be rude.

      I am also pretty sure you understood it to be 1 million and you were not opening the article expecting it to show you 1 nanometer in rows in excel.

  6. Veronica says:

    You are the best

  7. RUPAM says:

    CAN I USE THIS IN COMPARING INPUT FILE DATA AND TABLE DATA TO CONFIRM THE DATA CORRECTION?

    • Allan says:

      Rupam
      If you send me details I will show you how I would tackle COMPARING INPUT FILE DATA AND TABLE DATA
      Allan

  8. allan brayshaw says:

    'Introduction
    'An alternative approach to Excel handling large volumes of data is to use Collections to consolidate the data before making it
    'available for Excel to process. This technique is simple to code and much faster to run than asking Excel to process large volumes of data.
    'Consider a large dataset containing 5 years files of Date/customer account number/customer name/many address lines/phone number/transaction value
    'and where customers will generate many transactions per day.
    'Let's assume the business requirement is to provide a report of total transaction value per customer by year/month.
    'Rather than placing these records in an Excel worksheet, use VB to read the 5 yearly files and consolidate the information in a memory
    'collection by year/month rather than year/month/day.
    'With, say, 300,000 rows and 10 columns of data, Excel has 3 million cells to process.
    'These 300,000 records could be consolidated into a maximum of 60,000 records:- 5 years x 12 months x 1,000 customers each with a total transaction value.
    'at the same time any rows or columns not contributing to the report can be excluded. (10 columns reduced to 4 in this example)
    'In this example, consolidation reduces the data to 60,000 rows and 4 columns (i.e. 240,000 cells).
    'Therefore Excel data volumes reduce to 240,000 cells irrespective of data volume processed.
    'Using a memory Collection to consolidate the data results in Excel having far less data to process and less data means less time to process.
    'and the Excel 1 million row limit would not apply to the input data unless the consolidated data exceeds this limit - very unlikely.
    'The code below assumes the data is already held in an Excel worksheet but this could easily be converted to reading from one or more files.
    'The earliest record I can find of me using the Collection Class was 2008 (I was working as a Performance engineer at BT when the Excel row limit was 64k
    'and I had 1.2 million 999 calls to analyse) but there is also a Dictionary class which I hear has benefits over the Collections class.
    'I have never used this class but see details https://www.experts-exchange.com/articles/3391/Using-the-Dictionary-Class-in-VBA.html
    'Setup a demo as follows:
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'Copy/Paste the following text into an Excel code Module in an empty .xlsm Workbook and rename 2 sheets Data and Consolidation
    'Insert a new Class Module - Class1
    'Cut/Paste the following 34 rows of text to create a Class1 definition record of only those data fields required for analysis
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'Option Explicit
    'Private pYYMM As String
    'Private pAccountNumber As String
    'Private pCustomerName As String
    'Private pTotalValue As Single
    '
    'Property Let YYMM(xx As String) 'YYMM
    ' pYYMM = xx
    'End Property
    'Property Get YYMM() As String
    ' YYMM = pYYMM
    'End Property
    'Property Let AccountNumber(xx As String) 'AccountNumber
    ' pAccountNumber = xx
    'End Property
    'Property Get AccountNumber() As String
    ' AccountNumber = pAccountNumber
    'End Property
    'Property Let CustomerName(xx As String) 'CustomerName
    ' pCustomerName = xx
    'End Property
    'Property Get CustomerName() As String
    ' CustomerName = pCustomerName
    'End Property
    'Property Let TotalValue(xx As Single) 'TotalValue
    ' pTotalValue = xx
    'End Property
    'Property Get TotalValue() As Single
    ' TotalValue = pTotalValue
    'End Property
    ''''''''''''''''''''''''''''''''''''''''''''''''''
    '''Cut/Paste up to here and remove the first ' comment on each row after pasting into Class1

    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

    'Code module content
    Option Explicit
    Dim NewRecord As New Class1 'record area (definition of new Class1 record)
    Dim Class1Collection As New Collection 'A collection of keyed records held in memory
    Dim Class1Key As String 'Collection Key must be a String
    Dim WalkRecord As Class1
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'This code creates the text key used to access each Class1 record in memory by year, month and account number:-
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    Dim Rw As Long
    Dim StartTime As Date
    Sub Consolidate()
    StartTime = Now()
    Sheets("Data").Select
    For Rw = 2 To Range("A1").CurrentRegion.Rows.Count
    Class1Key = Format(Range("A" & Rw).Value, "yymm") & Range("B" & Rw).Value 'Key = yymm & AccountNumber
    ''''''''''''''''''''''''''''''''
    'This code populates the Class1 record:-
    ''''''''''''''''''''''''''''''''
    NewRecord.YYMM = Format(Range("A" & Rw).Value, "yymm") 'YYMM
    NewRecord.AccountNumber = Range("B" & Rw).Value 'AccountNumber
    NewRecord.CustomerName = Range("C" & Rw).Value 'CustomerName
    NewRecord.TotalValue = Range("J" & Rw).Value 'TotalValue
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'This code writes the Class record into memory and, if an identical key already exists, accumulate TotalValue:-
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    On Error Resume Next
    Call Class1Collection.Add(NewRecord, Class1Key)
    Select Case Err.Number
    Case Is = 0
    On Error GoTo 0
    '******** NEW RECORD INSERTED OK HERE *********
    Case Is = 457
    On Error GoTo 0
    '******** CONSOLIDATE DUPLICATE KEY CASES HERE ************
    Class1Collection(Class1Key).TotalValue = Class1Collection(Class1Key).TotalValue + NewRecord.TotalValue
    Case Else
    '******** UNEXPECTED ERROR *****************
    Err.Raise Err.Number
    Stop
    End Select
    Set NewRecord = Nothing
    Next Rw
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ' When all the records have been processed, This code writes Collection entries to a WorkSheet:-
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    Sheets("Consolidation").Select
    Range("A1:D1").Value = Array("Date", "account number", "customer name", "total value")
    Rw = 2
    For Each WalkRecord In Class1Collection 'walk collection and write rows
    Cells(Rw, 1) = "20" & Left(WalkRecord.YYMM, 2) & "/" & Right(WalkRecord.YYMM, 2)
    Cells(Rw, 2) = WalkRecord.AccountNumber
    Cells(Rw, 3) = WalkRecord.CustomerName
    Cells(Rw, 4) = WalkRecord.TotalValue
    Rw = Rw + 1
    Next WalkRecord
    Columns("A:A").NumberFormat = "dd/mm/yyyy;@"
    Columns("D:D").NumberFormat = "#,##0.00"
    Columns("A:D").HorizontalAlignment = xlCenter
    Columns("A:D").EntireColumn.AutoFit
    Set Class1Collection = Nothing
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'This code sorts the Consolidated records into AccountNumber within YY/MM
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    Rw = Range("A1").CurrentRegion.Rows.Count
    ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Clear
    ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Add Key _
    :=Range("A2:A" & Rw), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
    ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Add Key _
    :=Range("B2:B" & Rw), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
    With ActiveWorkbook.Worksheets("Consolidation").Sort
    .SetRange Range("A1:D" & Rw)
    .Header = xlYes
    .MatchCase = False
    .Orientation = xlTopToBottom
    .SortMethod = xlPinYin
    .Apply
    End With
    Range("A2").Select
    MsgBox ("Runtime " & Format(Now() - StartTime, "hh:mm:ss")) '300k records takes 1 min 46 seconds on my ancient HP laptop
    End Sub
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'Now run macro Generate to create 300k test rows, followed by macro Consolidate to demonstrate consolidation by data collection.
    'In my experience the improvement in Excel run time is so spectacular that no further action is required
    'but further run time improvements could result from avoiding calling the Format function twice every row,
    'or holding CustomerName in another collection and accessing during the Walk process, or using the Dictionary class?
    'Note that is advisable to run macro ClearSheets before saving your WorkBook.
    'Also the Record constant can be manually adjust up to a maximum of 1048575 if required
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    Sub Generate() 'Generates test data in Sheet Data
    Const StartDate As Date = 42370 '01/01/2016
    Const FinishDate As Date = 44196 '31/12/2020
    Const Records As Long = 300000 'number of records to generate - max 1048575
    Const Customers As Integer = 1000 'number of Customers
    Const StartValue As Integer = 5 '£5
    Const FinishValue As Integer = 100 '£100
    StartTime = Now()
    Sheets("Data").Select
    Range("A1:J1").Value = Array("Date", "account number", "customer name", "address 1", "address 2", "address 3", "address 4", "address 5", "phone number", "tx value")
    For Rw = 2 To Records + 1
    Cells(Rw, 1) = WorksheetFunction.RandBetween(StartDate, FinishDate) 'Date
    Cells(Rw, 2) = WorksheetFunction.RandBetween(1, Customers) 'Customer number
    Range(Cells(Rw, 3), Cells(Rw, 9)) = Array("Customer " & Cells(Rw, 2), "Not required", "Not required", "Not required", "Not required", "Not required", "Not required")
    Cells(Rw, 10) = WorksheetFunction.RandBetween(StartValue, FinishValue) 'Value
    Next Rw
    Columns("A:A").NumberFormat = "dd/mm/yyyy;@"
    Columns("J:J").NumberFormat = "#,##0.00"
    Columns("A:J").HorizontalAlignment = xlCenter
    Columns("A:J").EntireColumn.AutoFit
    Range("A2").Select
    MsgBox ("Runtime " & Format(Now() - StartTime, "hh:mm:ss")) '300k records takes 1 min 50 seconds on my ancient HP laptop
    End Sub
    Sub ClearSheets() 'delete all data before saving
    Sheets("Data").Select
    Range("A1").CurrentRegion.Delete Shift:=xlUp
    Range("A2").Select
    Sheets("Consolidation").Select
    Range("A1").CurrentRegion.Delete Shift:=xlUp
    Range("A2").Select
    End Sub
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'Finally, a data collection can comprise of values and keys without requiring a Class record.
    'Such a collection can be used to identify data duplicates (e.g. value/key=row number/cellcontent) or hold a lookup table (e.g. value/key=name/account)
    'Most applications process far more data than they report therefore using collections benefits most data sets irrespective of volume.
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''Example of generating subtotals by writing additional records into the collection
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    'Sub ConsolidateWithSubtotals()
    'Dim SubTotalRecord As New Class1 'subtotals
    ' StartTime = Now()
    ' Sheets("Data").Select
    ' For Rw = 2 To Range("A1").CurrentRegion.Rows.Count
    ' Class1Key = Format(Range("A" & Rw).Value, "yymm") & Range("B" & Rw).Value 'Key = yymm & AccountNumber
    '''''''''''''''''''''''''''''''''
    ''Next populate the Class1 record:-
    '''''''''''''''''''''''''''''''''
    ' NewRecord.YYMM = Format(Range("A" & Rw).Value, "yymm") 'YYMM
    ' NewRecord.AccountNumber = Range("B" & Rw).Value 'AccountNumber
    ' NewRecord.CustomerName = Range("C" & Rw).Value 'CustomerName
    ' NewRecord.TotalValue = Range("J" & Rw).Value 'TotalValue
    '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''Next write the Class record into memory and, if an identical key already exists, accumulate TotalValue:-
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ' On Error Resume Next
    ' Call Class1Collection.Add(NewRecord, Class1Key)
    ' Select Case Err.Number
    ' Case Is = 0
    ' On Error GoTo 0
    ' '******** NEW RECORD INSERTED OK HERE *********
    ' Case Is = 457
    ' On Error GoTo 0
    ' '******** CONSOLIDATE DUPLICATE KEY CASES HERE ************
    ' Class1Collection(Class1Key).TotalValue = Class1Collection(Class1Key).TotalValue + NewRecord.TotalValue
    ' Case Else
    ' '******** UNEXPECTED ERROR *****************
    ' Err.Raise Err.Number
    ' Stop
    ' End Select
    ' Set NewRecord = Nothing
    '''''''''''''''''''''''
    ''accumulate subtotals
    '''''''''''''''''''''''
    ' Class1Key = Format(Range("A" & Rw).Value, "yymm") 'Key = yymm
    ' SubTotalRecord.YYMM = Format(Range("A" & Rw).Value, "yymm") 'YYMM
    ' SubTotalRecord.AccountNumber = "Subtotal"
    ' SubTotalRecord.CustomerName = ""
    ' SubTotalRecord.TotalValue = Range("J" & Rw).Value 'TotalValue
    ' On Error Resume Next
    ' Call Class1Collection.Add(SubTotalRecord, Class1Key)
    ' Select Case Err.Number
    ' Case Is = 0
    ' On Error GoTo 0
    ' '******** NEW RECORD INSERTED OK HERE *********
    ' Case Is = 457
    ' On Error GoTo 0
    ' '******** CONSOLIDATE DUPLICATE KEY CASES HERE ************
    ' Class1Collection(Class1Key).TotalValue = Class1Collection(Class1Key).TotalValue + SubTotalRecord.TotalValue
    ' Case Else
    ' '******** UNEXPECTED ERROR *****************
    ' Err.Raise Err.Number
    ' Stop
    ' End Select
    ' Set SubTotalRecord = Nothing
    ' Next Rw
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    '' When all the records have been processed, write Collection entries to a WorkSheet:-
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ' Sheets("Consolidation").Select
    ' Range("A1:D1").Value = Array("Date", "account number", "customer name", "total value")
    ' Rw = 2
    ' For Each WalkRecord In Class1Collection 'walk collection and write rows
    ' If WalkRecord.AccountNumber = "Subtotal" Then
    ' Cells(Rw, 1) = "20" & Left(WalkRecord.YYMM, 2) & "/" & Right(WalkRecord.YYMM, 2)
    ' Cells(Rw, 2) = "Subtotal 20" & Left(WalkRecord.YYMM, 2) & "/" & Right(WalkRecord.YYMM, 2)
    ' Cells(Rw, 3) = WalkRecord.CustomerName
    ' Cells(Rw, 4) = WalkRecord.TotalValue
    ' Else
    ' Cells(Rw, 1) = "20" & Left(WalkRecord.YYMM, 2) & "/" & Right(WalkRecord.YYMM, 2)
    ' Cells(Rw, 2) = WalkRecord.AccountNumber
    ' Cells(Rw, 3) = WalkRecord.CustomerName
    ' Cells(Rw, 4) = WalkRecord.TotalValue
    ' End If
    ' Rw = Rw + 1
    ' Next WalkRecord
    ' Set Class1Collection = Nothing
    ''
    ' Rw = Range("A1").CurrentRegion.Rows.Count
    ' ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Clear
    ' ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Add Key _
    ' :=Range("A2:A" & Rw), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
    ' ActiveWorkbook.Worksheets("Consolidation").Sort.SortFields.Add Key _
    ' :=Range("B2:B" & Rw), SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
    ' With ActiveWorkbook.Worksheets("Consolidation").Sort
    ' .SetRange Range("A1:D" & Rw)
    ' .Header = xlYes
    ' .MatchCase = False
    ' .Orientation = xlTopToBottom
    ' .SortMethod = xlPinYin
    ' .Apply
    ' End With
    ' Range("A2").Select
    ' MsgBox ("Runtime " & Format(Now() - StartTime, "hh:mm:ss")) '300k records takes 1 min 46 seconds on my ancient HP laptop
    'End Sub

Leave a Reply