Power Query performance - 29 million row csv

mike.magill

New Member
I have two csv files; one is c.4Gb (19 million rows and 29 fields per row) and the other is c.9Gb (29 million rows and 20 fields per row). I have tried unsuccessfully to get the client to store this data in SQL Server, so I have to use the data in this format for Power Query analysis.

I have Power BI Desktop with DAX Studio and know that I can theoretically transform these files and export them as new, smaller csv files (e.g. removing unneeded fields), but my computer just can't seem to cope with the file sizes. I have an i7 processor and 16Gb of RAM.

If I upgrade my laptop to a newer processor and, say, 64Gb of RAM, is that going to help? What other things can I try? I'm currently seeing if increasing the 'maximum memory used per simultaneous evaluation' setting helps.
 
Before buying something new... you did install the 64-bit versions, right?
Maybe share the query statements, so we can suggest some code tweaks to improve performance.

If the data is stored in SQL Server, the PQ statement is translated into a SQL query and the query folds, meaning the server does all the hard work. But one can break the folding...
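
To illustrate folding, here is a minimal sketch of a query against a hypothetical SQL Server source (server, database and table names are made up). The Table.SelectRows filter would normally fold back into a WHERE clause, which you can check via right-click > View Native Query on that step.

Code:
let
    // Hypothetical server, database and table names, purely for illustration
    Source = Sql.Database("MyServer", "MyDatabase"),
    Velocity = Source{[Schema = "dbo", Item = "VELOCITY_ALL_DATA"]}[Data],
    // This filter should fold into a WHERE clause executed on the server
    #"Filtered Rows" = Table.SelectRows(Velocity, each [TRANSACTION_DATE] >= #date(2023, 1, 1))
in
    #"Filtered Rows"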
 
Thanks for the response. My Power BI version is 2.121.942.0 64-bit (September 2023).

I'd love to get the csv into SQL but can't persuade the client.

This is an example of the query statements. There are others but even just doing this one won't work. The #"Kept First Rows" step allows me to test on a subset of the dataset based on the RowsReturned parameter.

Any further advice would be greatly appreciated.

Code:
let
    Source = fnSmartFolder(FilePath),
    #"Filtered Rows" = Table.SelectRows(Source, each Text.Upper([Name]) = "VELOCITY - ALL DATA.CSV"),
    Content = #"Filtered Rows"{0}[Content],
    #"Imported CSV" = Csv.Document(Content,[Delimiter=",", Columns=20, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ACCOUNT_NUMBER", Int64.Type}, {"LEGACY_ACCOUNT_NUMBER", type text}, {"ACCOUNT_NAME", type text}, {"TRANSACTION_DATE", type date}, {"DOCUMENT_TYPE", type text}, {"TRANSACTION_TYPE", type text}, {"TRANSACTION_STATUS", type text}, {"TRANSACTION_REF", type text}, {"NET_TRANSACTION_VALUE", Currency.Type}, {"TRANSACTION_VAT_VALUE", Currency.Type}, {"EFFECTIVE_START_DATE", type date}, {"EFFECTIVE_END_DATE", type date}, {"BILL_DATE", type date}, {"SYSTEM_TIME", type datetime}, {"TRANSACTION_DESC1", type text}, {"TRANSACTION_DESC2", type text}, {"USER_ID", type text}, {"TRANSACTION_ID", Int64.Type}, {"TRANSACTION_TYPE_2", type text}, {"USER_2", type text}}),
    #"Kept First Rows" = if RowsReturned = 0 then #"Changed Type" else Table.FirstN(#"Changed Type",RowsReturned),
    #"Removed Other Columns" = Table.SelectColumns(#"Kept First Rows",{"ACCOUNT_NUMBER", "TRANSACTION_DATE", "DOCUMENT_TYPE", "TRANSACTION_REF", "NET_TRANSACTION_VALUE", "TRANSACTION_VAT_VALUE", "EFFECTIVE_START_DATE", "EFFECTIVE_END_DATE", "BILL_DATE", "SYSTEM_TIME", "TRANSACTION_DESC1", "TRANSACTION_DESC2", "TRANSACTION_ID", "TRANSACTION_TYPE_2"}),
    #"Replaced Errors" = Table.ReplaceErrorValues(#"Removed Other Columns", {{"ACCOUNT_NUMBER", null}, {"TRANSACTION_DATE", null}, {"DOCUMENT_TYPE", null}, {"TRANSACTION_REF", null}, {"NET_TRANSACTION_VALUE", null}, {"TRANSACTION_VAT_VALUE", null}, {"EFFECTIVE_START_DATE", null}, {"EFFECTIVE_END_DATE", null}, {"BILL_DATE", null}, {"SYSTEM_TIME", null}, {"TRANSACTION_DESC1", null}, {"TRANSACTION_DESC2", null}, {"TRANSACTION_ID", null}, {"TRANSACTION_TYPE_2", null}}),
    #"Uppercased Text" = Table.TransformColumns(#"Replaced Errors",{{"TRANSACTION_REF", Text.Upper, type text}}),
    #"Filtered Rows1" = Table.SelectRows(#"Uppercased Text", each [SYSTEM_TIME] <= Cutoff),
    #"Sorted Rows" = Table.Sort(#"Filtered Rows1",{{"ACCOUNT_NUMBER", Order.Ascending}})
in
    #"Sorted Rows"
 
One thing that pops out: why do you connect to a folder when afterwards you select only the first file?

I'll have a closer look at this tomorrow, since it's evening on my end.
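
Just as a thought, a minimal sketch of connecting straight to the single file instead of scanning a folder first (the exact path and file name here are assumptions based on the filter in your query):

Code:
let
    // FilePath is assumed to point at the folder holding the csv;
    // the file name is taken from the filter in the posted query
    Source = Csv.Document(
        File.Contents(FilePath & "Velocity - All Data.csv"),
        [Delimiter = ",", Columns = 20, Encoding = 1252, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    #"Promoted Headers"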
 
Sorry, in trying to post the code in this forum, I brought in the previous query as part of this one. I actually have a separate query that shows the contents of the folder, and then a number of queries that reference it to pull in different files - this being one of them.
 
So I had some time just now to try to make sense of the query steps.

You are sorting the data in your last step. On a huge amount of data that is intensive and time-consuming. Is it really needed? Since you're loading into Power BI, no-one will ever see the table, and when building a matrix you can sort it any way you want.

As for the other steps, I would avoid the hard coding, but that won't affect your query performance. Why is the upper case required on the transaction reference?

I've loaded csv files in Excel (so not even Power BI) that combined into a table holding over 100 million rows. It took 15 minutes to load, but my laptop could handle it: an 11th-generation i7 with 32GB of RAM. You have slightly more columns (14, while I had 10). This was for demo purposes, so without any "fancy" transformations or M-code trickery.

Last thing: you can try converting your query to a function and applying the function to the csv content. I've witnessed serious efficiency gains doing so, though I cannot explain why; in the end it's the same steps being applied to the same source data.
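
A minimal sketch of that pattern, with a made-up function name fnTransformVelocity that takes the file's binary content and applies the same steps as your posted query:

Code:
// fnTransformVelocity - hypothetical function wrapping the transformation steps
(Content as binary) as table =>
let
    #"Imported CSV" = Csv.Document(Content, [Delimiter = ",", Columns = 20, Encoding = 1252, QuoteStyle = QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Imported CSV", [PromoteAllScalars = true])
    // ... then the same type changes, column selection, error handling and filters as in the original query ...
in
    #"Promoted Headers"

The calling query then only filters the folder query down to the file and passes #"Filtered Rows"{0}[Content] into the function.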
 
Thanks for the feedback. I'll try the function suggestion.

The reason for sorting is that, in other situations, I want to be able to extract just one ACCOUNT_NUMBER. I was planning to sort the data by ACCOUNT_NUMBER, then re-export it as a new csv. Then, on that new csv, create a grouping of first row per account and number of rows per account. That way I can use Table.Range to extract a single customer rather than filtering with Table.SelectRows, which I believe will be significantly faster.
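
A rough sketch of that lookup, assuming the sorted table is called Sorted and using an example account number; an index column plus Table.Group gives the first-row offset and row count per account, and Table.Range then slices out a single account:

Code:
let
    // Sorted is assumed to be the table already sorted by ACCOUNT_NUMBER
    WithIndex = Table.AddIndexColumn(Sorted, "Index", 0, 1, Int64.Type),
    // One row per account: the offset of its first row and how many rows it spans
    AccountIndex = Table.Group(WithIndex, {"ACCOUNT_NUMBER"}, {
        {"FirstRow", each List.Min([Index]), Int64.Type},
        {"RowCount", each Table.RowCount(_), Int64.Type}
    }),
    // Example account number; look up its offset/count and slice without scanning every row
    Target = Table.SelectRows(AccountIndex, each [ACCOUNT_NUMBER] = 123456){0},
    SingleAccount = Table.Range(Sorted, Target[FirstRow], Target[RowCount])
in
    SingleAccount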
 
Puzzled again by why you'd want to export to another CSV.
One can add a filter option in the Power BI report.

Obviously I do not have any view on the use case and working environment; it just feels like you're looking for solutions that cause new problems, in the end overcomplicating things.

Though I have not done it myself, I hear it's feasible to start a new Power BI report based on an existing data model. Once the report is published in the cloud, not only the report but also the data model is uploaded there and available for others to use. Maybe it needs to be shared first; I should look into that.

Other BI tools offer a "coordinated" field, meaning a separate report is created automatically for each value in a filter, much like a pivot table in Excel can create a sheet for each value in a pivot filter. I do not know if Power BI offers something similar though.

Anyway, good luck with creating the function. Fingers crossed there is some improvement.
 