Using PowerQuery to extract only column headers from multiple files

DashboardNovice · May 18, 2021

I have 100 files, each with a tab called Project. Each Project tab consists of column headers and data under those columns. Each tab also has about 8 rows before the column headers. Those 8 rows hold information such as the name of the project manager, etc. These rows are junk so I can remove them with the Remove Rows feature. I am mentioning this just in case this effects what I am trying to do.

I am trying to extract the column headers from each Project tab (not the data values), then I want to dump them into Excel. Instead of promoting my column headers I will leave them in place, which means the top row in the editor will be named like Column1, Column2, etc. which is ok with me.

If you load the two attached files from a folder, then view this in the PowerQuery editor, I want to Close & Load only rows 5 and 14. In this case, I can simply filter Column1 for 'Project Code' or something like that, but in reality I am dealing with 100 files with hundreds of rows each, so it won't be as easy as filtering for some value. The reason I say this is because in my real life example, under any given column number you will have the correct column heading from my source data, the values in the source data and possibly incorrect column headings from my source data. When you use the AutoFilter to view all of those possibilities I can spot the correct column heading, but the problem is with all of the other values in the list. The vast majority of those values are data values that fall under the column in my source data, but mixed in there, you might have an incorrect column heading and it is virtually impossible to tell which of the values in the filter is an incorrect column heading.

So in the editor I want to select only for column headings (whether correct or incorrect) thereby weeding out individual data values. I suppose another approach (this is less desirable because it involves loading more rows) would be to load everything from the editor (or just bypassing the editor) to a blank tab, then filter out all data values, leaving only column headers (whether correct or incorrect).

I hope this isn't too confusing but this is the best way I can explain it. I want to extract only column headers and leave all data values under those headers behind. Ultimately I would need to find a way to "mark" the rows that are column headers but I don't see a way to do that.

Edit: One more important point. Which columns from the Excel file will be displayed in the query editor? That seems to depend on which file you choose as the 'Sample' file after you click Combine & Load or Combine & Transform. Is this true? If so, this is a problem because out of the 100 files I am working with I have no way of knowing which files contain additional columns and which do not. If I understand how all this works, that would mean I would have to select a file where additional columns were added (I have no way of knowing which files those are, and the whole point of what I am trying to do is to identify which files contain extra columns that were wrongfully inserted), so that those incorrect values display in the editor. If I happen to select a file where all column headings are correct and there are no additional columns inserted, then I believe that the incorrect columns wouldn't appear in the editor which is not good. This appears to be how it works with my two attached sample files. Is this true? In order to identify cases where extra columns were inserted, do I need to select one of those files as the 'Sample' file?

Excel Wizard · May 19, 2021

That is a very long explanation, I couldn't finish reading it.

Maybe put some expected result is easier.

My guess is

Code:

let
    Source = Folder.Files("D:\New folder"),
    GTable = Table.ExpandTableColumn(Table.AddColumn(Source, "S", each Excel.Workbook([Content])), "S", {"Data"}, {"Data"}),
    PromoteHeader = Table.TransformColumns(GTable,{"Data",each Table.PromoteHeaders(Table.SelectRows(_, each [Column1]<>null)) }),
    Combine = Table.Combine(PromoteHeader[Data]),
    KeptRows = Table.FirstN(Combine,0)
in
    KeptRows

Change folder name D:\New folder

Using PowerQuery to extract only column headers from multiple files

DashboardNovice

Member

Attachments

Excel Wizard

Active Member

Attachments