How to get a random sample of data with Power Query

Share

Facebook
Twitter
LinkedIn

This Power Monday trick is about random sample with Power Query. This is based on my experience of working with large volumes of data.

The other day I have been building a hotel dashboard (more on this later). As part of the dashboard, I wanted to show a random sample of user reviews. Reviews database had quite a few rows, so I wanted to extract a randomized sample of 100 reviews and show them in the report. When you refresh the report (Data > Refresh), then a new set of reviews will be fetched and shown.

howto get random sample in power query

Let’s learn how to generate a random sample with Power Query in this article.

This tutorial works in Power Query for Excel or Power BI. In case of Excel, the output sample will be either loaded as table or to data model. In case of Power BI, output goes to your data model.

If you want to get random sample with Excel formulas, read this.

5 Steps to create random sample with Power Query

Step 1: Get your data to Power Query

Simple. Grab the data you want to sample and bring it to PQ. At this point, you will get something like this:

random sample with power query - data

Step 2: Add Random Numbers as a column

Go to “Add Column” > Custom Column and add this formula.

=Number.Random()

Remember: Power Query formulas are case-sensitive. So type exactly. Name this column “Random”

But Power Query gives same random number in all rows …

That is right. As Power Query is a parallel language, each row gets same random number (unlike Excel’s RAND() filled down a column).

Note: your experience with Number.Random() could be different, but as you build transformations, at some point PQ will replace all numbers with same value.

So how to get different numbers per row? Simple, we force PQ to evaluate something per row. A simple thing like index number column will do. This will force PQ to run random formula for all rows.

Hat tip to Gil Raviv for suggesting this technique in a forum post.     

Step 3: Add Index Number column & Sort the random numbers

Go to “Add column” > Index number. Now that we have index numbers in a column, this will force PQ to regenerate the random number per row.

add an index number column

Select the random number column and sort it.

Note: You may need to switch Steps 2 & 3 if the random numbers are same all the way thru.

Step 4: Keep top 100 rows

Go to Home > Keep Rows > Keep Top Rows. Enter the sample size you want (100) and Click OK. Your sample is ready.

keep top random rows

Step 5: Remove the Random & Index columns

Now that our sample is ready, let’s remove the random & index number columns. We do not need them in the final output (or model). Click on Save & Load (or Close & Apply).

Enjoy the sample.

How to get random sample with repetitions?

The above technique gives a sample without repetitions. What if you need a sample with repetitions (ie memory-less sampling). For example, a series of dice throws or coin tosses?

We can use Power Query to get such samples too. This is slightly complicated compared to first technique, but fun to try.

  1. Load your source to PQ
  2. Group the data so you can get row count (while still keeping the data). Like this:
    Advanced grouping in Power Query for random sampling with repetitions
  3. Add a custom column with a list of 100 numbers =List.Numbers(1,100)
  4. Expand the list to new rows
  5. Add a column with random number  between 0 & row count-1 =Number.RandomBetween(0,[Count]-1))
  6. Add index column
  7. Change random number to whole number
  8. Extract the random row number from [Data] to a new column =[Data]{[Random]}
  9. Remove all other columns except this new column in #8
  10. Expand the column
  11. Your sample with possible repetitions is ready.

Here is the full M code for you to customize.

let
    Source = Excel.CurrentWorkbook(){[Name="myData"]}[Content],
    #"Grouped Rows" = Table.Group(Source, {}, {{"Count", each Table.RowCount(_), type number}, 
{"Data", each _, type table}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "List", each List.Numbers(1,100)),
    #"Expanded List" = Table.ExpandListColumn(#"Added Custom", "List"),
    #"Added Custom1" = Table.AddColumn(#"Expanded List", "Random", 
each Number.RandomBetween(0,[Count]-1)),
    #"Added Index" = Table.AddIndexColumn(#"Added Custom1", "Index", 0, 1),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Index",{{"Random", Int64.Type}}),
    #"Added Custom2" = Table.AddColumn(#"Changed Type", "Custom", each [Data]{[Random]}),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom2",{"Data"}),
    #"Removed Columns1" = Table.RemoveColumns(#"Removed Columns",{"Count", "List", "Random", "Index"}),
    #"Expanded Custom" = Table.ExpandRecordColumn(#"Removed Columns1", "Custom", {"Review Text", "Rating"},
 {"Review Text", "Rating"})
in
    #"Expanded Custom" 

Answers to your questions about sampling…

How to get another sample?

Simple. Just refresh your Power Query connection. You will get another sample.

How to change the sample size?

In the M code, where it says 100 replace with another number or parameter.

Use Excel Cell to tell Power Query how big a sample you want…

You can even use an Excel named cell to tell PQ what sample size you want. Assuming named cell sample.size has the size, use this M code  =Excel.CurrentWorkbook(){[Name=”sample.size“]}[Content][Column1]{0} to get the value in your query. Use it as part of other steps and bingo, your sample size changes.

Other questions…?

Struggle sampling some sensible set? Post your sample problem in comments so I or one of our excellent readers can help you.

Download sample file and get your samples…

Excuse the pun, but here is a sample file with all the M code for making your own samples. Examine the queries to learn how this is done.

How do you sample?

Excel’s Rand() is my favorite way to sample. But now that I am spending more time with Power Query & Power BI, I needed another way to sample the data. This post outlines my preferred approach (unless I am dealing with very large volumes of data) For large volumes of data, I suggest sampling at server-side thru SQL.

What about you? How do you sample? Share your approach or troubles in the comments.

New to Power Query? Check out this introduction tutorial.

Facebook
Twitter
LinkedIn

Share this tip with your colleagues

Excel and Power BI tips - Chandoo.org Newsletter

Get FREE Excel + Power BI Tips

Simple, fun and useful emails, once per week.

Learn & be awesome.

Welcome to Chandoo.org

Thank you so much for visiting. My aim is to make you awesome in Excel & Power BI. I do this by sharing videos, tips, examples and downloads on this website. There are more than 1,000 pages with all things Excel, Power BI, Dashboards & VBA here. Go ahead and spend few minutes to be AWESOME.

Read my storyFREE Excel tips book

Overall I learned a lot and I thought you did a great job of explaining how to do things. This will definitely elevate my reporting in the future.
Rebekah S
Reporting Analyst
Excel formula list - 100+ examples and howto guide for you

From simple to complex, there is a formula for every occasion. Check out the list now.

Calendars, invoices, trackers and much more. All free, fun and fantastic.

Advanced Pivot Table tricks

Power Query, Data model, DAX, Filters, Slicers, Conditional formats and beautiful charts. It's all here.

Still on fence about Power BI? In this getting started guide, learn what is Power BI, how to get it and how to create your first report from scratch.

28 Responses to “FIFA Worldcup 2018 Excel Tracker – FREE Download”

  1. Andy F says:

    Good work as always - I liked the way you did the "menu" on the left hand side (although the buttons aren't lined up between tabs if I'm being ultra picky)

    Have you previously written about the method of extracting the Wikipedia page into Power Query? It's not something I recall seeing before.

    ps other geeky observsations:

    - the bracket columns are too narrow for the date & match number - and will need to be wider still when the team names get populated
    - match 51 should be Moscow (Luzhniki) for consistency
    - it's not possible to be 23 hours ahead of GMT - the International Dateline gets in the way! I think the maximum is 14. There are also a couple of countries who work to a quarter hour to make it really complicated!
    - There's a typo in the how-to - "compated" instead of compared

    • Chandoo says:

      Thanks for the lovely feedback. I have fixed almost all of them.

      1) button alignment: this is tricky as row heights can change between sheets.
      2) Column width is fixed now so bracket view looks better
      3) Updated the stadium name
      4) Did not bother with the 23 hours ahead thingie. This is more of a novelty feature 😛
      5) Fixed the type
      6) Fixed an issue with live score table. This should work as long as the points table is maintained in wikipedia page - https://en.wikipedia.org/wiki/2018_FIFA_World_Cup

      7) I have not discussed the technique of reading all tables on webpage to one big table. Watch out for a blog post on this soon.

      • Andy F says:

        Button alignment is one for the ultra-OCD sufferers 😉 There are ways, but only for those with too much time on their hands.

        • Chandoo says:

          Aah, Excel. The perfect tool for people like us. Everything (cells) is in same shape and size by default and aligned perfectly. 😀

          • Etienne says:

            Is there actually a way to copy row heights (in the same way you copy column widths?)

            By the way Chandoo, great post. I'm forwarding to my department. I actually use another query from the same page to automatically fill in the team names for the knock-out stages (I made one for round of 16 which I then duplicated and edited for quarter-finals etc.) This is incredible, I was always wondering how to do these type of queries from the web, and now I know 🙂

  2. Jake says:

    Hello!

    This is quite amazing and incredibly cool to use 🙂

    Testing the constraints of this sheets a few errors popped I noticed:
    - Vlookup Group E-H refers to column J instead of E (eg. Brazil gets the same points as Russia because the formula looks up Russia twice)
    - Power query only has 29 lines, the overview of has 32 but the 3 countries from group A are lost as the overview is refreshed - causing N/A in the group stage colums

    • Chandoo says:

      @Jake.. thank you. I am sorry for the errors. I could not test the live points table until the games began. I see my folly now. I have fixed both issues and uploaded a new file. As the points table relies on a wikipedia page, if someone decides to change the layout or rename a column it can seriously harm this template. I took some precautions in the Power Query layer to adjust column names dynamically etc, but it is not foolproof.

      Try downloading the newer version and let me know if you see something funny.

      • Jake says:

        No worries!
        Was able to fix the vlookup myself but the power query had me bit stumped 🙂 And wanted to give you a heads-up to everyone can enjoy it!

        Thanks for the awesome sheet!

  3. Darren G says:

    Hi,
    Thanks for sharing this world cup tracker. Certainly makes it more interesting when the data is current. As a newbie, it also helps to have a couple of mistakes to find whether unintentional or not.

    Thanks again

  4. Christian says:

    Hi,

    Your v-lookups in the "Group Stage" tab for groups E, F, G, and H (all the ones under column O) are pointing to the wrong country. They all point to column J, so whatever happens to the countries in column J will also be reflected for the countries in the groups in column O for that same row.
    Just thought I'd call that out. Thanks for the great work on this!

    • Chandoo says:

      @Christian... Thanks for trying this and letting me know about lookups. I have fixed the issue now. Please download latest version for that and few more fixes.

  5. Sheeloo says:

    Refresh All did not work correctly. Team names vanished though points were updated.

    • Chandoo says:

      @Sheeloo... Can you please try with latest version (download again using above links). I tested up to latest Iran's stunning win over Morocco and it works.

  6. Gsm says:

    Dear Chandoo

    Thanks a lot for this worksheet.

    However, while refreshing the data, I am getting error message as "Initialisation of Data Source failed".

    • Chandoo says:

      May I know what version of Excel you are using? Do you have internet connectivity? If you are familiar with Power Query, try tracing the steps in the query editor. And oh, first start with the latest version of file (link above).

  7. Andy F says:

    @Etienne - yes. Copy row, paste formats will do it, although obviously that will bring the formats of every cell in the row as well as the height.

  8. Rob Tsintas says:

    Latest version seems to be working well.

    One request: the Groups & Points tables on the Group Stage sheet have the team names pre-entered. This means they don't get sorted according to the results.
    On my copy, I've changed them to a lookup, so they appear in the same order as the points table. It would be good if you can do the same if/when you release a new update!

    Here's what I did. It's not the most elegant, but it works, and I didn't have much time to spend on it!

    Using helper values of 1,2,3,4 in columns I and N for each group, the formula for the first team name in group A (cell J4) is:

    =INDEX(points[Team],MATCH(OFFSET(J4,-(I4),0),points[Group],0)+(I4-1))

    This can be copied & pasted to the other team name cells.

    Cheers!

    • Chandoo says:

      Good suggestion. I have made changes to the points table to remove lookups and just show teams in the order they appear in the detailed table. This way, You will see top two teams on first two rows. We could highlight them as well (figured this would make it look like a bowl of M&Ms, so didn't bother) or highlight *YOUR* team.

  9. Paige says:

    I consider my Excel skills as above average but far from guru and I love how your little projects like this get me to look at data in a new way. I would like to expand on the data in the points table through the use of some calculations but I am a little challenged by the data coming across as text. The Pts column is easy to deal with, but I'm having problems with the GD. The negative goal differential looks like it may be noted with an en dash instead of a minus sign, but if I search for an en dash in the data Excel doesn't find any. I would like to include conversion to a minus sign in my little macro so I can get everything to numbers but so far I am not having any luck. Any thoughts? Thanks for your help.

    • Chandoo says:

      Thanks for such kind words 🙂

      I suggest adding an extra step in Power Query to convert points, GD & other columns to numbers. You can replace em dash in PQ. I did not do it as this will add another layer of dependency and should the wikipedia page change, one more reason for the query to fail.

  10. Petros says:

    As always, an awesome spreadsheet from Chandoo. I love the Power Query score update without macros. The country watch-out is a unique feature as well!

    For those who like a predictor template with flag lookup and a ribbon UI, here is our spreadsheet:

    https://www.spreadsheet1.com/fifa-world-cup-2018-russia-free-prediction-templates-for-excel.html

  11. Pranav says:

    Great template!

    I came across another one with image vlookups for country flags

    https://eexcel.co.uk/downloads (World_Cup_2018_Sweepstake.xlsx)

  12. Sean says:

    This is a great Template.
    I am running Excel 2010 with the PowerQuery add-in running.
    The scores will not update, so I followed the error and the second operation (Fitlered rows) says that the table is empty.
    After a few minutes on Wikipedia, I realise that my PowerQuery skills are not good enough to work out what the issue is.
    Any suggestions?
    I would like to fix it myself is possible.
    Thanks,
    Sean.

  13. Juan Pablo Diez says:

    Where can I see the results for a specific match?

    Thanks!

    • Chandoo says:

      @Juan... You can now. I have included a results tab that shows match scores. This too is a live table. Just refresh data to get new results. Please download latest version file from links above to use this feature.

      PS: There is another version coming soon with all goals too. I just have to spend some more time polishing the Json to table Power Query thingie.

Leave a Reply