How to get a random sample of data with Power Query

This Power Monday trick is about getting a random sample with Power Query. It is based on my experience of working with large volumes of data.

The other day I was building a hotel dashboard (more on this later). As part of the dashboard, I wanted to show a random sample of user reviews. The reviews database had quite a few rows, so I wanted to extract a randomized sample of 100 reviews and show them in the report. Whenever you refresh the report (Data > Refresh), a new set of reviews is fetched and shown.

Let’s learn how to generate a random sample with Power Query in this article.

This tutorial works in Power Query for Excel or Power BI. In Excel, the output sample is loaded either to a table or to the data model. In Power BI, the output goes to your data model.

If you want to get a random sample with Excel formulas, read this.

5 Steps to create a random sample with Power Query

Step 1: Get your data to Power Query

Simple. Grab the data you want to sample and bring it to PQ. At this point, you will get something like this:

[Screenshot: the source data loaded in Power Query]

Step 2: Add Random Numbers as a column

Go to “Add Column” > Custom Column and add this formula.

=Number.Random()

Remember: Power Query formulas are case-sensitive, so type it exactly as shown. Name this column “Random”.
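Behind the scenes, the Custom Column dialog writes a Table.AddColumn step. Here is a minimal sketch of what that step could look like, assuming the data sits in an Excel table named myData (the name used in the M code later in this article). The step name and the type annotation are my own choices, not something the dialog is guaranteed to produce.

```powerquery
let
    // Assumption: the source is an Excel table named "myData"
    Source = Excel.CurrentWorkbook(){[Name="myData"]}[Content],
    // Add a column named "Random"; Number.Random() returns a number between 0 and 1
    #"Added Random" = Table.AddColumn(Source, "Random", each Number.Random(), type number)
in
    #"Added Random"
```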

But Power Query gives the same random number in all rows…

That is right. Power Query evaluates queries lazily, so it may compute Number.Random() once and reuse the value. As a result, every row can end up with the same random number (unlike Excel’s RAND() filled down a column).

Note: your experience with Number.Random() could be different, but as you build transformations, at some point PQ will replace all the numbers with the same value.

So how do we get different numbers per row? Simple, we force PQ to evaluate something per row. A simple thing like an index number column will do. This forces PQ to run the random formula for every row.

Hat tip to Gil Raviv for suggesting this technique in a forum post.     

Step 3: Add Index Number column & Sort the random numbers

Go to “Add Column” > Index Column. Now that we have index numbers in a column, PQ is forced to regenerate the random number for each row.

Select the random number column and sort it.

Note: You may need to switch Steps 2 & 3 (add the index column first, then the random column) if the random numbers are the same all the way through.
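In M, the index, random and sort steps could look roughly like this. It is a sketch with the index column added first (the switched order mentioned in the note), again assuming a source table named myData. The step names are illustrative.

```powerquery
let
    // Assumption: the source is an Excel table named "myData"
    Source = Excel.CurrentWorkbook(){[Name="myData"]}[Content],
    // Index column first, so every row is distinct before the random number is added
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
    // One random number per row
    #"Added Random" = Table.AddColumn(#"Added Index", "Random", each Number.Random(), type number),
    // Sorting by the random column shuffles the rows
    #"Sorted Rows" = Table.Sort(#"Added Random", {{"Random", Order.Ascending}})
in
    #"Sorted Rows"
```

If the numbers still come out identical, wrapping the step that adds the random column in Table.Buffer is a commonly suggested workaround. I have left it out here to keep the sketch close to the steps described.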

Step 4: Keep top 100 rows

Go to Home > Keep Rows > Keep Top Rows. Enter the sample size you want (100) and Click OK. Your sample is ready.

Step 5: Remove the Random & Index columns

Now that our sample is ready, let’s remove the random & index number columns. We do not need them in the final output (or model). Click on Close & Load in Excel (or Close & Apply in Power BI).

Enjoy the sample.
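If you need the same sampling in several queries, the five steps can be wrapped in a small custom function. This is only a sketch: RandomSample is a hypothetical helper name, and it assumes the input table does not already have columns called Index or Random.

```powerquery
let
    // Hypothetical helper: returns n randomly chosen rows of tbl, without repetitions.
    // Assumption: tbl has no existing columns named "Index" or "Random".
    RandomSample = (tbl as table, n as number) as table =>
        let
            AddedIndex = Table.AddIndexColumn(tbl, "Index", 0, 1),
            AddedRandom = Table.AddColumn(AddedIndex, "Random", each Number.Random(), type number),
            Sorted = Table.Sort(AddedRandom, {{"Random", Order.Ascending}}),
            // Keep the top n rows of the shuffled table
            KeptTop = Table.FirstN(Sorted, n),
            // Drop the helper columns so only the original columns remain
            Cleaned = Table.RemoveColumns(KeptTop, {"Random", "Index"})
        in
            Cleaned,
    // Assumption: the source is an Excel table named "myData"
    Source = Excel.CurrentWorkbook(){[Name="myData"]}[Content],
    Sample = RandomSample(Source, 100)
in
    Sample
```

If the table has fewer than n rows, Table.FirstN simply returns all of them.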

How to get random sample with repetitions?

The above technique gives a sample without repetitions. What if you need a sample with repetitions (i.e. memoryless sampling), for example a series of dice throws or coin tosses?

We can use Power Query to get such samples too. This is slightly more complicated than the first technique, but fun to try.

  1. Load your source to PQ
  2. Group the data so you can get the row count while still keeping the data. Use advanced grouping with two aggregations: Count (Count Rows) and Data (All Rows).
  3. Add a custom column with a list of 100 numbers =List.Numbers(1,100)
  4. Expand the list to new rows
  5. Add a column with a random number between 0 & row count-1 =Number.RandomBetween(0,[Count]-1)
  6. Add index column
  7. Change the random number to a whole number
  8. Extract the row at the random position from [Data] to a new column =[Data]{[Random]}
  9. Remove all columns except the new column from #8
  10. Expand the column
  11. Your sample with possible repetitions is ready.

Here is the full M code for you to customize.

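let
    // Load the source table (named "myData") from the current workbook
    Source = Excel.CurrentWorkbook(){[Name="myData"]}[Content],
    // Group everything into one row: the total row count plus the full table
    #"Grouped Rows" = Table.Group(Source, {}, {{"Count", each Table.RowCount(_), type number}, {"Data", each _, type table}}),
    // Add a list of 100 numbers and expand it, giving one row per sample item
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "List", each List.Numbers(1,100)),
    #"Expanded List" = Table.ExpandListColumn(#"Added Custom", "List"),
    // Random position between 0 and Count-1 (row positions start at 0)
    #"Added Custom1" = Table.AddColumn(#"Expanded List", "Random", each Number.RandomBetween(0,[Count]-1)),
    // The index column forces a separate random number per row
    #"Added Index" = Table.AddIndexColumn(#"Added Custom1", "Index", 0, 1),
    // Convert the random number to a whole number
    #"Changed Type" = Table.TransformColumnTypes(#"Added Index",{{"Random", Int64.Type}}),
    // Fetch the row at the random position as a record
    #"Added Custom2" = Table.AddColumn(#"Changed Type", "Custom", each [Data]{[Random]}),
    // Keep only the new column
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom2",{"Data"}),
    #"Removed Columns1" = Table.RemoveColumns(#"Removed Columns",{"Count", "List", "Random", "Index"}),
    // Expand the record into columns; replace these names with your own columns
    #"Expanded Custom" = Table.ExpandRecordColumn(#"Removed Columns1", "Custom", {"Review Text", "Rating"}, {"Review Text", "Rating"})
in
    #"Expanded Custom"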
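For comparison, here is a more compact sketch of sampling with repetitions. It swaps in a different technique from the query above: List.Random generates all the random numbers in one go, and each number is rounded down to a row position. Rounding down is my own choice; as far as I can tell, rounding a decimal between 0 and Count-1 to the nearest whole number picks the first and last rows a little less often than the rest. The step names are illustrative and the table is assumed to be non-empty.

```powerquery
let
    SampleSize = 100,
    // Assumption: the source is a non-empty Excel table named "myData".
    // Buffer it so repeated row lookups read from memory.
    Source = Table.Buffer(Excel.CurrentWorkbook(){[Name="myData"]}[Content]),
    RowCount = Table.RowCount(Source),
    // SampleSize random numbers between 0 and 1
    Randoms = List.Random(SampleSize),
    // Scale each number to a row position; List.Min guards against a value of exactly 1
    Positions = List.Transform(Randoms, each List.Min({Number.RoundDown(_ * RowCount), RowCount - 1})),
    // Look up each position as a record; the same row can be picked more than once
    PickedRows = List.Transform(Positions, each Source{_}),
    // Rebuild a table from the picked records
    Sample = Table.FromRecords(PickedRows)
in
    Sample
```

This version keeps all the source columns, so there is no column list to maintain. The column data types may need to be set again after Table.FromRecords.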

Answers to your questions about sampling…

How to get another sample?

Simple. Just refresh your Power Query connection. You will get another sample.

How to change the sample size?

In the M code, replace 100 with another number or a parameter wherever it appears.

Use Excel Cell to tell Power Query how big a sample you want…

You can even use an Excel named cell to tell PQ what sample size you want. Assuming the named cell sample.size holds the size, use this M code =Excel.CurrentWorkbook(){[Name="sample.size"]}[Content][Column1]{0} to get the value in your query. Use it in place of the hard-coded 100 in the other steps and bingo, your sample size changes with the cell.
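Put together with the first technique, that could look something like this. It is a sketch that assumes a single-cell named range called sample.size and a source table named myData. Number.From is my addition, in case the cell value arrives as text. Excel.CurrentWorkbook is only available in Excel, so in Power BI you would use a query parameter instead.

```powerquery
let
    // Assumption: a single-cell named range "sample.size" holds the sample size
    SampleSize = Number.From(Excel.CurrentWorkbook(){[Name="sample.size"]}[Content][Column1]{0}),
    // Assumption: the source is an Excel table named "myData"
    Source = Excel.CurrentWorkbook(){[Name="myData"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1),
    #"Added Random" = Table.AddColumn(#"Added Index", "Random", each Number.Random(), type number),
    #"Sorted Rows" = Table.Sort(#"Added Random", {{"Random", Order.Ascending}}),
    // The cell value replaces the hard-coded 100
    #"Kept First Rows" = Table.FirstN(#"Sorted Rows", SampleSize),
    #"Removed Columns" = Table.RemoveColumns(#"Kept First Rows", {"Random", "Index"})
in
    #"Removed Columns"
```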

Other questions…?

Struggle sampling some sensible set? Post your sample problem in comments so I or one of our excellent readers can help you.

Download sample file and get your samples…

Excuse the pun, but here is a sample file with all the M code for making your own samples. Examine the queries to learn how this is done.

How do you sample?

Excel’s RAND() is my favorite way to sample. But now that I am spending more time with Power Query & Power BI, I needed another way to sample the data. This post outlines my preferred approach (unless I am dealing with very large volumes of data). For large volumes of data, I suggest sampling at server-side through SQL.
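As a rough illustration of server-side sampling, Power Query can pass a native SQL statement to the database so that only the sampled rows come back. The sketch below assumes a SQL Server source. The server, database and table names are hypothetical, and ORDER BY NEWID() is one common T-SQL way to shuffle rows, not the only one.

```powerquery
let
    // Hypothetical server, database and table names; replace with your own.
    // TOP (100) with ORDER BY NEWID() asks SQL Server for 100 rows in random order.
    Source = Sql.Database(
        "myServer",
        "myDatabase",
        [Query = "SELECT TOP (100) * FROM dbo.Reviews ORDER BY NEWID()"]
    )
in
    Source
```

Power Query may ask you to approve the native query the first time it runs. The server still has to read the table to shuffle it, but only 100 rows travel to your workbook or model.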

What about you? How do you sample? Share your approach or troubles in the comments.

New to Power Query? Check out this introduction tutorial.

