Recently, Bluetaurean asked in the Chandoo.org Forums about ways to allocate work durations for various product lines across 24 hour days to create a daily schedule.
Both formula-based and VBA-based solutions were offered.
Today at formula Forensics we will take a look at the formula-based approach.
As always at Formula Forensics you can follow along, Download Here – Excel 2007-2013.
Set the Scene
Since one might encounter a similar need in a variety of contexts (manufacturing, engineering, project planning, etc.), we will look at a more general problem of allocating a set of tasks and corresponding durations to one or more days, as shown in the following diagram.
We will create two output views:
- One that is a flat list that can then be manipulated further using Excel’s Pivot table feature, and
- Another view that mimics a pivot-table (and is similar to a typical project Gantt view, but with actual values listed instead of a bar chart).
You can follow along using the attached Excel document. Download here Excel 2007+
Problem Specifics
- We have a list of tasks and their durations.
- We need to distribute the tasks to different days, without exceeding the maximum available duration in a given day.
- When the hours in a day are “used up”, we need to allocate the remaining task duration to the next day, and so on.
- On the other hand, if a given task does not use up all of the hours in a given day, we will need to assign more than one task for that day, provided the combined durations do not exceed the available hours for that day.
- In other words, we will need to split a task across one or more days, or combine one or more tasks into a single day, as needed, to maximize the work performed in a given day.
Developing the Approach
Before we tackle this problem in Excel, let us review how we might do this manually. Like most things, we might use the following three step process:
- Take the first task and assign its duration to Day 1. If the task’s duration exceeds the maximum hours available in a day, allocate the portion of the duration that does not fit into Day 1 into Day 2.
- Take the second task, and see whether it can fit into an existing day, or whether it needs to be distributed to multiple days
- Etc. (OK… so that three-step process was a stretch!)
Statistics show that most people think in terms of IF-THEN-ELSE statements. So here it is…
For a given Day, and for a given Task, If [Hours Not Allocated For that Task] > [Hours Available for that Day] Then Set Duration for that Day as [Hours Available for that Day] Else Set Duration for that Day as [Hours Not Allocated for that Task] EndContinue the above evaluation until all tasks have been allocated to days.
Of course, the above IF() logic can be condensed as follows:
MIN( [Hours Not Allocated For that Task] , [Hours Available for that Day] )
Putting it All Together: Output Option 1: Gantt-like View
Let us employ the above approach to create the Gantt-like view.
To make our approach more generic, we will use an Excel Name called “MaxHrsPerDay” to indicate the maximum available hours in a given day. (In the sample worksheet, it has been set to 24 hours.)
Our source data is setup as shown in the diagram below:
- Tasks are in the range A2:A5
- Durations are in the range B2:B5
We will create the output in a separate worksheet, in the range A1:E5 as shown below:
Put the following formula into cell A2 and copy down to A5:
=SourceData!$A2
(This formula is merely referencing the values from the SourceData sheet. The sample workbook also includes an approach to make this reference more location independent.)
Put the following formula in cell B2, and copy it down and right:
=MIN((SourceData!$B2-SUM($A2:A2)), (MaxHrsPerDay-SUM(B$1:B1)))
Setup the header row (B1:E1) as desired. (I have used text values for the header. You could also calculate the header text using formulas. Since that is straightforward, I will leave that as an exercise for the reader.)
Now let us look at what the formula in cell B2 is doing:
- SUM($A2:A2) is calculating the sum of the allocated durations for TaskA. (Please note the use of absolute and relative references. The formula is anchored on column A, but the starting row, ending row and ending column are free to expand.) SUM($A2:A2) returns zero since SUM() ignores text values.
– If you look at cell C2, the reference changes to SUM($A2:B2).
– In cell B3, the reference changes to SUM($A3:A3). You get the idea
- (SourceData!$B2-SUM($A2:A2)) calculates the difference between the duration for TaskA (40 in the example) and the hours allocated as of that point (0), to return 40-0=40.
- SUM(B$1:B1) is calculating the sum of the allocated hours for Day1. (Again, we are using a combination of absolute and relative references to keep the calculation anchored on column B.) In this case, the value is zero, since this is the first allocation for Day1.
- (MaxHrsPerDay-SUM(B$1:B1)) calculates the hours remaining (i.e. available) for Day1. Since this is for cell B2, the calculation returns 24 – 0 = 24.
That is it!
We put those absolute and relative references to good use!
This approach was easy because all we had to do was calculate the duration for a given task for a given day.
On the other hand, if we had to figure out what the Task was, or which Day it was, the calculation gets a little more involved. Since this is “formula forensics”, we would not have it any other way! 🙂
Putting it All Together: Output Option 2: A Sequential List of Tasks and Durations for Each Day (i.e. a Flat List)
As before, we will use the Excel Name “MaxHrsPerDay” to refer to the maximum hours in a Day.
As shown in the following diagram, we will turn the source data into a flat list of Days, Tasks and Durations:
Unlike with VBA, since a formula cannot choose which row and column to write its output, we have to set the formula in every cell where we suspect there might be a value.
In the above sample diagram, we copy the formulas from row 2 to row 9. However, row 9 shows “…” indicating that the list was completed by row 8.
Let us look at how to determine the value for Day, Task and Allocated Duration.
For ease of description, I have created the following Excel Names:
WorkList: =A2:A5 in the source data.
WorkDuration: =B2:B5 in the source data
While creating the Gantt-like view earlier, we were able to take advantage of the static “Day” and “Task” values to determine the Remaining Duration, Available Duration, etc. Since we now have to determine all three values (Day, Task, Allocated Duration), we will need some “helper” data.
We will add a column alongside the source data that shows the cumulative duration (for reasons that will become clear shortly), as shown in the following diagram:
Cumulative Duration is calculated as the sum of all durations up to a given row.
- For example, in cell C2, the Cumulative Duration is 40.
- In cell C3, the Cumulative Duration is 40+20=60
- And so on.
For ease of referencing, we will use an Excel Name called CumulativeDuration =C2:C5.
Let us look at why we need the “CumulativeDuration” helper column:
The circular logic problem
In order to determine the durations already allocated for a given day, we will need to know which Day it is.
We also need to know which Task we are trying to calculate the duration for.
So… do we calculate the Day or the Task or the Duration first?!! As you can imagine, that will soon land us in some circular logic.
Some helpful observations about the output:
- In column C of the output (on worksheet FlatList), the sum of allocated durations adds up to the total duration for all tasks. (No surprise here!)
- If every task had duration equal to the MaxHrsPerDay, you would have the same duration value for all days. (Not surprising, but interesting!)
- In other words, you could think of the Allocated Duration column as the total duration for all tasks, allocated MaxHrsPerDay at a time.
- Now we need a way to iterate through the duration values one at a time and account for the durations already processed. In other words, each value needs to contain all of the previous values. Welcome to an array of the cumulative durations!
- For example, in the cumulative array “{40;60;65;80}”, the value 60 already includes the previous value 40 in it. This allows us to subtract all durations allocated up to a given row, to get the duration value that is remaining to be allocated.
- Since Excel is good with numbers, we will base the calculation for AllocatedDuration and Tasks on the Duration values.
- By calculating the two values separately, we avoid the circular logic.
Let’s now look at the formulas for Day, WorkItem and AllocatedDuration.
It would be easier if we looked at the formulas in reverse order, starting with AllocatedDuration, then WorkItem, and finally Day.
Formula for “AllocatedDuration”
Enter the following formula into cell C2, ending with Ctrl+Shift+Enter, as shown in the following diagram:
=IF(SUM(C$1:C1)>=SUMPRODUCT(WorkDuration), “…”,MIN(INDEX(WorkDuration, MATCH(TRUE, CumulativeDuration-SUM(C$1:C1) > 0, 0)) – SUMIFS(C$1:C1, B$1:B1,B2), MaxHrsPerDay-SUMPRODUCT((A$1:A1=A2)* IF(ISNUMBER(C$1:C1), C$1:C1, 0)))) Ctrl+Shift+Enter
Let us look at the formula closely (using the formula in row 2):
- SUMPRODUCT((A$1:A1=A2)* IF(ISNUMBER(C$1:C1), C$1:C1, 0)) -> This calculates the sum of all allocated durations up to the previous row, where the Day = current row’s day. Please note the use of absolute and relative references. They allow us to expand the range as we go down the rows, while remaining anchored to the first row.
– Since this is the first data row, C$1:C1 returns “Allocated Duration” and the ISNUMBER() function returns FALSE, and consequently, the IF() function returns 0.
– A$1:A1 returns “Day”, and the test A$1:A1=A2 returns FALSE. Please note that in this case, it does not matter whether A2 has a value in it, whether it has the value 1, etc.
– SUMPRODUCT() provides the result of FALSE * 0 = 0
- MaxHrsPerDay – SUMPRODUCT((A$1:A1=A2)* IF(ISNUMBER(C$1:C1), C$1:C1, 0)) -> This calculates the difference between maximum duration available for a day and the sum of durations allocated for the current day. In other words, it calculates the available duration for the current row’s day.
– In this example, the calculation results in MaxHrsPerDay (24 in our example) – 0 = 24
- SUMIFS(C$1:C1, B$1:B1,B2) -> This calculates the sum of all allocated durations for the current row’s task. Since B$1:B1 is the text value “Work Item”, the SUMIFS() returns 0. Again, it does not matter if B2 is blank or has a value like “TaskA”, since Excel correctly evaluates the condition whether B$1:B1 equals B2.
- SUM(C$1:C1) -> This calculates the sum of all allocated durations up to the previous row.
- CumulativeDuration — SUM(C$1:C1) -> CumulativeDuration evaluates to {40;60;65;80}. SUM(C$1:C1) evaluates to zero. As such, the expression evaluates to {40;60;65;80} – 0, or {40;60;65;80}.
– If we look at the calculation for this expression in cell C3 (the expression would be “CumulativeDuration—SUM(C$1:C2)”), we would get the result of {40;60;65;80} – (0+24) = {16;36;41;56}. (As you know, subtracting a scalar value from an array results in an array with each value reduced by the scalar value.)
– If we look at the calculation for this expression in cell C4 (the expression would be “CumulativeDuration—SUM(C$1:C3)”) , we would get the result of {40;60;65;80} – (0+24+16) = {0;20;25;40}
– As you can see, each successive calculation reduces the CumulativeDuration array by the amount of hours already allocated. By reducing the CumulativeDuration array in this fashion, we ensure that we do not “double count” a duration.
– If a value in the array evaluates to zero, it means the corresponding duration has been fully allocated. (In cell C3, the first value in the array is zero, indicating that the original 40 hours has been fully allocated.) We will put this knowledge to good use in the next expression.
- MATCH(TRUE, CumulativeDuration—SUM(C$1:C1) > 0, 0) -> The expression CumulativeDuration—SUM(C$1:C1) > 0 evaluates to ={TRUE;TRUE;TRUE;TRUE} because all values are greater than zero. By performing a MATCH() for TRUE, we are able to find the first location in the array that has a non-zero value.
– If we look at the result of this expression in cell C3, we get {16;36;41;56} > 0 = {TRUE;TRUE;TRUE;TRUE}
– If we look at the result of this expression in cell C4, we get {0;20;25;40} > 0 = {FALSE;TRUE;TRUE;TRUE}
– As you recall, the zero values (or FALSE) correspond to the durations that have been fully allocated, whereas, the non-zero values (or TRUE) correspond to the durations that have NOT been fully allocated.
– It is helpful to note that MATCH() returns the LOCATION of what it finds. As such, the returned location is that of the first duration value that has not been fully allocated! Since the CumulativeDuration array is the same size as the WorkDuration array, we will be able to put this returned location value to good use in the next expression.
- INDEX(WorkDuration, MATCH(TRUE, CumulativeDuration — SUM(C$1:C1) > 0, 0)) -> By using the location value (of the first duration value that has not been fully allocated), we find the corresponding original duration value from the WorkDuration array.
– As we saw earlier, the expression “CumulativeDiration – SUM(C$1:C1)” reduces the CumulativeDuration by the duration values allocated to that point. However, the resulting array could have partial duration values as well. By referencing the corresponding duration value from the WorkDuration array, we ensure that we retrieve the original (full) duration value that was to be allocated.
- MIN(…) -> This expression calculates the value of MIN([Hours Not Allocated For that Task], [Hours Available for that Day])
– [Hours Not Allocated For that Task] is returned by INDEX(WorkDuration, MATCH(TRUE, CumulativeDuration—SUM(C$1:C1) > 0, 0)) – SUMIFS(C$1:C1, B$1:B1,B2)
– [Hours Available for that Day] is returned by second half of the MIN() expression: MaxHrsPerDay—SUMPRODUCT((A$1:A1=A2)* IF(ISNUMBER(C$1:C1), C$1:C1, 0)).
– So, we essentially got back to the logic we started from, which is the same logic we used for creating the Gantt-like view as well.
- The remaining portion of the formula (the IF() check) determines if all of the hours have been allocated. If all hours have been allocated, it returns “…”.
– SUMPRODUCT(WorkDuration) -> This expression calculates the total of all work duration values. In cell C2, it evaluates to SUMPRODUCT({40;20;5;15}) = 80
– SUM(C$1:C1)>=SUMPRODUCT(WorkDuration) -> Determines if the sum of durations allocated up to that point is greater than the total for all durations. (Since this is part of an array formula, you could also use the SUM function in place of SUMPRODUCT. But I am partial to the SUMPRODUCT function!! So, unless you are in a competition where the winner is determined by the shortest formula, feel free to use either one!
Formula for “WorkItem”
Enter the following formula into cell B2, ending with Ctrl+Shift+Enter, as shown in the following diagram.
=IF(SUM(C$1:C1)>=SUMPRODUCT(WorkDuration), “…”,INDEX(WorkList, MATCH(TRUE, (CumulativeDuration-SUM(C$1:C1)) > 0, 0))) Ctrl+Shift+Enter
You are already familiar with most of the formula components since you saw them in the formula for AllocatedDuration. The only difference is that in this formula, we are returning a value from WorkList. (i.e. we locate the position of the first non-zero duration in CumulativeDuration array, and since that array is the same size as the WorkList array, we are able to find the first Task that has not been fully allocated.)
Formula for “Day”
Enter the following formula into cell A2, ending with Ctrl+Shift+Enter, as shown in the following diagram:
=IF(SUM(C$1:C1)>=SUMPRODUCT(WorkDuration), “…”, MAX( N(A1) + (SUMIFS(C$1:C1, A$1:A1, A1)>=MaxHrsPerDay), 1)) Ctrl+Shift+Enter
Let us look at the formula in detail (using the formula in row 2):
- SUMIFS(C$1:C1, A$1:A1, A1) -> This expression calculates the sum of all durations (in column C) where the Days (in column A) equal the previous day.
– In cell A2, this expression evaluates to “SUMIFS(“Allocated Duration”, “Day”, “Day”)” = 0. (Excel smartly ignores any non-numeric values in the first argument.)
– In cell A3, this expression evaluates to “SUMIFS({“Allocated Duration”;24}, {“Day”;1}, 1)” = 24.
- SUMIFS(C$1:C1, A$1:A1, A1)>=MaxHrsPerDay -> This expression checks if the sum of all durations where the Days equal the previous day is greater than or equal to MaxHrsPerDay.
– In cell A2, this expression evaluates to FALSE
– In cell A3, this expression evaluates to TRUE
- N(A1) -> This expression returns the numeric value for its argument. Since N() returns zero for any non-numeric arguments, we use this function to return zero for the heading (“Day”) in A1. (Any numeric values are returned as is.)
- MAX( N(A1) + (SUMIFS(C$1:C1, A$1:A1, A1)>=MaxHrsPerDay), 1) -> The first argument of the MAX function “N(A1) + (SUMIFS(C$1:C1, A$1:A1, A1)>=MaxHrsPerDay)”returns the next increment for day, if the previous day has been fully allocated. Otherwise, it returns the same value as the previous day.
– In cell A2, this expression evaluates to MAX( N(“Day”) + (SUMIFS(“Allocated Duration”, “Day”, “Day”)>=24), 1), which evaluates to MAX( N(“Day”) + (0>=24), 1), which evaluates to MAX( 0 + (FALSE), 1), which finally evaluates to 1.
– In cell A3, this expression evaluates to MAX( N(1) + (SUMIFS({“Allocated Duration”;24}, {“Day”;1}, 1)>=24), which evaluates to MAX( N(1) + (24>=24), 1), which evaluates to MAX( 1+ (TRUE), 1), which finally evaluates to 2 since 1 + TRUE = 2.
Download
You can download a copy of the above file and follow along, Download Here – Excel 2007-2013.
Final Thoughts
While we used the same basic logic for both output options in this article, there are probably many other ways to tackle the age-old problem of production scheduling.
I would love to hear about some of your ideas, as well as ways to extend the concepts described here.
In the meantime, I wish you continued EXCELlence!
Sajan.
Other Chandoo.org Posts related to Scheduling
Here at Chandoo.org you can find the following related posts:
http://www.chandoo.org/wp/2010/11/18/scheduling-variable-sources/
http://chandoo.org/wp/2009/06/16/gantt-charts-project-management/
http://chandoo.org/wp/project-management-templates/gantt-charts/
Thank You
This was Sajan’s second post at Chandoo.org and so a special thank you to Sajan for putting pen to paper to describe the technique here.
You may want to read Sajan’s first post here or thank him in the comments below:
Formula Forensics “The Series”
This is the 31st post in the Formula Forensics series.
You can learn more about how to pull Excel Formulas apart in the following posts: Formula Forensic Series
Formula Forensics Needs Your Help
I need more ideas for future Formula Forensics posts and so I need your help.
If you have a neat formula that you would like to share like above, try putting pen to paper and draft up a Post like Sajan has done above or;
If you have a formula that you would like explained, but don’t want to write a post, send it to Hui or Chandoo.






















64 Responses to “How do you make charts when you have lots of small values but few extremely large values? [Debate]”
A wonderful surprise to see a question I asked in your forum appear as an article in your website. Many thanks for helping me tackle this problem, and also for considering it something that merits an actual article. I look forward to reading the replies.
Cheers
Mike (now happy)
How about a dot plot with a logarithmic scale? That way you can see the outliers, and you are not visually comparing the length of the bars.
How about putting the outliers on the secondary axis? When I use this technique I'll often use a distinctive color for those columns and then draw a box over the secondary axis in the same color so that the reader associates, for example, green columns with the green axis on the right.
@Mike... You are welcome. It was an interesting question, and I wanted to see what our readers thought about it.
@Joe.. interesting idea. A dot plot or scatter plot is infact better as viewers do not measure the height to compare values.
As with all graphs, you need to know your audience. Taking logs is the preferred method for audiences that will understand what you are doing. However, I do not use bar graphs with log scales. I use graphs judged by position rather than length such as line graphs or dot plots. In Creating More Effective Graphs, I recommend either taking logs for more technical audiences or using complete scale breaks for others. By complete scale break I mean two separate panels; not little parallel lines that can easily be missed.
Here is how I would approach it:
http://a.imageshack.us/img405/823/largesmall.png
I usually don't like having multiple axis charts, but I sometimes use them to show charts with two colors for a series (for example with one color for positive and another for negative). I think the same principle can be used here. Then I used a custom format, color code, and a text box to hopefully clearly convey the message that the two data points are anomolies.
It seems to me that understanding the seasonal sales peak in Dec and Jan is the most important part of the chart... particularly in relationship to the 10 other months where nothing much happens. This firm must need to plan staffing very carefully. I would go with option one and add bar value labels (so you can see that there were sales in the early months of the year) plus a 2nd y-axis plot with a cumulative percentage curve starting at feb and going to jan). Logging this data series completely destroys the point of the chart. Both the log and Option 3 are way too geeky. Option 4 looks like the person who released the information just didn't care or didn't get the point that could have been made.
I agree with Naomi Robbins. Use a full break with dots instead of #3 or #4. Never use a break (full or partial) with bars because of how bars encode data. If your audience understands logarithms, use lines or dots and a log scale.
I am awaiting Microsoft to allow multiple Y axis indexes to use with a line chart.
@Tiffany I didn't notice your post before, but you basically summarized exactly how I approached it. :o) Scale is still a question when using the approach, but as long as it is clearly delineated, I think it makes sense.
Nice tricks Chandoo.
This reminds me of a recent situation that came to me where there were 50 x axis points (for various projects) so the chart had become too lengthy.
What can be done with such charts where there are too many labels and the distribution of chart does not come proper?
Working in the pharm industry this was often a common occurrence when looking at growth trends of new products on a quarterly or MAT basis. As senior management where generally interested in seeing both actual values and growth I tended to opt for your example four with growth charts. The growth was somewhat meaningless for these new products and their actual sales showed how well they were performing when looked at on a rolling quarter basis.
dont forget Line charts too for multiple series:
few series on lower-range values, but few on higher-range values.
I'd prefer to split them apart, but if you have 12 series or more, it'd be tedious to show mutiple charts on mgmt report (summarize in 1-2 slides).
@Nimesh
I would generally group them (eg North, Sth, East, West or whatever is appropriate for your business) and chart or only show the top 5-6 performers and then group the rest (Others)
Depends on the data. Right now I have a couple of regular charts with several large contributors and a handful of tiny contributors.
I've taken to top 5 values plus a 6th with 'other'.
Another problem with 2-axis diagrams is when you take B&W prints. It gets harder to interpret the chart when only one type of visualization is used (e.g. only bars not bar and line).
Log-scales are also problematic as we do not often encounter them, Say, you have 8-10 graphs in a report and in only one graph this large value problem is there. So if you use a log scale for this single graph, the graph itself become a outlier.
I would prefer breaking the Y-axis by hi-tech way (Pelter, Tushar Mehta) if I have time or do it by pasting two charts as suggested here.
Possibly, a macro can be written to place two charts together like this??
Naomi's suggestions are excellent: for audiences that are used to interpreting logarithmic axes, take the logarithm; for all other audiences, use a complete scale break, such that you have separate panels. This is, in fact, the advice offered by William Cleveland in "The Elements of Graphing Data," (pp 103 - 104).
I would also suggest using dot plots instead of bar graphs. Dot plots are like a sideways bar plot, using dots instead of bars. They provide the following advantages: more space for labels; more compact arrangement; clearly mark out the value being plotted, rather than the range from the axis to the value (less non-data ink). Here's a nice introduction: http://www.b-eye-network.com/view/2468.
Since a picture is worth a thousand words, I've created an example of a dot plot with a complete scale break:
http://yfrog.com/b9rplot2p
For the curious, this was done in R (http://www.r-project.org), with the following code:
month <- c("Feb 09", "Mar 09", "Apr 09", "May 09", "Jun 09", "Jul 09", "Aug 09", "Sep 09", "Oct 09", "Nov 09", "Dec 09", "Jan 10", "Feb 10")
sales <- c(200,300,200,300,200,300,350,400,450,1200,100000,85000,450)
SalesRange <- shingle(sales, intervals = rbind(c(0, 5000), c(5000, 200000)))
salesdata <- data.frame(month = month, sales = sales, order = c(1:13), SalesRange = SalesRange)
salesdata$month <- reorder(salesdata$month, rev(salesdata$order))
dotplot(month ~ sales | SalesRange, data = salesdata, strip = FALSE, layout = c(2, 1), levels.fos = 1:50, scales = list(x = "free"), between = list(x = 0.5), par.settings = list(layout.widths = list(panel = c(2, 1))), xlab = "Sales (US$ x 1000)")
Tom - Thanks for your kind words and for recommending my article. Thanks also for showing the complete scale break. I was not at my computer when I commented yesterday and was just going to draw a complete scale break when I saw that you already did it.
Hi Guys,
Lot's of tricky answers... But for my money the response saying "know your audience" is on the money.
What I usually suggest, and most times I provide both examples and let them make a decision:
1. multiple axis - only when it still looks ok (not off the chart so to speak)
2. Do a "% of total" graph, with a data table containing the actual values underneath the chart - remember Excel has other functions that do compliment each other. charts/ tables/ pivots etc
Thanks and good to see so many responses to this one.
Several people have suggested using double-Y axes (also called dual scaled axes). My experience is that readers sometimes miss the second axis, sometimes are confused by the second axis, and often place meaning where none exists (e.g., where lines cross). Double axes are especially dangerous when both represent the same variable (number of sales or dollars or whatever.) See Wainer's "Visual Revelations", http://amzn.to/cxKuMv, or my "Creating More Effective Graphs", http://amzn.to/9qc6Mq, for examples of double-Y axes that deceive. Stephen Few wrote an article that describes my feeling about dual axes as well. See his article at http://bit.ly/4ycwgZ.
Christian indirectly raises a good question: with such a small data set, is a graph really needed, or the best way of presenting the data? The table at the top of the page is more compact than any of the graphs presented and, arguably, easier to interpret. It certainly avoids the problem of presenting such highly skewed data on a graph.
Although I agree with Tom and Christian that a table might be easier to interpret than a graph for this particular data set, I think that the issues raised in this post go far beyond this one data set.
Hi,
I have a similar situation, but in this case with the x-axis. How do I represent a line chart of 3 Y-axis variables against the x-axis that has values with large disparity (0.001-100 mg/ml) ?
Thanks.
Excellent points everyone. Thank you for thoughtful discussion Naomi.
Few additional points:
(1) Log axis: Use it with caution. I have never really seen a log axis graph that is effective outside scientific publications. Your audience need to be in right mindset to digest logs.
(2) Chart vs. Tables: My intention in this post is to discuss how we can chart when data has wild ups and downs. The data set I used is almost meaningless. Of course, if I want to make a chart for the above data alone, I would rather go out and party as the sales skyrocketed during holidays.
(3) Secondary Axis: While this approach is creative, I would avoid it as it can be confusing. Usually secondary axis works best when you have 2 diff. series (sales vs. profits, market share vs. profits etc.) But if you have been using this approach and your readers already know it, then you are golden.
(4) Complete split vs. axis split: I agree with Naomi that it has to be a complete split. I was too lazy to work out the mechanics of complete split thru excel. (Good work by Tom showing how to get full split thru R). You should go thru Jon's tutorial on full axis split in excel here: http://peltiertech.com/Excel/Charts/BrokenYAxis.html
(5) Dot Plots: I think these are even better than bars. Alas, there is no native support for dot plots in excel and you have to do some bit of circus to get them right.
Again, thanks to Naomi, Tom, Christian, Rajdeep, Dan, Hui, Sanford, Nimesh, Mark, David, Bill, Michael, Jason Sanford and all others for sharing what you think about this issue. Your comments are a testimonial to what our community is all about... 🙂
[...] Highly Skewed Data Recently Chandoo.org posted a question about how to graph data when you have a lot of small values and a few larger [...]
Chandoo,
I created an Excel graph of how I think this exercise should look. Since a picture is worth at least a "crore" of words, I put a capture of it on-line at
http://img839.imageshack.us/i/chartswithsmalllargeval.png/
Which brings up an area that I hope you could help enlighten me on with a how to article... specifically, what is the best demonstrated practice for sharing stuff online (pictures and spreadsheets) when leaving non-text comments on blogs? For instance, I am having a lot of trouble getting Microsoft Skydive to work. Also, how do these web-based sharing areas control security (for instance, how can I create a "public" area and a "private" areas)?
bill
Bill
Have a read of this post for some File and Picture sharing sites
http://chandoo.org/forums/topic/posting-a-sample-workbook
Each method is different and some allow Private and Public areas.
Generally you only get Private areas when you pay.
Always remember to anonymise data, especially names or data if it is commercial in nature.
I just came across an old post on another blog that discusses this topic and might interest readers here. Some of you already commented on this post.
Sorry - I omitted the link. http://www.datadrivenconsulting.com/2009/12/charting-data-with-one-disproportionally-large-value/
A post on another blog that was written after this one discusses this topic and might interest readers here. Some of you already commented on this post. http://chandoo.org/wp/2010/08/20/charts-with-small-and-large-values/
Hello Chandoo,
I accidently landed on your website but I must say that it has changed my entire analysis skills.....of course, have made these skills much better. So first of all, I would like to thank you for all the information you have posted on your website.
I have one question: what should we do when we have huge difference in values? I mean I have the data for which has the values like 5000, 50000, 100000, 500000, 2000000, 10000000 and 50,000,000. I need to show all the values on one chart...what should I do? but several small values are not visible as there are several large values and several mid level values. I hope i am clear enough. Can you please help me?
I had a similar problem, but with a lot more data points. In the end, a historgram solved it for me.
@Gurpreet KANG
Is your audience suitable that you could use a log plot? If you data escalates exponentially like your example, might be a good use.
Other thoughts would be to have a xy graph with a line (but wound show near 0 early on), or change your numbers to a percentage increase/decrease from the previous?
As alreayd discussed, knowing what you want to present (actual values? change in sales? human population?) and what message you want to give are important things to consider.
I use a mac and have excel 2008. I have a huge database and need to find and remove duplicate entries
@Shailesh have a look at Google Refine. It's a free, downloadable framework for cleaning data that runs in a browser on any OS. One of its many capabilities is to find duplicate data, even when the data is not identical (i.e. it deals very well with typos).
i always use option 4,
1. my audience will have difficulty with log scale, or axes with different scale, this eliminate option2 & option micheal#6
2. cut/broken exist tamper with actual visualisation, this eliminates option 3
3. option 1 tampers with analysis of common trend vs offshoot data, with improvement by option bill#25, this is good, but it requires reading all the small values.
4. plot graph by tom#18, complicates matter. Instead of bars, they now see dots, which is not friendly to older people (no offense here), a line might help but it clutter the space for data labelling
So i use option 4, fix the y-axis for the majority data & label the offshoot data to let my audience's imagination do the visualisation.
best solution for me, but might be not for all of us... 😉
chandoo,
How about method #6 I learned from my my boss, scaling the large values X times. see it http://twitpic.com/4p2bsl What do you guys think?
TalKohlberg (#36), the label approach has all of the disadvantages of the split-axis method, with the added disadvantage that there is almost no visual cue that the longer bars should be visually distinct from the shorter ones. In Cleveland's terms, pattern perception and table lookup are completely at odds, while basic requirement for any graph is that the two operations need to compliment each other.
The only reason that I can see for making a graph like this is that it is easier to create in Excel than many of the alternatives.
xsaed (#35), you raise a good point that we have not yet considered: that the "normal" rules may not apply for all audiences (or mediums) due to visual acuity. There is a parallel here with typography: the normal rules for maximizing readability (e.g. using serif typefaces and certain type sizes) are modified in print for the visually impaired. Though there are few good studies, the rules may also need to be modified for on-screen displays, where the resolution is ten times lower than in print.
It would take some careful studies and consideration to identify the best solution for low-resolution displays and the visually impaired.
Tom (#37), Thank You for taking time to comment. There is no split axis in this method # 6, the scaling approach. I do see what you mean about distinguishing the scaled bars so I made scaled bars darker and legend indicating scaling factors in red, see it here http://twitpic.com/4qs8iy Is it easier to see now?
TalKohlberg (#39) It's always fun to discuss such matters and see new approaches; thank you for the opportunity.
Highlighting the three bars helps to bring the viewers attention to them, but does not fix the underlying problems with the graph design. I do like, however, that you used shades of color rather than different colors to do the highlighting; people often forget that a sizable portion of their audience will be unable to see distinctions based only on color.
When a person reads a data graph, they are performing two distinct operations: what Cleveland calls "pattern perception" and "table look-up." Pattern perception happens first and is an interpretation of the geometric arrangement of the graph (relative distance, size, position, etc.). So pattern perception tells us that the third bar from the right is biggest; the smallest bars cluster on the left half of the graph; some bars are darker than others and there's some extra labels over them; etc. Table look-up is where the user reads the labels and axes to gain a quantitative understanding of the data underlying the graphical display. Here is where we learn that the third bar from the right is "100x" and the short bars on the left are in the range of 200 - 300.
For a graph to be effective, pattern perception and table look-up have to tell the same story; they have to be fully complimentary. In addition, a graph should only show those elements that are needed to support these two operations. Extra labels or fancy elements ("chart junk") degrade the reader's ability to correctly perceive the patterns in the graph and to read the critical labels.
So, applying the above model to your graph, we have the problem that pattern perception tells a very different story than table look-up. To overcome this problem, you are adding extra labels to assist table look-up and highlighting some bars to try to draw the reader's attention to the significance of those extra labels.
A better solution would be to have a graph where pattern perception and table look-up are fully complimentary, instead of conflicting. For this reason, an axis break is actually better, and a full axis break is better still, because a full axis break is more obvious for pattern perception than the more subtle axis break. See my comment #18 for an example of the full axis break. Log scales are better still, if your audience has the skill to correctly interpret them.
Finding the best graph for an audience is a difficult problem, and there is no one-size-fits all solution, as the lengthy discussion on this post has demonstrated. We can only learn, try out new ideas and improve.
The labeling approach suffers from a further limitation: it doesn't scale well. For such a small data set, it's fine, but if we have dozens of extreme values, it's a lot of work for you to construct the graph and table-look-up and correct interpretation become a real burden for the reader.
If you are interested in learning more, I highly recommend Edward Tufte's book, The Visual Display of Quantitative Information, and William Cleveland's more technical book, The Elements of Graphing Data.
Tom (#40) Thanks for the dialogue. It looks like it is just you & I on this topic now. On balance, Chandoo and other posters may still find this scaling approach http://twitpic.com/4qs8iy as worthwhile method# 6. It is there for all to judge.
Hi Chandoo,
well, i'm from morocco, every day, I learn new tips. Keep rock, you inspired me
[...] The Learning As You Go blog has a nice article about plotting highly skewed data, at Graphing Highly Skewed Data. The article covers use of secondary axes (a bad idea), breaks in the axis (also a bad idea), logarithmic axis scales (okay if users understand log scales), and multiple charts. This article is a response to a discussion started by Chandoo in How do you make charts when you have lots of small values but few extremely large values? [...]
[...] ♥ Smart chart labels♥ How to make charts when you have too much data [...]
I like the trick of creating two charts.
This is one of the best places for learning about XL.This site has improved me a lot and helped me getting promotion in my job.Thanks a lot to chandoo.
I'd make two charts and sit them side by side and clearly state why you've had to do this. I'd never split the axis in case it misleads.
TalKohlberg (36, 39, 41)
I think your method of scaling some bars is very dangerous. You are hoping people are able to let their conscious mind (a) notice the scaling factors, (b) do the math, and (c) override their preconscious interpretation of the plotted points. You'd be better served with a simple table than with a chart that doesn't show precognitively what it shows "tabularly". These are Cleveland's terms “pattern perception” and “table look-up” (thanks for the Cleveland terms, Tom).
I covered this in my blog recently, in Broken Y Axis in an Excel Chart. Contrary to what the title implies, I advise not to break an axis, but to show the data in two separate panels.
i think you could also represent the ten months of non-interesting values below a certain threshold as a single bar (size) and of course label those as such. This single size for the uninteresting low sales is a nominal height that is big enough to see but, pales into significance compared to the two big months sales, and still reads ok as a constant axis scale chart. Surely there's not much point in showing the exact values on the chart for the minute differences during slow sales months?
Bhai thanks a lot.
My problem is solved . thaks once again
TQ it is very helpful for beginner likes me
In my industry we come upon this often as a very large percent of our revenue comes from sales that occur in the same month.
I've often found that what people really want to know is "How well are we doing?", and that this question can be answered just as well by using data comparisons that eliminate the problems with large outliers.
For example:
1) If Sales are seasonal, show Sales as a percent of the Sales for the same month of the previous year. If your business is selling air conditioners, it's likely everyone already knows your company sells more of them in the summer. This approach can help your audience understand if Sales are improving relative to a comparable prior period.
2) Otherwise, show the YTD Sales as a percent of the YTD or annual Sales goal (this is similar to the Project Management concept of a "burndown chart"). Again, another way to let people know whether or not Sales are "on track".
I would agree with Jon that if the raw Sales data are required, showing the data as a table is far less likely to confuse your audience. However, the approaches above may help provide a quick visual impression for those in your audience who are not "numbers" people.
Good day, everyone!
I am new to charting and I am currently having trouble presenting Revenue to Expense Ratio (R:E) and number of accounts in one chart. I have 138 institutional clients to compare across. R:E ranges from 0.01 to 806.48 while number of accounts ranges from 1 to 3,280.
Appreciate your idea please on how to effectively and efficiently chart these two metrics across the 138 institutions.
Thank you in advance.
@Dyan
Obviously there isn't an easy way to compare 138 customers across such a large range of ratios
What about:
Showing the top 10 or 20 in one chart
Showing the bottom 10 or 20 in one chart
Showing the top 10 or 20 by revenue in one chart
Showing the bottom 10 or 20 by revenue in one chart
Can they be grouped or filtered in Zones or Regions or Divisions or States etc
Hi, Hui.
I really appreciate that you took time to answer my query.
I will heed your advice and go for the last reco on grouping my data and have a separate chart for the groups.
Thank you so much! =)
Dyan
Hi Chandoo
That was awesome! Thank you
Thank you so much for this, very helpful
Great discussion! I think I like Jon's panel charts best as well as the bottom 2 magazine style charts on the Data Driven website. My use case is for a dashboard full of the same style charts, so while these are great, I can't have 2 different formats on my page, and quite frankly, they're just too big!
I tried the approach to add a data table to my charts and found out I was in for a new set of headaches. I seem to remember Jon having lots of examples of labeling magic, and I stumbled on what may be my answer: http://peltiertech.com/Excel/ChartsHowTo/OutlierLabels.html
This is very similar to Chandoo's #4, but with Excel handling the dirty work in a very elegant way. For dashboard use where the chart needs to be small, I think this may be the best option.
Here's my attempt:
http://chandoo.org/forum/attachments/compliance-chart-v2-jpg.3585/
Ok this post is 4 years old but here is a great technique once again from the Frankens team, which offers another solution
https://sites.google.com/site/e90e50fx/home/broken-line-panel-charts
Hui...
Team Chandoo,
Nicely put examples. Helped me in one of my reports. Thanks a lot!
Very useful.
Thanks