This post is from GuestBuster Jeff Weir in our Chart Busters series.
Note: This post is slightly longer than usual, but worth every word. Just get a cup of coffee and soak in this visualization goodness. (Also, click on any image to see its full version.)
Over at the FlowingData blog, they’ve been talking about this pretty slick looking Choropleth Map that shows how Medicare returns vary across the United States:
The above shows total Medicare reimbursements in 2006, either by Hospital Referral Region or by State, depending on the radio button. Using the dropdown box, you can change it to this:
…which is how the data looks if you overlay it on a Giraffe. Oops, I forgot to rotate it before saying that. Bear with me a moment…
There. See the Giraffe now? Good.
A picture is worth a thousand words, or so they say. But is a Choropleth worth the many line charts and clowns that you could squeeze into the same valuable screen real estate? Let’s find out, by evaluating what this particular chart does well, and what it does poorly, and whether other charting methods might better convey its information.
Words and music.
Right off the bat, there’s a simple way that the authors could improve this chart. While they include a description below the chart to point out what the data is, and where it came from, they miss something just as important…what they concluded from all of this. So before we consider adding – say – bullet graphs, let’s consider adding some bullet points. A few sentences can tell readers important stuff that would otherwise remain hidden in an undownloaded PDF report. Insights like:
- Care is often better in low-cost areas.
- Growth in returns is only partly explained by advancing technology, and
- Differences in growth rates across regions seem largely due to discretionary decisions by physicians that are influenced by the local availability of hospital beds, imaging centres and other resources, and by a payment system that rewards growth and higher utilization.
Straight off the bat, this would make the graph a better graph…without even messing with its form.
But mess we must…
…because lurking below the chlorophyll green of this Choropleth Map are a few serious charting oversights. Ready? Let’s check ’em out.
Scale? Fail!
First, check out the legend.
Crikey…its bands are as discrete as Bruno. Its scale is about as even as my temperament. It varies about as much as =RANDBETWEEN(PaydayBankBalance, UsualOverdraft).
If you fire up Excel and look at the spread covered by each range, you see just how arbitrary the different price bands are:
Whoa…the spread of that $9k to $16k band is nearly 15 times larger than two of the other bands. That can’t be good, can it?
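If you'd rather check the band spreads outside Excel, here is a minimal Python sketch that computes the width of each legend band and how it compares with the narrowest one. The band edges are placeholders made up for illustration (only the $9k to $16k band comes from the legend discussed above), and the function names are my own, so swap in the real boundaries before drawing any conclusions.

```python
def band_widths(edges):
    """Return the width of each band defined by consecutive edges."""
    return [hi - lo for lo, hi in zip(edges, edges[1:])]


def width_ratios(edges):
    """Return each band's width relative to the narrowest band."""
    widths = band_widths(edges)
    narrowest = min(widths)
    return [w / narrowest for w in widths]


# Hypothetical band edges in dollars. Only the 9,000 to 16,000 band is
# taken from the article; the other edges are invented for illustration.
example_edges = [4000, 6500, 7000, 7500, 9000, 16000]

if __name__ == "__main__":
    bands = zip(example_edges, example_edges[1:])
    for (lo, hi), ratio in zip(bands, width_ratios(example_edges)):
        print(f"${lo:,} to ${hi:,}: width ${hi - lo:,} ({ratio:.1f}x the narrowest)")
```

A legend with equal increments would give a ratio of 1.0 for every band, which makes this a quick sanity check for any banded scale.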
Nice profile
If you were to graph the financial spread of each group against the aggregated number of Hospital Referral Regions that fall within each spread, you get something like a histogram. The sizes of these bands differ about as much as the number of performers on stage at a Bob Dylan concert in 1964 compared with 1974. See for yourself:
Oops, wrong graphic. Try this:
Normally histograms have equal widths for each band, but here I want to highlight just how unequal the bands used are. Plus, this lets us regroup the data into evenly spread $1k bands, and overlay it on the first distribution, to see how it compares. Here’s one that I prepared earlier, with the red line as the regrouped data…
Vastly different picture, isn’t it? The red is kinda like Data Pig’s heart rate before he eats chocolate covered bacon on Saturdays, and the blue is how his ECG would look when he’s in the ambulance, on the way to the hospital.
This makes it very hard to answer that important question “…compared to what?” With such different sized bands, how can we compare one to another? How can we be sure that the distributions within each band will even allow us to?
For instance, take the highest band spread of $9k to $16k: without any further information to go on, we might assume that the median (i.e. middle) value for districts in this category is midway between the $9k to $16k boundaries, like this:
But that’s like assuming that Simon and Chartjunkle (oops, Garfunkel) have equal talent. We’d be wrong. Very wrong. In actual fact, there are only three data points to the right of our guessed median line. And as for the 55 hospital regions in Group Five that fall to the left of it…well, they all get tarred with the same brush as those worst three performers. The actual median for this group is a lot further left, as shown below:
This means that over half the data in this 5th band actually falls much closer to the far left of the graph than to the far right of the same group it’s been placed in.
You can see this better if you add a one-dimensional strip plot above the graph, which gives an idea of where the 300-odd values fall within the entire range:
Whoa…looks like we’ve got a few outliers to contend with.
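The two ideas in this section, regrouping into even $1k bands and comparing a band's midpoint with the actual median of the values inside it, can be sketched in a few lines of Python. The sample values below are invented to mimic a skewed top band (most values near the low edge, a few far out); they are not the Medicare figures, and the function names are just illustrative.

```python
from collections import Counter
from statistics import median


def band_midpoint(lo, hi):
    """Return the value halfway between a band's lower and upper edge."""
    return (lo + hi) / 2


def regroup(values, width=1000):
    """Count values per equal-width band, keyed by each band's lower edge."""
    counts = Counter((int(v) // width) * width for v in values)
    return dict(sorted(counts.items()))


# Made-up values for a $9k to $16k band: clustered low, with a few outliers.
sample = [9050, 9100, 9200, 9300, 9400, 9600, 9900, 10400, 13500, 15900]

if __name__ == "__main__":
    print("Guessed median (band midpoint):", band_midpoint(9000, 16000))
    print("Actual median of the values:   ", median(sample))
    print("Counts per $1k band:           ", regroup(sample))
```

With data shaped like this, the midpoint lands thousands of dollars to the right of the true median, which is the same trap the wide top band sets for the reader.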
What a State we’re in…
This seemingly arbitrary ‘bucketing’ effect is exacerbated when aggregating the different hospital regions into State-wide totals. Except this time regions are being penalised by arbitrary geographical boundaries, as well as the arbitrary financial ones above.
Take Texas, for example. Aggregating everything up to the State level, Texas appears in that highest band. Yet at the Hospital Referral Region level, one third of its 22 different hospital regions fall below the national average, and the median for the whole State is around $8,800. So we’d better be careful making assumptions from a State-wide view, because the Choropleth averages some very diverse costs over some very large chunks of real estate.
To see just how diverse, let’s rank the entire US values from smallest to largest, and highlight where the Texas readings fall within that range:
What can we tell from this? Firstly, nearly all regions nationwide fall between $5k and $10k. Secondly, there are a few outliers that really skew the picture at the high end. Thirdly, in the Texas case, the State average is boosted somewhat by 3 Texan districts that happen to be among the worst 10 culprits nationwide – one of which is clearly an outlier at $15k. Unfortunately for the lower cost Texan regions, they’re guilty by geographical association…kinda like being kidnapped and held for a zillion dollar ransom, just because you happen to live in the same State as Bill Gates.
So what do we get by aggregating to State boundaries? Probably more blurring than insight. After all, what good would a weather report be to Texans if it only reported the average weather they could expect as a State? Instead, it’s better to keep the aggregation at the Hospital Referral Region level. That way, we can look at this:
…and ask things like “Wow, why such a difference between Waco and the surrounding bits of Texas?” and “What the hell is Alaska doing there?”
Legends in the making…
What’s far worse than this, though, is that when looking at the State-wide map, the legend is now really, really wrong.
Here’s the legend next to the actual State-wide figures, for comparison:
Whoops…the graph title has changed to reflect we’re now looking at Medicare spending per beneficiary per State; i.e. State averages. The legend is still looking at Hospital Referral Region averages, which have a much greater spread. For instance, the Choropleth shows six States as being dark green regions, and the legend says they fall somewhere within $9k to $16k. But the actual data shows they fall in a $9.4k to $9.6k range. Oops! Slight misrepresentation, there.
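Why does a region-level legend go so wrong on State averages? Averaging pulls extremes toward the middle, so the averages span a much narrower range than the values they were built from. Here is a small Python sketch of that effect. The three "states" and their region values are invented, and it uses a simple unweighted mean, whereas a real State figure would presumably weight each region by its number of beneficiaries.

```python
from statistics import mean


def spread(values):
    """Return the distance between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


def state_averages(regions_by_state):
    """Return the unweighted mean of the region values for each state."""
    return {state: mean(vals) for state, vals in regions_by_state.items()}


# Hypothetical per-region spending figures, grouped by made-up states.
hypothetical = {
    "State A": [6000, 7000, 14000],
    "State B": [5000, 8000, 8000],
    "State C": [7000, 7500, 8000],
}

if __name__ == "__main__":
    all_regions = [v for vals in hypothetical.values() for v in vals]
    averages = state_averages(hypothetical)
    print("Spread across regions:       ", spread(all_regions))
    print("Spread across state averages:", spread(averages.values()))
```

A legend built for the first spread will lump most of the second into one or two colors, so the legend needs recalculating whenever the level of aggregation changes.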
How to fix it
Obviously this graph really should use a quantitative scale with equal increments; one that changes to reflect the selection that users make. What’s more, colors should have just enough variation so as to highlight any important differences, without being overwhelming or mistaken for camouflage.
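As a rough sketch of what "equal increments that change with the selection" could mean in practice, the Python below derives the band edges from whichever values are currently displayed, then looks up the band for a given value. The function names and the step sizes are assumptions of mine, not anything taken from the original map.

```python
import bisect
import math


def equal_breaks(values, step=1000):
    """Return band edges at equal increments that cover every value."""
    lo = math.floor(min(values) / step) * step
    hi = math.ceil(max(values) / step) * step
    if hi == lo:  # all values sit exactly on one edge: keep a single band
        hi = lo + step
    return list(range(lo, hi + step, step))


def band_index(value, breaks):
    """Return the 0-based band containing value (top edge goes in the last band)."""
    index = bisect.bisect_right(breaks, value) - 1
    return min(max(index, 0), len(breaks) - 2)


if __name__ == "__main__":
    # Region-level view: wide range, so $1k steps.
    print(equal_breaks([5200, 15900]))
    # State-level view of the six "dark green" States: narrow range, finer steps.
    print(equal_breaks([9400, 9450, 9600], step=100))
```

Recomputing the breaks from the displayed data is what keeps the legend honest when the user flips between views.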
But is a Choropleth Map the best way to present this data in the first place? If you want something for people to play with online, then maybe…but if you want to compare things very closely to other things, then maybe not.
For sure, a Choropleth Map looks cool, and it has what Tushar Mehta calls “natural context”. But from an analytical perspective, a Choropleth only really reports how one thing changes with regards to geography. If geography is a major determinant – or if you want to show people how things look in their own back yard compared to others – then perhaps this is the piece of kit you need. But if there are other factors that have much more sway on your data than geography, then perhaps not. For instance, we might want to see whether population density plays a significant part in Medicare returns, given the likely economies of scale from providing healthcare to densely populated regions vs. rural regions. Now’s the time to break out a scatter plot:
Hmmm…looks promising. (Note: I’ve used State-wide data for the above…ran out of time to track down densities in the different Hospital Referral Regions, which is what I’d prefer to do.)
Or we might want to zoom in on the best or worst offenders, and see just how different they are to each other, and to the median value:
Conclusion
I think a better, fairer Choropleth Map at the Hospital Referral Region level would be interesting. But I don’t think it would be enough. To quote from Stephen Few’s latest book Now You See It: “Color is good at drawing your attention to something if used sparingly, but is one of the ‘pre-attentive attributes’ that is not quantitatively perceived in and of themselves”.
Lines and 2D position, by contrast, are very precise ways to encode quantitative values.
So when it comes to answering the ‘Compared to what’ question, I don’t think you can beat this:
Choropleth Maps in Excel
For information on the implementation of Choropleth Maps in Excel, check out Tushar Mehta’s excellent resources.
For more information on the pros and cons of Choropleth Maps, check out the Clearly and Simply blog, where Robert has built on Tushar’s excellent approach to produce some great downloadable templates. He also offers advice on potential drawbacks of Choropleth Maps, such as:
- No visualization of development over time
- No information on exact values (unless you are implementing tooltips including the data)
- Very limited direct comparability of the regions
- Possible perception problems with regards to the size of regions (e.g. Rhode Island on a US map)
- Possible misinterpretation because the size of a region may have a greater impact on the user’s visual perception than the intensity of the fill color
- Requirement of real estate on a dashboard
His recommendation: carefully consider whether or not a Choropleth Map is the best visualization for your purposes. Check out his dashboard of Lithuania at a glance to see how he mitigates some of the potential problems by incorporating other graphs into the display.
I used Robert’s template to produce this State-wide Choropleth Map of total Medicare spending per enrollee, 2006, using the same Medicare ranges as the Choropleth that’s the subject of this post:
…
…then I replotted the graph using data that had been regrouped into $1k bands:
While I don’t advocate this approach, it’s interesting that even though this is aggregated to State-wide totals, you can see significant differences between the graphs.
Right, that’s it. I’m off to the Hospital to see someone about my writer’s cramp…
About the Author
Jeff is a Business Analyst from Wellington, New Zealand who has recently discovered a strong interest in Data Visualization. He swears by Edward Tufte and Stephen Few as much as he swears at Excel 2007. He’s so new to advanced Excel, that 2 years ago he had to ask a work friend what the dollar signs in $A$1 meant. Now that he knows that, he’s trying to find out what the dollar signs in $A$2 mean.
Note from PHD:
Thank you Jeff. Your passion and knowledge are truly outstanding. I have a whole pack of donuts waiting for you.
43 Responses to “Quickly convert numbers stored as text [tip]”
Additional tip,
Select column which contains text -> Data -> Data tools -> Text to columns -> Finish
Chandu
This one is particularly awesome if you have tens of thousands of cells to convert to numbers. Otherwise it can take Excel forever (minutes to half an hour or longer) to process an error correction.
This is definitely the best option, and has the added benefit that you can use it to convert text to numbers, and numbers to text, depending on whether you choose General or Text before clicking Finish!
When you use this method, it's worth making sure that there are no delimiters selected (just in case).
Another method is to do a find & replace (I routinely replace zero with zero, or . with .)
ALT+A+E>>enter>>enter will do it 🙂
OMG Thank you Chandu! I was struggling with this so much with my big datasheets and now I am so happy! so funny.
In my excel though, it is the same pattern but it is:
1) Choose the data tab
2) highlight the column you want to change
3) choose "text to columns" in the ribbon
4) select fixed width
5) enter
6) no crashes!!
Often the text which you want as a number will have a decimal point. In that case: Select column which contains text -> Data -> Data tools -> Text to columns -> also select apostrophe as a delimiter -> go to Advanced and add the decimal point -> Finish. Voila, it works. You can then format as currency etc. Worked in Ver 2013. Seems that MS is degrading some important functions so as to get users to upgrade to 365.
Sir, how do I convert text to numbers in Power Query?
I had never thought about multiplying the numbers by 1 before. Great tip. For those who love macros, I found a very well written VBA macro by Ejaz Ahmed (StrugglingToExcel.com). This macro not only converts the numerical text to numbers but also formats dates and trims the values (getting rid of those nasty leading/trailing spaces). Plus you can apply this to multiple columns at the same time! I immediately added it to my QAT bar and use it almost daily with my data extracts. Check it out!
http://www.thespreadsheetguru.com/the-code-vault/2014/8/21/convert-numbers-stored-as-text
This tip is awesome! But one thing I run into constantly is the need to convert text to number and keep the leading zero, if there is one. I work a lot with SSNs and zip codes, etc. Any help, much appreciated!!
Hi Sue,
Let your zip code (length 5) be in Column B, then select Column B and go to 'Format Cells' (CTRL + 1) - Number - Custom - enter 00000 in the Type field.
Now put, 15 in cell B1 and it will show 00015.
Hope this will solve ur query.
1) SSNs and ZIP codes are not numeric. They are meant to be character based identifiers. With numbers, leading zeros to the left of the decimal are not significant and are truncated. It may sound terribly picky of me to bring up the distinction, but I've learned that it does make a difference in some cases. (Especially when delivering to a client who is attempting to extract and then load your data into a different DBMS.)
2) The option described by SAURABH below (custom format, 00000) will work in Excel but it's only displaying the number as '00015' while the actual value of the cell will still be 15 because you have converted it to a number and excel will pay attention to significant digits (see above.) Meaning, if you "Paste special" with values only into a new cell, it will paste '15' into the cell rather than '00015', which could lead to problems depending upon how you need to carry them into new work. Your client's ETL process may bring in '15' rather than what you intend '00015.' I usually leave SSNs and ZIP as text, that way leading zeros (and dashes in the case of SSNs) are preserved.
Better than #2 (don't waste time dirtying and clearing a cell)... Copy a blank cell (really blank, not containing anything), select cells to change, Paste Special Values, Operation Add.
+1. Excellent tip.
An alternative to multiplying the numbers by 1 is to add 0 instead by using the same process as the multiplication method. At least you can save a step. You don't have to enter a 1 to multiply. The blank cell is 0 and can be added to change text to a number.
One thing I find very handy when doing this, as I often have intermittent numbers as text:
Select the first cell that has the warning flag, then ctrl-down arrow to the last one. Once you have it all, especially if it's thousands of cells, it's annoying to scroll back up to get to the flag. If you apply formatting, such as setting the cell fill color to none, it automatically takes you back up to the top without losing your selection.
Good Tips !
I knew both the tips before. I like the tip mentioned by "Jon Peltier". Amazing + Awesome 🙂 He steals the Show .. I mean this Post 🙂 😉
Thanks for contribution to all !
Regards,
Rahim Zulfiqar Ali
Love this post. I often have to export SAP Reports to Excel and then do various sorts and lookups. This text issue has been driving me crazy. I especially like Jon Peltier's method where we can add a blank cell via paste special. Will be forwarding this tip to my friend Michael Martin.
yes, Michael Martin was impressed!
I also benefitted from Ctrl Alt V to paste special. I don't know how I have missed this one all these years. This is something we do a lot around here.
Hi again! One reason I asked about the leading zero is not so much for display purposes. I realize that '15' formatted to '00015' still has a value of '15'. That is one of my issues!! I receive data from multiple sources and need to constantly do look-ups and queries, and I drive myself crazy trying to format the different spreadsheets so they can 'talk' to each other. Does anyone have a tried and true process for syncing up columns where numbers are stored as text? For me, it's SSN, personnel number, and zip. And, I think in some of the sources they ARE numbers and some (like our DB queries) come in as text. aaargh. And, thank you in advance.
Ah, I see. I'd format anything ZIP or SSN related to text, and then clean up my lookup tables to be in that format as well. To convert those '15's back to '00015's there are two ways that I use.
1) If I'm reformatting an entire workbook, I actually DO use the custom format to adjust some columns, like zip codes or dates in an unusual format (like '21 Aug 14', etc.). When I have everything like it needs to be, I save the result as a .CSV file. When Excel saves a custom format as .CSV, it defaults to value (the display option) and discards value2 (the underlying actual value). Text files have no record of formats; they are lean by their nature, so when you open one, Excel will attempt to interpret each column. Because of this, I close the CSV, and I use the text import wizard (Data Tab > "From Text") to bring in the CSV. The import wizard lets me specify which columns I want to be Characters and which I want to be numbers (and it does dates as well for good measure).
2) If I'm only able to reformat a single column, I'll usually do that via a Macro, for example, if the zip codes have had their zeros truncated by excel, I format that column as text, and run this "padding" macro that I wrote for that purpose:
Sub Pad_to_X()
    'This will insert zeros in front of a number.
    'X is the length of the entire number plus zeros,
    'so if you have 1 and want 001, X would be 3.
    'Format the column as text first, or Excel will strip the zeros again.
    Dim answer As String
    Dim x As Long
    Dim lenc As Long
    Dim cell As Range

    If TypeName(Selection) <> "Range" Then Exit Sub

    answer = InputBox("Enter X, and the selection will be padded with leading zeros to X characters")
    If Not IsNumeric(answer) Then Exit Sub   'cancelled, or not a number
    x = CLng(answer)

    For Each cell In Selection
        If Not IsError(cell.Value) Then
            lenc = Len(cell.Value)
            'Blank cells are skipped rather than filled with zeros
            If lenc > 0 And lenc < x Then
                cell.Value = String(x - lenc, "0") & cell.Value
            End If
        End If
    Next cell
End Sub
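If the data passes through a script before it ever reaches Excel, the same padding can be done there. As a rough Python equivalent of the macro above (the function name `pad_key` is my own, and treating whole-number floats such as 2140.0 as 2140 is an assumption about how the values arrive):

```python
def pad_key(value, width):
    """Return value as text, left-padded with zeros to `width` characters."""
    # Whole-number floats (e.g. 2140.0 from a numeric column) lose the ".0"
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    text = str(value).strip()
    if not text:  # leave blanks blank instead of turning them into zeros
        return text
    return text.zfill(width)


if __name__ == "__main__":
    for raw in (15, "00015", " 2140 ", 2140.0, ""):
        print(repr(raw), "->", repr(pad_key(raw, 5)))
```

Running every source's ZIP or SSN column through one function like this means all the lookup keys end up as text of the same length, whichever way they were stored originally.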
Cleaning data from multiple sources is fun. Often a lookup will bomb because the table or the dataset has characters that mean something to the client's software (or webpage,) but are "invisible" when you look at them in excel. In those cases, I recommend trimming leading and trailing whitespace, and looking for and removing chr(160) (HTML Non-Breaking whitespace.) Those are the most common.
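For anyone doing this cleanup in a script rather than in Excel, here is a minimal Python sketch of the two fixes just mentioned: dealing with chr(160) and trimming leading and trailing whitespace. The name `clean_key` is made up, and swapping the non-breaking space for an ordinary space (rather than deleting it outright) is my own choice so that words don't get glued together.

```python
NBSP = "\u00a0"  # chr(160), the HTML non-breaking space


def clean_key(text):
    """Replace non-breaking spaces with normal spaces, then trim both ends."""
    return text.replace(NBSP, " ").strip()


if __name__ == "__main__":
    dirty = NBSP + "Waco " + NBSP
    print(repr(dirty), "->", repr(clean_key(dirty)))
```

Apply the same function to both the lookup table and the incoming data, otherwise the two sides can still disagree.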
Sometimes the character in a cell is something you've never considered, and in the cases where a value isn't found in a lookup table, I'll run this "decode string" macro to split out the cell value and display it by its ASCII equivalent. It'll identify any weird relics, which you can then sweep for with a cleaning macro:
Sub Decode_String()
    'Lists every character of the active cell, with its character code,
    'on a sheet in a new workbook.
    Dim instring As String
    Dim n As Long
    Dim ws As Worksheet

    'instring = InputBox("String to Decode?")
    'Read the text before adding the workbook, because that changes the active cell
    instring = CStr(ActiveCell.Value)

    Set ws = Workbooks.Add.Worksheets(1)
    ws.Range("A1").Value = "Position"
    ws.Range("B1").Value = "Character"
    ws.Range("C1").Value = "ASCII decode"

    For n = 1 To Len(instring)
        ws.Range("A" & n + 1).Value = n
        ws.Range("B" & n + 1).Value = Mid(instring, n, 1)
        ws.Range("C" & n + 1).Value = Asc(Mid(instring, n, 1))
    Next n
End Sub
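The same "decode" idea is handy outside Excel too. A short Python sketch, with a made-up function name, that lists each character alongside its position and code so that oddities like chr(160) stand out:

```python
def decode_string(text):
    """Return (position, character, code) for each character, counting from 1."""
    return [(pos, ch, ord(ch)) for pos, ch in enumerate(text, start=1)]


if __name__ == "__main__":
    for pos, ch, code in decode_string("A\u00a01"):
        print(f"Position {pos} is {ch!r} or chr {code}")
```

Note that `ord` reports the Unicode code point, which matches the ASCII codes for plain characters and gives 160 for the non-breaking space.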
Multiply by 1, divide by 1, add or subtract 0 all do the trick... ;p
Select the column and hit ALT+D+E and Finish till dialogue box disappears. We are good to go
Instead of typing 1 and copying it, just copy any blank cell and go to paste special and add.
It happens that I need to attach some data from external source over and over again and the data comes in text format. If these procedures are too difficult, I use extra column to have values in numbers. The formula is very simple: =value()
You go from "sometimes text-numbers may be scattered across the worksheet, making selection of cells a pain." to "3. Select all the cells that have text-numbers."
You just said it's a pain to select the cells... so you instruct us to do it anyway? Am I the only one who fails to see logic here?
@Belgianbrain:
You don't have to individually select such cells. You can select the entire range that contains such data and do it. That is what I mean by "Select all the cells"
Good Tips! Thanks for sharing, very useful when SAP reports must be exported
Put zero in a cell >> Copy >> Paste Special >> Add
...will also do that.
Awesome... Chandooo..
But if the data is too large, you can use the function =VALUE() and then use Paste Special.
Else this trick is superb.
Love this website!!!!
The "enter 0", copy - paste special - add method puts a zero in blank cells.
Similarly, for the enter 1- copy paste special multiply.
The copy blank - copy - paste special - add method does not seem to have this drawback.
Thank you.
If you do use the first option (click on 'Convert to number') and you're working in a large model, make sure you turn on the Manual calculation mode. Otherwise Excel will recalculate after each converted number. This can be really annoying if a model takes 0.1 seconds to recalculate and you just told Excel to convert a couple of thousand numbers!
The Value() formula works fine as well, which is good for external data, or to be used in LOOKUP functions. The other way around, by the way, if you must LOOKUP a value and you're looking it up in an array of text values, I use =TEXT(A1,"0") to convert a value to its string equivalent.
I also love the Add 0 tip, I hadn't thought of that!
[…] If you import data that has numbers formatted as text, Chandoo shares a quick tip for fixing them. […]
Really a useful tip. My users use SAP as input, where it is often tedious to select all the cells and then go back to the top row to convert numbers.
Really a very useful tip.
To those keyboard people:
To access this "error handling menu", press the Alt + Right Click Button, then press the "C" to convert to number... this last "C" is in portuguese, I dunno what the equivalent in english is.
Also another thing I learned together with this trick. If you are using one of these new keyboards that doesn't have a Right Click button, the equivalent to it is Shift + F10... so it would be Alt + Shift + F10 then "C"
This tip (Tip#2: Paste Special Convert) saved my life! I've always done all my conversions the 1st way. Working across multiple sheets exported from my Accounting program, all the zeros are always listed as text. It was so frustrating - and then this tip came along! Thank you thank you thank you! I now just select all sheets, do the ctr+alt+v thing and voila! all text is now zeros!
Very very useful tip 🙂
Chandoo Tu Bhai hai
Thanks a lot
Hi All,
I have some data with months as column names and Planned hours, forecast for months, actual hours, ETC as row data. What I want is that whenever the user enters values in the forecast month (current month), it should color the cells, and when the user enters values in Actual hours (it will be for the previous month) it should fill the color. The cycle will continue. Also, as the months pass, the previously filled cells should have no format.
https://www.youtube.com/watch?v=IbVwhrfegcI
My Excel crashes to the point it is unusable. The only method I have found that won't crash my machine is find & replace (replace 1 with 1, and so on). I am not sure what causes this in my sheets, and it can even cause another machine to start having this issue. I assume it is settings or the data itself. I created a dump file off the process before crashing, which ended up being huge. Looking in the file with Notepad, some data I can't see. But towards the bottom of the file there is a ton of words - waiting on word wrap to finish in it to see what it says.
@aly
a few things
What type of PC is it?
How much RAM?
What version of Windows/Excel are you using ?
Do you know how many lines of data you have ?
Can you ask the question in the Chandoo.org Forums?
https://chandoo.org/forum/
Please attach the data file so we can give you more specific help