Cleaning Up Imported Data – A Recent Case Study
Yesterday in Formula Forensics 008 we looked at Elkhans MaxIf problem.
However the solution/formula as presented is the final solution to his problem.
Elkhans original worksheet contained other problems and today we will look at this:
I have attached the orginal file as a sample file you can Download Here.
You will see that the MaxIf cell F13 is returning 0, where it should be showing 0.246
Houston, We’ve Had a Problem!
Cell F13 has the same formula we looked at in yesterday’s Formula Forensics: =MAX(IF((Parameter_3=D13)*(Parameter_4=E13),Parameter_5,0))
A quick check of the formula reveals that everything was technically right with the formula, yet the answer is wrong?
To solve this I tried several steps which is the topic of this post:
Examine the logic of the If’s Criteria
The formula =MAX(IF((Parameter_3=D13)*(Parameter_4=E13),Parameter_5,0)) works by calculating the maximum value from the If array.
So step 1 was to look at the logic in the If’s Criteria
That is (Parameter_3=D13)*(Parameter_4=E13)
In cell F15 I entered = (Parameter_3=D13)*(Parameter_4=E13) followed by F9
Excel returns: = {0;0;0;0;0;0;0;0;0;0;0}
This tells me that none of the Cells match the criteria, strange?
Yet manually I can see 4 matching records, below:
Check Cell Length
The next quick step was to look at the length of the text in each cell.
In Column I, I added a =Len(E2) and copied down, there was only 2 characters in each cell, this step eliminated leading or trailing spaces.
Retype the Data
Elkham supplied the source data in an Excel file.
But the Criteria was manually typed by me.
So the next step was to retype some of the original data in Cell E2
Wow an Answer, So obviously there was a difference?
What is Wrong Here?
So obviously there was a difference between the C1 in cell E2 and the C1 in cell E13?
I checked this in 3 ways
1. Type the value “C1″ into Cell E2, without the quotes
This returned an answer 0.08 from F2 as it should have.
2. Copy an old “C1″ value to E13
This resulted in the correct answer of 0.246 in F13
3. Use a quick Formula
Entering a quick formula
In F17 type =E2=E13
Excel returns False
Showing that the value of cell E2 does not match E13
So what is in E2:E12 ?
As I had typed the values into the Criteria Cells D13:E13, I knew what they were, they were a plain and simple “C1″
So what was in E2:E12 ?
Next step was to look at the Ascii values of the 2 characters in Column E.
In K2: =Code(Left(E2,1))
In L2: =Code(Right(E2,1))
Copy both down to Row 13
Bingo !
The Character C in cell E2 wasn’t the same as the Character C in E13 ?
Yet both cells contained a Calibri Font.
If I now type in a spare cells:
F18 =Char(63), Excel displays a “?”
F19 =Char(67), Excel displays a “C”
Yet Cell E2 is clearly displaying C1 with a First Character Ascii code of 63 which should be a ? mark.
I suspected that it had been copied and pasted from MS Access, So I shot an email back to Elkhan, asking “What the source of the data was?”.
The response came back that “The data had been copied from a Russian (Cyrillic) version of an MS Word File and pasted into an English version of Excel.”
I can’t explain what has happened but somehow the character sets and associated values got scrambled when copied the data from the Russian Word Document into Excel
If you have had experiences like this or can explain what has happened please do so in the comments below:
Solution
The Solution was now easy
Use Search/Replace
Copy the contents of cell E1,
goto Search/Replace or Ctrl H
Find: Paste the contents of Cell E1
Replace with: C1
Conclusions:
 Be careful when receiving data from foreign language files, including word and Excel files
 Check summations based on such data to ensure its integrity
 Be methodical in tracking down problem cells
Lets us know about your Data Transfer Nightmares
Have you had any strange data transfer issues?
Let us know in the comments below.
 
 

Leave a Reply
Formula Forensics No. 008 – Elkhan’s MaxIf  Excel Links – Live from Bangkok Edition 
31 Responses to “Cleaning Up Imported Data – A Recent Case Study”
My first thought on seeing the data was that it was imported as text from some source, and the centering alignment of the cells in the table masked this. You should use default alignment so you can tell by eye whether any numbers (right aligned) are actually saved as text (left aligned).
@Jon
The data area is, as I received it
I can’t comment on what happened after he original copy/paste
@Hui
Here are a couple of web sites that discuss Excel’s handling of Unicode characters:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=ExcelUnicodeData
http://www.dailydoseofexcel.com/archives/2010/11/19/unicodeandvbaschrwandascwfunctions/
HTH
I’ve had a problem before with data imported from a business database. When I tried to use the vlookups I had always used with this download, many were coming up with errors(#N/A). For some reason the spaces between the text were a code 160 ‘non break space’ instead of code 32 ‘regular space’. I used the code formula evaluate each character in the cells bringing back errors.
Not an import issue, per se, but I have had someone type numbers into a column with some of the cells formatted as Text. That was fun.
Any idea why Excel displays a “?” as “C” in Calibri? How is that possible?
I too have had problems with data brought in from a database when I tried to use vLookup. I would do a vlookup on an id field and it would return the wrong value. What I thought was happening was that it was bringing in the id as a text and comparing it against a number. However, I found that in some cases even when I converted both to text/text or number/number, it would still sporadically give me the wrong lookup value. I didn’t have time to truly troubleshoot the problem so I stopped trying to use formulas to match the cells.
@Juanito:
There are two tables to look at, the standards ascii set: http://www.asciitable.com/ and the Cyrillic set: http://www.asciicodes.com/cp855.html
At a guess, someone put in a Cyrillic 228, which looks like a capital C. There isn’t a standard ASCII equivalent, so it was converted to a “?” in the transfer.
I have used lookup on a large stock export spreadsheet that had stock codes with wildcards in (per the actual) in the middle of the a large data set. Data came out fine but you cannot search or refer to any cell.
It ages to find the problem that vlookups and search functions don’t like lookup’s with wildcards anywhere in the data.
Oh and also the preceding space or zero’s in numbers from imports that also does not help matters!
Why mess around with formulas when you can use a pivot table?
While a pivottable would make such a task easier, using this formula in a dashboard would be perfectly reasonable.
Another reason is that you can do calculations on the worksheet using this method that you can’t do in a pivot table. You can have formulas side by side that give aggregates for different groups of records, even combining the different groups into one formula (I typed an example of that in the comments below comparing male and female salaries).
As far as I can tell, Asa, all of that could still be achieved with a pivot table.
I’m not a pivot table expert. But in terms of comparing different groups of records – you can do that on a single pivot table; but only I think if the groups don’t overlap (they can be described in a hierarchy or perhaps the intersection of two hierarchies). My salary example could go in a pivot table as far as comparing male and female salaries by position. But I couldn’t see how to calculate the difference between male and female salaries on a pivot table in a test just now, even with a couple “interesting” calculated field attempts. Pivot tables are great and sometimes easy solutions for some things, but sometimes you want just a bit of summary info, not a whole report, or various bits of summary info that are not closely related and wouldn’t appear on the same pivot table together. Once you get the basic concept of array formulas and using aggregate and if() functions in them, it might actually be easier than creating a pivot table for a small set of criteria.
@Asa Use calculated Items instead of Calculated Fields. When using Calculated Items, make sure you don’t have any of your data grouped otherwise Excel will throw an error.
@Business Bear
I think it comes down to the users preferences and familiarities.
You can do this and many other similar functions in Pivot Tables and probably just as quick as writing a custom formula.
.
As Cameron pointed out, the ability to use a custom formula to derive a value for later inclusion in a dashboard or other report is probably the most powerful reason to learn custom Excel formulas.
.
I personally, rarely use Pivot tables except when summarising large data sets, and even then I generally prefer my own formulas.
.
The purpose of the Formula Forensics series is to explain how custom formulas work and so hopefully expand peoples knowledge of what and how to achieve various results in Excel.
.
Thanx for your contribution to the discussion.
@Hui, I forgot to take into consideration that this is the “Formula Forensic” series.
I often analyze large data sets of schedule and financial data, and use a combination of pivottables, tables, and formulas. For the majority of my adhoc analyses I use pivottables, but my dashboards are all driven by formulas. I tend not to use array formulas because of performance issues though, but I do use them sparingly.
Instead of array formulas why not use Excel’s database formulas (ie DGET, DSUM, etc.)?
Most users are at least semicomfortable with pivottables, but once you start throwing around array formulas, all bets are off.
@Business Bear
Nope, This was a regular post
“Formula Forensic” posts are clearly labelled as such
I think you’ve posted this against the wrong post though?
I’ve run into Unicode characters in in pasted and imported data before, one time with helping a forum member here. Here are UTF16 UDF functions equivalent to ASCII’s CODE() and CHAR(), — untested — complements of Vladimir/ZVI @ http://www.mrexcel.com/forum/showpost.php?p=2120281&postcount=24 –
Function Uni2Hex(Txt As String) As String
Dim b() As Byte, i&, j&
b() = Trim(Txt)
j = UBound(b)
For i = 0 To j Step 2
If i < j Then Uni2Hex = Uni2Hex & Format(Hex(b(i + 1)), "00")
Uni2Hex = Uni2Hex & Format(Hex(b(i)), "00")
Next
End Function
Function Hex2Uni(HexCode As String) As String
Dim b() As Byte, i&, j&, s$
s = Replace(HexCode, " ", "")
s = Replace(s, "0x", "")
j = Len(s)
If j <= 4 Then
ReDim b(1 To 4)
b() = ChrW("&H" & s)
Else
ReDim b(1 To 8 )
b() = ChrW("&H" & Mid$(s, 1, j – 4)) & ChrW("&H" & Mid$(s, j – 3))
End If
Hex2Uni = b()
End Function
The above function identified the mystery letter as Unicode 0×0421, which is Cyrillic Capital Letter Es and looks like C (actual character: ?).
Hi chandoo and friends!!
I’m new to chandoo but i think u’re a true lifesaver. The above article like many others was very useful, especially the methods you described to track down problem cells.
Also, I wanted some help on a formula. I was thinking maybe you can give us a formula to calculate the payback period of an investment. For example if i invest 100,000 today and recieve 20,000, 50,000 and 60,000 in Year 1, 2 and 3, the payback is 2.5 years or 2 years and six months. I found some formulas elsewhere and also developed my own using “Match”, “Offset” etc. but it was vey complex. I wonder if a simple and precise formula is possible. I’ll share my own also if u want me to.
Thanks
Zeeshan
Zeeshan,
While not really related to the topic of this post, here are a couple methods for calculating payback.
Type this in A1:C5
A B C
1 Year Cash in(out) Payback
2 0 100000
3 1 20000
4 2 50000
5 3 60000
While in cell B1, enter this named formula =Sheet1!$B$1:$B1, name it rng.a
In C2 paste in this formula
=IF(SUM(rng.a)>0,ABS(SUM(rng.a)rng.a)/rng.a+COUNT(rng.a)2,0)
copy down to C5. C5 will = 2.5
The next method uses 3 named ranges/formulas and one cell to provide the answer
Keep the data you just typed in A1:C5
In the name manager
1/ enter this dynamic (optional) range, named d
=Sheet1!$B$2:INDEX(Sheet1!$B:$B,MATCH(1E+306,Sheet1!$B:$B,1))
2/ enter this named formula, named dMMULT
=MMULT(–(ROW(d)>=TRANSPOSE(ROW(d))),d)
3/ enter this named formula, named MatchRow
=(MATCH(TRUE,dMMULT>0,0))1
Close the name manager and then in any cell enter this formula
=(MatchRow1)+ABS(INDEX(dMMULT,MatchRow))/INDEX(d,MatchRow+1)
The result will be 2.5 as well.
I tested a few examples with each and they all worked out as expected. Let me know if you encounter otherwise.
The “Type this into A1:C5″ part didn’t format as expected.
Just to be clear, enter in:
A1:A5……..”Year”; 0; 1; 2; 3
B1:B5……..”Cash In/(Out)”; the investment and cash flows
C1:C5……..”Payabck”; the formula =IF(SUM(rng.a)>0,ABS(SUM(rng.a)rng.a)/rng.a+COUNT(rng.a)2,0)
Thanks
@Mike86: thanks!
What is “MaxIf” for??
Can someone give me a scenario when I would want use it? It seems very interesting.
@Asa, Thanx for the response
.
@Isela, Did you read the Formula Forensics Post ?
http://chandoo.org/wp/2012/01/24/formulaforensicsno008/
That also shows examples of its use.
Hi Isela,
“MaxIf” gives the maximum value of a subset of the data that meets a certain criteria.
For example, you can get the maximum salary for employees if they are female (maximum salary of female employees).
=MAX(IF(Sex=”FEMALE”,Salary))
Where Sex and Salary are named ranges for the data area for that field/column. CtrlShiftEnter to confirm entry of the array formula.
To further break down your analysis for females for some particular positions (here the * behaves like the word “and” to specify multiple criteria):
=MAX(IF(Sex=”FEMALE”*Position=”MANAGER”,Salary))
If you wanted to see the difference between the best paid female manager and the best paid male employee:
=MAX(IF(Sex=”FEMALE”*Position=”MANAGER”,Salary)) – =MAX(IF(Sex=”MALE”*Position=”MANAGER”,Salary))
@ Kyle
Thanks a lot man. Herez wat i had initially done. row c32 to r 32 contain the cashflows. I add one row where c33 to r33 had the cumulative cashflows. on B34 i wrote”=MATCH(0,D33:R33,1)” and named it “p_year”. This gives me the year figure. Now i enter the payback formula”=CONCATENATE(p_year,” years and “,ROUND(((OFFSET(C33,0,p_year))/OFFSET(C32,0,p_year+1))*12,0),” months”)”. This give me the actaul back in years and months.
let me knw what do u think.
ps I’m sorry guys for writing all this in the wrong topic but i’m not sure where to discuss this!!!
biggest problem i had is trailing spaces and values as text. i now use “trim” and “clean” and “value” with troublsome and untrustworthy data sources.
current most interesting data issue was a cut and paste of data with a column of dates from Excel in Mac to Excel in Windows. they were all off by 1,462 days. had to change the base date in the Mac Excel from 1904 to 1900 to match the base date in Windows Excel.
The solution i’ve always used when importing data from access is to set up a macro to export all the data as text and then import the text file to excel, this brings the fields across as text and allows vlookups etc to work as intended without mucking about, using a simple bit of VB code as below allows you to run the export macro from access (or any other macro) meaning after the initial setup oin access you can refresh the data set entirely from within excel
Sub RunAccessMacro()
Dim strDatabasePath As String
Dim appAccess As Access.Application
strDatabasePath = “DB path”
Set appAccess = New Access.Application
With appAccess
Application.DisplayAlerts = False
.OpenCurrentDatabase strDatabasePath
.DoCmd.RunMacro “Macro Name”
.Quit
End With
Set appAccess = Nothing
End Sub