So here is some news from the strange-but-true department: Microsoft Excel blamed for gene study errors [bbc.com].
Microsoft’s Excel has been blamed for errors in academic papers on genomics.
Researchers trying to raise awareness of the issue claim that the spreadsheet software automatically converts the names of certain genes into dates.
Gene symbols like SEPT2 (Septin 2) were found to be altered to “September 2”.
Aah, classic!
This is what happens when you spend countless hours learning genome sequencing and very little time learning the software tools where your data goes. Maybe we need to bring Clippy back to warn people about such sticky situations.
All jokes aside, here is a public service announcement for you. Beware of helpful features in Excel like AutoCorrect, Flash Fill, AutoFill, scientific notation etc.
Here are a few tips for you if you find yourself coding genomes in Excel (or something similar):
- Use TEXT format for data that contains possible dates, values that start with = etc. To set TEXT format, select the data entry range and use Home > Number > Text.
- This can deal with cells that contain possible dates, credit card numbers, very long numbers, leading zeros, fractions, and values that start with = (which Excel treats as formulas).
- When importing text files to Excel (like your genome sequence data or what have you), select text as data type for the columns that can be misinterpreted by Excel.
- If a cell starts with = and should not be treated as a formula, prefix the cell value with an apostrophe (').
- Disable features like Flash Fill, AutoComplete and automatic percentage entry if you must.
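If your data passes through a script before it reaches Excel, you can flag the risky values up front. Here is a minimal Python sketch of that idea. The month list, the patterns and the function name `risky_reasons` are my own assumptions for illustration; they are not Excel's actual conversion rules, so treat a clean result as a hint and not a guarantee.

```python
import re

# Month-like prefixes that date detection may react to (an assumed, incomplete list).
MONTHS = "JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC"

# Each check pairs a human-readable reason with a pattern (all patterns are assumptions).
CHECKS = [
    ("looks like a date", re.compile(rf"^({MONTHS})[-/ ]?\d{{1,2}}$", re.IGNORECASE)),
    ("starts with =", re.compile(r"^=")),
    ("leading zero", re.compile(r"^0\d+$")),
    ("very long number", re.compile(r"^\d{16,}$")),
    ("may become scientific notation", re.compile(r"^\d+E\d+$", re.IGNORECASE)),
]


def risky_reasons(value: str) -> list[str]:
    """Return the reasons a value might be auto-converted; empty if nothing matched."""
    text = value.strip()
    return [reason for reason, pattern in CHECKS if pattern.search(text)]


if __name__ == "__main__":
    for symbol in ["SEPT2", "MARCH1", "TP53", "00123", "=A1", "2E3"]:
        print(symbol, risky_reasons(symbol))
```

Anything that comes back with a reason is a candidate for the TEXT format or the apostrophe prefix described in the tips above.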
Help the hapless, share your tips
Now it's your turn. Please share your tips to handle situations like these. Post your tips in the comments box.
More reading:
Before you embark on saving sensitive stuff in spreadsheets, soak up some survival skills:
7 Responses to “PSA: Don’t let auto correct spoil your party”
Thanks as always for your tips!
My colleagues have already noticed that I'm getting awesome in Excel 😀
#1 tip: Check your data. Don't assume that your software tool, whichever it is, matches your naive expectations of perfection, or that you have always used it correctly.
There's a skills certification called "Spreadsheet Safe" and one of their points is "validate CSV file imports."
http://www.spreadsheetsafe.com/us_main_page_section/services/
Most, but not all of these data files are imported as text or csv formats from instrumentation such as DNA sequencers, gene microarrays or proteomics screens. So your second solution is the correct one in this case: "When importing text files to Excel (like your genome sequence data or what have you), select text as data type for the columns that can be misinterpreted by Excel."
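One way to "validate CSV file imports" is to compare the file you imported with a CSV exported back from the spreadsheet, cell by cell. The following is a rough Python sketch using only the standard `csv` module; the function names and the sample data (including the `2-Sep` value) are hypothetical and only illustrate the comparison.

```python
import csv
import io


def read_cells(text: str) -> list[list[str]]:
    """Parse CSV text; the csv module keeps every cell as a plain string."""
    return list(csv.reader(io.StringIO(text)))


def changed_cells(original: str, exported: str) -> list[tuple[int, int, str, str]]:
    """Return (row, column, before, after) for each cell that differs.

    Rows and columns are 1-based. zip() stops at the shorter input, so
    missing rows or columns are not reported by this sketch.
    """
    diffs = []
    rows = zip(read_cells(original), read_cells(exported))
    for r, (row_a, row_b) in enumerate(rows, start=1):
        for c, (a, b) in enumerate(zip(row_a, row_b), start=1):
            if a != b:
                diffs.append((r, c, a, b))
    return diffs


if __name__ == "__main__":
    original = "gene,count\nSEPT2,12\nTP53,7\n"
    exported = "gene,count\n2-Sep,12\nTP53,7\n"  # hypothetical re-export
    for diff in changed_cells(original, exported):
        print(diff)
```

An empty result means the compared cells survived the round trip unchanged; any entry points at the row and column to inspect.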
Once you are too deep in the matter, you often don't see simple mistakes or errors any more. So ask a (critical) colleague to test or review what you have done. Or explain to him/her how you did it. Allow "why" questions, please.
Alternative: apply some agile pair "programming" (also valid for VBA). 2 brains have more brainpower and 4 eyes see a hell of a lot more than 2.
I learned the hard way to import all fields as text. Certainly for dates: where server settings and local settings (per user!) are different, trouble is guaranteed! 04/05/2016 (Belgian date, 4th of May) and 05/04/2016 (American date, 4th of May) are the same day written two ways. Mix them up and you calculate negative service levels or lead times, hey?
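To make the ambiguity concrete, here is a small Python sketch that parses the same text both day-first and month-first. The function name `possible_readings` is made up for this example, and it assumes the dd/mm/yyyy or mm/dd/yyyy layout with slashes.

```python
from datetime import date, datetime


def possible_readings(text: str) -> dict[str, date]:
    """Return every valid interpretation of a slash-separated date string."""
    readings = {}
    formats = (("day_first", "%d/%m/%Y"), ("month_first", "%m/%d/%Y"))
    for label, fmt in formats:
        try:
            readings[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            # This reading is impossible, e.g. month 13.
            pass
    return readings


if __name__ == "__main__":
    print(possible_readings("04/05/2016"))  # two different dates
    print(possible_readings("13/05/2016"))  # only the day-first reading is valid
```

If a value has two readings, the file alone cannot tell you which one was meant. That is the argument for importing the column as text and converting it with an explicit format afterwards.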
Another version of the date issue I experienced. Data was exported from SAS to a CSV, but the dates only had a two-digit year. Therefore dates that should have been 2050 etc. were imported from the CSV as 1950!
Daveycroc, see the MS support pages for "How Microsoft Excel works with two-digit year numbers" and "Change the date system, format, or two-digit year representation".
Thanks Patrick. I dealt with it but it took some head scratching to figure out why so many mortgages had passed their expiry date! lol
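The two-digit year surprise in this thread comes down to a cutoff rule: years up to the cutoff land in the 2000s, the rest in the 1900s. Below is a minimal Python sketch of such a rule. The function name is hypothetical, and the default cutoff of 29 is my understanding of Excel's default setting, so check it against your own configuration before relying on it.

```python
def expand_two_digit_year(yy: int, cutoff: int = 29) -> int:
    """Expand a two-digit year using a cutoff (assumed default: 29).

    Years 0..cutoff map to 2000..(2000 + cutoff); the rest map to the 1900s.
    """
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year between 0 and 99")
    return 2000 + yy if yy <= cutoff else 1900 + yy


if __name__ == "__main__":
    print(expand_two_digit_year(50))             # 1950 with the assumed default
    print(expand_two_digit_year(50, cutoff=68))  # 2050 with a later cutoff
    print(expand_two_digit_year(16))             # 2016
```

The safer fix is upstream: export four-digit years so that no cutoff rule has to guess.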