• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Extract duplicate records and Calculate Z-static Chi-Square Static

persol

New Member
Hi folks,

For a long time I have been using a simple formula that helps me detect and filter out duplicate records as in a accounts payable (duplicate invoices or payments) or an accounts receivable file with duplicate invoices or amounts, etc. The formula I have been using is: IF(COUNTIF(B:B,B2)>1,"Duplicate","") or IF(COUNTIF($B$2:$B$200,B2)>1,"Duplicate",""). The problem I have is that this formula eats up memory or computing resources calculating large volumes of data as would be with records upwards of 15thousands raws. Forget larger files, I usually leave my laptop working the numbers overnight!!! Yea guys, you see where I am going with this?


QUESTION (1): So, I was wondering if anyone out there or even Chandoo himself could tackle this problem. How should I find and isolate duplicate transactions - faster and efficiently?


QUESTION (2): Iam curious to know if anyone knows how to determine statistical computations such as the famous Z-STAT AND CHI-SQUARE? I own an old mac laptop loaded with Excel for mac 2011 and it does not have the MS Excel addin. Does any one know how to address this?


RESOURCES: For your reference I have uploaded a file MediaFire.com (see link below) with the above formulas and fake data in (Chandoo_Test (Duplicate and Stats).xls) so that anyone out there can play around and test it. The test data is located on columns A through G, my duplicate results are in columns H through J. Finally, my statistical questions (if anyone knows how to find these stats would go on columnsK and L.


Chandoo_Test (Duplicate and Stats).xls

http://www.mediafire.com/?4bqckej5088snx1


I will greatly appreciate your help, Cheers

persol
 
Hi persol,


I apologize if I'm stating the obvious but if you have Excel 2007 or later, you could use the Remove Duplicates tool in Data Tools. Mr.Excel had a youtube cast on the feature here:


http://www.youtube.com/watch?v=JK-KtWscBX4


He also has a page on his site concerning other methods:

http://www.mrexcel.com/tip138.shtml


As to the statistical questions, I don't get into that myself -- but I'm guessing that the formulas native to Excel =Z.TEST or one of the =CHISQ.[varies] don't work for your example? Is there a reason why they won't?


In any case, good luck!
 
Hello Daffy333,

great suggestions. I found part of my answer. The trick I will use going forward is the conditional formatting. it does not involve formulas (which takes time computing in large files), the conditional formatting can also be filtered by color an column - which was my main need. I just tested it with a 1million raws and 20 column file and it was faster than ever!!! I really appreciated that tip, thank you so much.


Now, the fun part, the statistical chi-static and the z-test. I found those formulas within Excel but am not sure how to apply on a file that is similar to the one I posted on the website for all to see. The Excel formula calculates only on static set of data not on sequential sets of data as is the case with my data file. I am driving at simply getting the Chi-static and the z-test for each line item

then how do you interpret it? I looked it up in the internet and there is a ton of abstract reading material but it just did not make a lot of sense or show in plain English what the resulting outcomes mean if applied to the real world data. I was hoping that someone in this forum would be familiar with either statistical measure and perhaps share experiences or understand it a bit more.


Needless to say, I am thrilled by the duplicate coloring tool to find duplicates that Excel added, I say it was a great improvement. Thank you so much for guiding me to it.


Cheers,

Persol
 
Back
Top