• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Correlation using excel formula

ssuresh98

Member
Hello,

Can someone help me with this problem?


I have an array, say A 1 2 3

I would like to find the maximum correlation for this array in a database,

say B 2 3 4

C 3 4 5

D 1 2 3

E 3 2 1

Is this possible by using excel formula?


TIA
 
Tia


I think you are trying to compare data B, C, D & E to A

Try: =CORREL($B$1:$D$1,B2:D2)

Copy down


In your case they all correlate perfectly except E
 
Hi Hui,

Thanks for your help.

I am sorry I think I worded my problem incorrectly.

I have a master database with the following fingerprints.


B 1 1 0

C -1 0 0

D 1 0 -1

E 0 0 -1


There are only 3 variables (1, 0, -1)

I run an assay for an unknown sample and generate fingerprint

eg, A 1 0 -1

the problem is, I want to query this fingerprint with those in the database and find the sample which has max correlation to A (Answer is D).

Is it possible to do this using excel formula?


TIA
 
A few questions:

Are there letter codes for all 9 possible combinations, or is it just the 4 above?


If not, how do you determine "best"? e.g. the series 1 0 0 is equally close to B and D, which would you want as your result?
 
Here is a quick method

https://www.dropbox.com/s/q0kxu7cz0xdg907/Max%20Correl.xlsx


George F, Best means highest correlation ie: closest to 1
 
Hi Tia,


I assume your data is arranged as follows from A1 to D4:


B C D E

1 -1 1 0

1 0 0 0

0 0 -1 -1


Where B,C,D,E are your samples and there are three unique variables (0,1,-1) which can be present in any combinations for each one of these samples.


Would that serve your purpose if we identify the number of unique variables for each one of these sample. The sample which has maximum number of unique variables (in this case it's can not be >3) should be considered to have maximum correlation with sample A.


If this approach is fine then write the following formula at A6 and drag across the columns:


=SUMPRODUCT(1/COUNTIF(A2:A4,A2:A4))


In Col C for sample D, the formula will return you 3. Hence, we can say sample D has maximum correlation with sample A.


Please let us know if this is fine.

Kaushik
 
Hui - in your spreadsheet if you put 0, 0, -1 in as your search value you get B as your output, not E. I think this is the only combination of -1, 0 and 1 that doesn't work for it though.

This is because the gradient of the two lines are the same, and this is what Correl() is comparing, so they both output 1, even though one of them is an exact match. This is carried forward in the Max Correl as Index returns the first hit (value of 1 it finds).


I can't think of a way of overcoming this without putting in a specific exception.
 
Hello All,

Thanks so much for helping me out.


Hui's max correl is the closest to what I need.

We have a reference database for fruits. Each fruit has its own unique fingerprint. Now we have some unknown samples which we need to identify. Our assay generates data which needs to be queried against each fruit in the reference database (SNP1 to SNP1, SNP2 to SNP2 etc) and return the name of the fruit which has max correl. In reality we have 96 SNPs to query for each sample, to simplify I have listed only 10 SNPs.

Since we have 1000s of samples to query Hui's format will not work out. I have posted the data in the required format.

Many thanks for helping out.


https://www.dropbox.com/s/g52hdegk1o61yoq/Max%20Correl.xlsx
 
Back
Top