• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

How VLOOKUP works!!

Hi Dada,

As VLOOKUP does a Binary Search, if you go on applying that technique on your data set, you will some how digest the logic of why Excel returning those results.

Regards,
 
Hi Misra ,

The following link suggests that VLOOKUP does not do a binary search when the fourth argument is TRUE.

http://www.excel-university.com/perform-approximate-match-and-fuzzy-lookup-in-excel/

The way that the function actually works when TRUE is selected is this: it walks down the list row by row, and ultimately stops on the row that is less than the value and where the next row is greater than the value. This is why the lookup range must be sorted in ascending order for the function to return an accurate result when the fourth argument is TRUE.
Narayan
 
Hi Misra ,

I have no idea , since I have not studied it ; I was merely giving a link to material on the Internet.

If you come across any other link which gives more information , please post it.

Narayan
 
@NARAYANK991 Sir,

Well I did not got an exact link to prove it here what @Colin Legg says on his post:
http://colinlegg.wordpress.com/2012/03/25/binary-searches-with-vlookup/

When VLOOKUP is doing an approximate match, it relies on the fact that the lookup column is sorted in ascending order. This means that it doesn’t have to search in a top to bottom manner; rather it can quickly eliminate chunks of data by jumping up and down through it (binary search). When VLOOKUP is searching this way it will return an exact match if possible, otherwise it will return the next largest value that is less than lookup_value.

So applying the binary search technique on the data set presented above, let say
I i.e. 79 is bigger than lookup value 78, so it will discard all the values in below range. In the next batch D 98 is in middle which is bigger than lookup value. So again it will discard the below values, No in the remaining junk of data 61 is middle value, which is smaller than lookup value so the only answer left is 71, so it is returning C.

This logic is working fine if you have odd number of elements in the list but with even numbers you have to somehow digest :) the result.

Regards,
 
Last edited by a moderator:
I will try to explain but I fear that I provide explanations which are horrible. It will be good if you can get hold of following book.

VBA Developer's Handbook [Ken Getz & Mike Gilbert] where they have provided the code for binary search in Arrays which works much the same as VLOOKUP binary search.

Binary search does not evaluate each entry instead it chooses a middle entry to compare.
  • In your data set it comes at middle value of 79. Both 9,10 can be argued to be middle values. But it chooses the first set if the number of elements are even.
  • Now since 79 is bigger it drops the all set after position 9 (includes 79 in the dropped set as it is higher than value being searched for) and checks the first set. Check this by changing value of 79 to a value less than 78. If you choose 78 then it will stop search right there. And it applies to any mid point of the code search.
  • Now out of 8 elements it checks for entry number 4 which is 98 and it is higher than what we are setting out to check. So it again discards second set and concentrates on the first set of 3 values.
  • At position 2, it finds 61 which is a value lower than the value being searched for. So it discards all part before 2.
  • Now here comes the final part, there's only one element to compare. If it gets the equal or lower match then it will return that result otherwise it will return the result of penultimate midpoint.
If we check, binary search evaluated element 9,4,2,3 out of 18 to get results. This makes huge difference when we have very large dataset.

So basically it depends on the midpoint.
  • If it is equal return the result.
  • If it is greater than then check on the left half.
  • If it is less than then check on the right half
Hope this helps somewhat :)
 
Last edited:
Great! Will look forward to it.

I have a dog-eared copy of it (2nd edition). VLOOKUP article that Somendra has pointed to is one of my favorites. I just like the way Colin has written that article. I read it once in a while.
 
So after going through a lot of papers on binary search, it clear that binary search will only works on sorted data and that'w what VLOOKUP help also tells

If range_lookup is either TRUE or is omitted, the values in the first column of table_array must be placed in ascending sort order; otherwise, VLOOKUP might not return the correct value.

and if data is unsorted than the result are undefined.

Below are some snaps of test I performed on random data and their result.

c2.JPG


So what I did, I made the table from A2:B18 and then alligned them horizontally in F5:V6 for a better visual look. Lookup value is 35 in E1 and result in F1 with the formula =VLOOKUP(E1,A2:B18,2)

Now let see another senario.
c3.JPG

So lets analyse this first, Excel Will see the mid value here which is 36, which is greater than 35, so it will go for second pass from A2:A10, here mid value is 13 which is smaller than 35, so in third pass it will check A6:A10, here mid value is 11, which is smaller than 35 so it will go for A8:A10, here the mid value is 56, which is again greater than 35 so, no more turn to go so it will return the lesser number i.e. 11 and the result is G.

So, with these two cases we can see that binary search is working correctly,

Now let see when it fails.

c5.JPG

So going with same logic here it is failing.

So, I think sometime it will work, sometime not that's why the result is undefined.

But this again is a general analysis which don't holds any backup documentation from Microsoft, so can't bet on it.

Regards,
 
Somendra,

It is still following the same logic IMHO for second example.

The array is of 17 elements so midpoint is 9th element
1. 56 is higher so it discards all elements after 56 including 56.
2. Now we have 8 element array so 4th element is checked which is 22. It is smaller so all elements before 22 (including 22) are discarded. Now it will check array {32,47,13,14}.
3. It checks 2nd element which is 47 which is higher so it comes to the balance element which is 32 which is lower than 35 so it settles for it.
 
Thanks for posting this question and the insightful explanations given in the responses, i had never given this much thought before. Here is another simple test which indicates the binary search method is used by lookup functions (when the match_type is not exact):

=MATCH(1,{0;1;0;1;0;1;0})

This returns 4 which is the middle value in the list and just by changing the 1's to be 0,1, or 2 we can get any other value from 1 to 7 as one would expect from the bisection method.

As The article by Colin Legg points out, the binary search algorithm also explains why formulas like:

=LOOKUP(LargeNumber,A:A)

can be used to return the last value in a list even when data is not sorted(!) This is because LargeNumber is chosen to be greater than any value in the list and so the data is bisected successively up to the last value.

[LargeNumber can be any sufficiently large value, to be safe the largest possible number can be used: 2*(2^1023-2^970).]
 
Last edited:
Thanks Lori for LargeNumber part.

I went through the link that Narayan posted. After reading the blog for some time I think somehow the author has based his opinion more on the context of data being sorted while using Range_Lookup and how then Excel returns the result.

While end result would be the same for sorted data with that logic, I'm not sure if Excel really does it Row by Row in case of Range_Lookup. In his first sample where, ABC Company returns match for A and not for ABC Company, Inc. is more related to data being higher value (ABC Company, Inc. > ABC Company).
 
@Lori

Well all that I read at various sources, you are right, all lookup functions of Excel when searching an approximate match uses Binary Search method, and when they perform an Exact Search they shift to Linear Search.

But with all the condition is same, DATA should be sorted in ascending order.

And the large number you gave is the largest allowed positive number, but 9.99999999999999E+307 is the largest number you can type in a cell and can be used in LOOKUP function (=LOOKUP(9.99999999999999E+307,RANGE) to get last entry or MATCH function (=MATCH(9.99999999999999E+307, RANGE) to get last co-ordinal position in the range, and same goes for text you can use a big string like "zzzzz" or Omega (insert through symbol) to get last string or last string position in a Range.

Regards,
 
Hi Somendra - good info on lookup values to use to match the last entry. You highlight that data should be in ascending order with an inexact match, as it says in MS help, but then also raise one (perhaps the only?) case where you can use unsorted data, as Colin describes in the third example from the above link.
 
Spot on, with very large number the binary search slides in only one direction i.e. towards bottom / right side of the array and hence the sorting does not matter. But the terms are pretty clear. You want to retrieve the last record.

But if you are doing range_lookup it is imperative that the data has to be sorted.
 
Indeed. A possible alternative to a formula like:

=LOOKUP(C1,A1:A4,B1:B4)

for finding an approximate match in numeric data that need not be sorted is:

=LOOKUP(1,1/FREQUENCY(-C1,-A1:A4),IF({1},B1:B4))

eg with A1:A4={1,5,8,3} B1:B4={1,2,3,4} C1=4 the first formula returns 1 and the second 4.
 
Hi Lori ,

I think the simple fact is that when Excel expects sorted data , giving a function unsorted data means the output is not to be relied upon.

Try both the formulae you have posted with the following data :

A1:A10 - 85 ; 44 ; 79 ; 83 ; 42 ; 43 ; 15 ; 31 ; 5 ; 54

B1:B10 - 1 ; 2 ; 3 ; 4 ; 5 ; 6 ; 7 ; 8 ; 9 ; 10

C1 - try with the values 53 , 17 , 85.

If I have not understood you , please correct me.

Narayan
 
If it were for the result then we could use following as well:
=LOOKUP(C1,SMALL(A1:A4,ROW(A1:A4)))
which does the sort in memory.

And one more idea which should work as yours would be:
=INDEX(B1:B4,MATCH(LOOKUP(C1,SMALL(A1:A4,ROW(A1:A4))),A1:A4,0))
 
@NARAYANK991, i see how others might have been mislead by what i wrote before. Here's an attempt to give some more details and explain results from your sample data but i fear my explanations are much less enlightening than yours.

First, regarding VLOOKUP with Range_Lookup=TRUE, i agree this makes little sense on unsorted data and generates potentially misleading results (unless the lookup value is very large.) The only point of doing so is to try to uncover the method that is being used. Others already gave fairly detailed explanations to make sense of the seemingly random results. After following these explanations and some further experimentation, this is essentially my understanding of the algorithm that is used:

We start with an array of numbers and a lookup value. Then we select a number at the midpoint of the array. The lookup value is compared with that number and its successor. If the lookup value falls between the number and its successor, the number is returned. If the lookup value does not fall in this interval we eliminate the array of values on the left side if the lookup value is larger than the number or else eliminate the array of values on the right side if the lookup value is larger than the successor. Then we repeat the process from the start. In the end a final value is reached and if the lookup value is greater than this final value the final value is returned, else an error is thrown.

Turning to the sample data, i have attempted to list the supposed paths of numbers taken to reach the result by applying this method.

=LOOKUP(Lookup_Value,{85;44;79;83;42;43;15;31;5;54},{1;2;3;4;5;6;7;8;9;10})

53: 9 [42 -> 31 -> 5]
17: #N/A [42 -> 44 -> 85]
85: 10 [42 -> 31 -> 5 -> 54]

With sorted data the results make more sense! The second formula i posted above effectively automates the sort process.

=LOOKUP(Lookup_Value,{5;15;31;42;43;44;54;79;83;85},{9;7;8;5;6;2;10;3;4;1})

53: 2 [43 -> 79 -> 44]
17: 7 [43 -> 15]
85: 1 [43 -> 79 -> 83 -> 85]

In terms of efficiency, the second formula seems to be roughly linear time and can be used ok on an entire column. Sriva's formula would also be fine on small data sets but sorting the array using SMALL involves doing binary compares on all elements i.e. quadratic time.

By comparison, the first formula would be expected to be much faster (logarithmic). However, it seems to be not fully optimal in some cases. For example if Column A consists entirely of zeros, filling down =LOOKUP(0,A:A) takes many seconds whereas filling =LOOKUP(1,A:A) is near instantaneous. This suggests that when many successive values are equal to the lookup value these are iterated through linearly.
 
Last edited:
Hello Friends,

I don't know much about linear or binary search technical details used in VLOOKUP function. Possibly an alternate on unsorted data,

=VLOOKUP(MAX(IF(C1:C18<=E1,C1:C18)),C1:D18,2,0)

If there is any duplicate, to get the last value (L) in Deb's sample,

=LOOKUP(2,1/(C1:C18=MAX(IF(C1:C18<=E1,C1:C18))),D1:D18)
 
Hi Lori ,

Thank you for the clarifications ; my only point in giving the examples was to show that the formulae posted were not giving the correct results ; if a formula is not giving correct results , then understanding it is , at least in my opinion , not necessary , especially since we are going against what the function expects ; if the function expects sorted data , then giving it unsorted data , and then trying to understand why it gives wrong results ; I leave such matters to researchers.

If a function does something correctly , like it did in the case of the NPV function , that is something I would be interested to explore and understand.

Narayan
 
Back
Top