• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

breaking a dataset on quartiles by sum of values not the number of entries in the each bucket

vbreusov

New Member
I'm struggling to figure out how to break my dataset to four buckets that have approximately equal sum of values each where 4th bucket contains n largest numbers and 1st bucket contains m smallest numbers in the dataset. See example attached. Appreciate your assistance. Best, - Slava
 

Attachments

  • example_vbreusov.xlsx
    13.3 KB · Views: 10
Hmm, I don't think you have the correct concept of Quartile.

You will likely need to set up Solver model to iterate over. But this isn't easy to set up.
There are number of cases where it can throw kink in modeling.

Largest issue that I'd see is... How do you treat ties?
I.E. there being multiple instances where same value appears. What if by adding all of these value in same quartile, it throws off condition of your composition?
EX: If value of 1.01 appears 200 times in data set... how would you allocate each value to different quartile based on sum value?
 
TY for your reply Chihiro. Lets call it a bucket not a Quartile.

"I.E. there being multiple instances where same value appears. What if by adding all of these value in same quartile, it throws off condition of your composition?" - ties are possible, but in reality in my dataset no more than 3-5 ties occur and most of them on low value level so the chance that the condition of my composition will be violated is negligible.

Solver is exactly what I'm trying to get rid off and in any case I'm not sure if it really helps. I can easily build a 4th bucket myself as I know that 9 largest values composite roughly 1/4 of total sum of values, but going down the buckets makes it challenging.

The precise 1/4 decision isn't needed more important that 4th bucket had around 25% of total and had largest values, while 3rd bucket had around 25% of total and had largest values excluding values already used in 4th bucket... and so on.
 
Hmm, can't think of way to do it using single model. Perhaps you can use OpenSolver add-in.

Alternately, it's simple enough to manually simulate, or use multiple solver model to solve this.

In H3.
=SUMIFS($A$3:$A$386,$A$3:$A$386,">="&LARGE($A$3:$A$386,F3))
In H4:
=SUMIFS($A$3:$A$386,$A$3:$A$386,"<"&LARGE($A$3:$A$386,F3),$A$3:$A$386,">="&LARGE($A$3:$A$386,F4))
Copy down to H6.

Then in F6:
=COUNT(A3:A386)

In I2:
=H3>=H6

In I3:
=H3>=G3

Then you just need to increment values in F3 until condition is met in I3. (ex: 3 in this case).

After that, it's just matter of changing values in F4 & F5 to meet desired result.
Personally, I'd add more constraints for H4 & H5 value (i.e. closest value without going over 96.46).

So each value from F3 to F5 are variables. Solution being H3 to H5, and set up constraint for each.

Ex:
For H3. Must be greater than or equal to G3. And minimize absolute variance from G3.
For H4. Must be less than or equal to G4. Minimize absolute variance from G4.
Same as H4 for H5. But also add that H3 > H6.
 
Last edited by a moderator:
Back
Top