Sorting a list of comma-separated lists

Peter Bartholomew · Aug 9, 2019

What ideas do you have on reducing the content of a set of comma-separated lists to a single column range (sorted and without duplicates if you choose).
I am excluding VBA, PQ and manual operations such as 'Text to Columns'.

I have included a solution to explain the situation. My solution is lightly protected so it should not be immediately obvious by inspection. I am interested to see what solutions are out there.

vletm · Aug 10, 2019

Peter Bartholomew
The first idea was
... that those sets has done,
because there are same first name students.
That means, there are no duplicate students.
I would use VBA to split those rows and after that sorting it.
I would always use also last names with first names with my list to avoid mistake to loose any student from the final list.

AliGW · Aug 10, 2019

I am excluding VBA, PQ and manual operations such as 'Text to Columns'.

Why?

Is this a query or just a challenge?

PS Your student list has leading spaces - you might wish to get rid of those.

GraH - Guido · Aug 10, 2019

I don't have all functions on board, and since I'm lazy would always use PQ for this, @Peter Bartholomew.
perhaps this could work?
=index(FILTERXML("<a>"&SUBSTITUTE(textjoin(",";true;Students);",";"")&"</a>";"//b");row(a1)

Peter Bartholomew · Aug 10, 2019

AliGW said:
Why?

Is this a query or just a challenge?

PS Your student list has leading spaces - you might wish to get rid of those.

It was part of a query I answered on Microsoft TechCommunity. The question was how to build a list of dates (row numbers) associated with each name. It turned out that getting the unique list of names in the first place was the hardest part. I came up with a solution but it involved some lateral thinking. Before embarking on a session of egotistical self-congratulation, I thought I would post it here as a challenge in case someone else comes up with a simpler approach. You already get kudos for being more observant than me, I returned duplicate names before realising that the data needed cleaning; an embedded TRIM sorted the problem.

As an aside, how are you getting on with dynamic arrays? I have been using them continuously (spreading even more confusion than is my norm).
I find that they support a very different thought process in which the objects used to construct solutions are entire arrays, occasionally broken down into a set of rows or columns, but never the individual cell. "... and copy down" has virtually disappeared from my vocabulary.

GraH - Guido · Aug 10, 2019

Went along with the challenge:

since I don't have textjoin in my armory, I concatenated the values in the total table row
- -> name manager: vAllStudents
this variable was altered in a list of names with the formula:
- =FILTERXML("<a>"&SUBSTITUTE(vAllStudents;",";"")&"</a>";"//b")
- -> name manager: ListStudents
Then made the list using (D6 down):
- =IF(COUNTA(ListStudents)<ROW(D1);"";INDEX(ListStudents;ROW(D1)))
Made a second list called "ListExtracted" using:
- =OFFSET(Sheet1!$D$6;;;COUNTA(ListStudents))
Finally in F filtered/sorted the list with this array formula:
- =IFERROR(INDEX(ListExtracted;MATCH(0;COUNTIF(ListExtracted;"<"&ListExtracted)-SUM(COUNTIF(ListExtracted;D$5:D5));0));"")

Would have liked to get there with only using "ListStudents", but that got me nowhere.

EDIT: just saw, I'm missing a bunch of names....

Peter Bartholomew · Aug 10, 2019

vletm said:
I would always use also last names with first names with my list to avoid mistake to loose any student from the final list.

You are perfectly correct of course. For the purpose of this challenge assume that each name uniquely defines an individual.

Peter Bartholomew · Aug 10, 2019

@GraH - Guido
That is quite an achievement! Especially handicapped by a dated toolset (no TEXTJOIN!).
I did think of FILTERXML but I have never used it and the help page was about as much use as a chocolate teapot. I understand the xml that you have built well enough but what is the "//b" that appears as the xpath parameter?

I think that once you had got to the point of a list, duplicates and all, I would consider the problem solved. The rest, writing over-sized ranges of formulas and counting occurrences are just sordid workarounds compensating for the lack of a FILTER array function, just annoying. There is not even any point in upgrading to Office 2019; maybe revert to Google Sheets until dynamic arrays appear in a static Office version!

AliGW · Aug 10, 2019

As an aside, how are you getting on with dynamic arrays?

I'm not yet, but once the number of people using Office 365 reaches tipping point (I've seen a marked increase over the past month or two), then I'll start playing.

However, I am getting on really well with PQ, and that's what I'd have used for this. It is my go-to tool of choice at the moment.

GraH - Guido · Aug 10, 2019

Peter Bartholomew said:
I did think of FILTERXML but I have never used it and the help page was about as much use as a chocolate teapot. I understand the xml that you have built well enough but what is the "//b" that appears as the xpath parameter?

I don't have a clue! I simply remember an earlier challenge where it was used and thought it could be helpful in solving this one. Thus replicated without giving it a second thought.
The second part (sorting/filtering) came from Mike Girvin, I also remembered there was video on it. I have made a mistake somewhere in my file, because in his sample file with your list it is simply working. (Have to imagine some swearwords to understand my tone...)
So less impressive, hey, now, as I'm just stealing away solutions everywhere. Told ya, I'm the lazy kind

EDIT: found the mistake. I was counting from the wrong list. Proofs your point of cell referencing, right?

So the correct formula is:
=IFERROR(INDEX(ListExtracted;MATCH(0;COUNTIF(ListExtracted;"<"&ListExtracted)-SUM(COUNTIF(ListExtracted;F$1:F1));0));"")

Peter Bartholomew · Aug 11, 2019

Hi Guido
Especially chosen for you!
All you could want to know about Belgian waffles and more.

The attached reads some xml from the w3schools.com website using WEBSERVICE() and then extracts data using the xPath parameter in FILTERXML().
I used CSE arrays but the behaviour is so much nicer using dynamic arrays. Apparently the double-slash as in "//food/price" means any instance of the node irrespective of where it sits in the tree.

Knowing your liking for PQ, I also extracted the information from the string to a table using parse xml.
This may be pretty routine and even boring for others but it was all new to be.

Peter Bartholomew · Aug 12, 2019

Since I see no queue waiting to offer fresh insight on the problem, I will outline the solution I came up with.

I started by trying to separate the comma separated text from each cell but then changed my mind and used TEXTJOIN to form a single string before processing it. At this stage, I had formed the text string
Sam, Tim, Max, Ronny, Rhea, Sam, Tim, Joan, Peter, Eric, Ronny, Sam, Julie, Max
It seems that this was also used by others, Guido included.

The next idea was to convert this list to the format of an array constant. I set the calculation up as a named formula 'string' that refers to
= "{""" & SUBSTITUTE( TEXTJOIN( ", ", 1, Students ), ",", """;""" ) & """}"
The multiple double quotes are something of a nightmare but that's the way it is. If one debugs the expression by using the formula
= string
the result is
{"Sam";" Tim";" Max";" Ronny";" Rhea";" Sam";" Tim";" Joan";" Peter";" Eric";" Ronny";" Sam";" Julie";" Max"}

I then defined a further named formula 'split' that referred to the Macro2 function EVALUATE
= EVALUATE(string)

The final steps of trimming spaces, extracting a list of distinct values, and sorting were achieved with the named formula 'distinct'
= SORT( UNIQUE( TRIM(split) ) )

The workbook shows the output for
= split
and
= distinct

I have used a CSE array for 'split' rather than a dynamic array to reduce Excel version dependence, though not sufficiently for Guido as he doesn't have TEXTJOIN either

.

I seems to work well enough but it is far from a light-weight solution and I wonder where I have missed a simpler idea somewhere along the line.

_____________________________________________
p.s. The workbook is as it was before with the exception that I have added some annotation and run a macro to make the defined names visible

Code:

Sub Unhide()
Dim n As Name
For Each n In ActiveWorkbook.Names
    If Left(n.Name, 1) = "_" Then
        n.Visible = False
    Else
        n.Visible = True
    End If
Next
End Sub

XOR LX · Aug 13, 2019

I devised a solution for this 5 years ago for Excel 2010 using native functions alone:

List of unique entries from column of space-separated strings

Regards

Peter Bartholomew · Aug 14, 2019

@XOR LX
Wow, that was quite a tour de force! I suspect that the existence of modern functions such as TEXTJOIN and now FILTER and SEQENCE would simplify the implementation significantly. Of course, backward compatibility would then be a major problem. My formula
= EVALUATE(string)
has a measure of elegance but, even so, I was not very happy with using an undocumented feature left over from a past era. I had forgotten your idea of padding with spaces and then chopping the resulting string at regular intervals.

An idea I had toyed with was that of using SUBSTITUTE to replace each occurrence of a space using, an array for the instance number, and then using SEARCH to return the corresponding array of locations of the space character.

XOR LX · Aug 14, 2019

Peter Bartholomew said:
I suspect that the existence of modern functions such as TEXTJOIN and now FILTER and SEQENCE would simplify the implementation significantly.

"Significantly" is something of an understatement!

Cheers

p45cal · Sep 2, 2019

I don't have FILTER, UNIQUE, SORT, SEQUENCE, but I do have TEXTJOIN.
In any cell in row 6:

Code:

=MID(SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5),FIND("¬",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5))+1,FIND(",",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5),FIND("¬",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5)))-FIND("¬",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5))-1)

copied down as far as needed will convert them into a non unique/distinct list. I realise I could have used -ROW($A$5) instead of -5 throughout to make it easy to put on any row.

Peter Bartholomew · Sep 2, 2019

@p45cal
Thank you for your solution. I can confirm that it works!
It must have required patience and fortitude (as well as a few applications of cut and paste?) to put the solution together.

In order to analyse your formula, I applied names to elements of the formula in order to reduce it to a sequence of steps I could follow. I sincerely hope you find the process of interest rather than simply objecting to it as interference!

The first repeating element that was visible in your formula was the concatenated list of students, 'concatenation', that refers to
= SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ","")

The next was a row index that I called 'k'
= ROW() - ROW(header)
where ROW(header) = 5

Next I created the character pointer 'p' which gives the first letter of each student name
= 1 + FIND("¬", SUBSTITUTE(concatenation, ",", "¬", k ) )

I tried two formulas for the pointer 'q' that identifies the trailing comma following each name. Your version was
= FIND( ",", SUBSTITUTE(concatenation, ",", "¬", k ), p )
which involved nesting the formula 'p' to provide a starting point for the search. The other approach I tried was to duplicate the formula for 'p' but seek the 'k+1'st occurrence of ',' using
= FIND("¬", SUBSTITUTE(concatenation,",","¬", k+1 ) )
Both worked.

The final worksheet formula was
= MID( concatenation, p, q-p )
I dropped the replacement of a ',' by '¬' because it makes no difference,

Altogether, a good formula that meets the spec without requiring EVALUATE. Also, your original did not require as many words to explain it

Just to 'gild the lily' I introduced a variant that started with a name k′ (k-prime) that refers to
= SEQUENCE(14)
which leads to the spilt array solution
= UNIQUE( MID( concatenation, p′, q′-p′ ) )

John Jairo V · Sep 11, 2019

Hi, to all!

Maybe same of another answers... but this is my approach. Blessings!

astrodon · Sep 22, 2019

p45cal said:
I don't have FILTER, UNIQUE, SORT, SEQUENCE, but I do have TEXTJOIN.
In any cell in row 6:

Code:

=MID(SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5),FIND("¬",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5))+1,FIND(",",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5),FIND("¬",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5)))-FIND("¬",SUBSTITUTE(SUBSTITUTE("," & TEXTJOIN(",",TRUE,Students) & ","," ",""),",","¬",ROW()-5))-1)

copied down as far as needed will convert them into a non unique/distinct list. I realise I could have used -ROW($A$5) instead of -5 throughout to make it easy to put on any row.

I know the constraints of the challenge excluded using VB. But formulas such as the one above make my eyes cross. While it may work as expected, I would think it a real bitch to troubleshoot. A solid candidate for VB just for the commenting aspect alone (not familiar with power query at the time of this writing). BTW, this is not intended to be a criticism.

doco

Peter Bartholomew · Sep 23, 2019

@astrodon

My modification at #17 is less prone to creating visual disturbance but is more alien to a traditional spreadsheet user.

If you or others have elegant ideas for a VBA (or PQ) solution, I am happy to widen the discussion. I guess regular expressions would be overkill, so perhaps 'Split'. Maybe scripting dictionary objects to get the unique list?

Marc L · Sep 23, 2019

As now it's VBA open bar, not one but two demonstrations according to the initial attachment

but with the worksheet unprotected, result list starting at H6 cell …

First one as pure VBA :

Code:

Sub Demo1()
         Dim V, W, R&, N$(1 To 99, 0)
         [H6].CurrentRegion.Clear
    For Each V In Range("B6", [B5].End(xlDown)).Value2
        For Each W In Split(V, ", ")
            If IsError(Application.Match(W, N, 0)) Then R = R + 1: N(R, 0) = W
        Next
    Next
    With [H6].Resize(R)
        .Value2 = N
        .Sort [H6], xlAscending, Header:=xlNo
    End With
End Sub

Second using a Windows object :

Code:

Sub Demo2()
         Dim V, W
         [H6].CurrentRegion.Clear
    With CreateObject("System.Collections.ArrayList")
        For Each V In Range("B6", [B5].End(xlDown)).Value2
            For Each W In Split(V, ", ")
                If Not .Contains(W) Then .Add W
            Next
        Next
           .Sort
            [H6].Resize(.Count).Value2 = Application.Transpose(.ToArray)
           .Clear
    End With
End Sub

Peter Bartholomew · Sep 29, 2019

@Marc L
Nice addition to the discussion. In particular, I liked the "System.Collections.ArrayList".
Out of curiosity though, why do you choose to use direct cell references when there is already a defined name that identifies the input data? Is it just a case of old habits dying hard or do you have positive reasons for favouring the approach?

Marc L · Sep 29, 2019

In fact just 'cause I did not notice it ! (Yes I know if I took a better glance to the previous formula …)

So as a bonus another pure VBA demonstration using a Collection and this defined name :

Code:

Sub Demo3()
             Dim V, W, N&, S$()
             [H6].CurrentRegion.Clear
    With New Collection
        For Each V In [Students].Value2
            For Each W In Split(V, ", ")
                For N = 1 To .Count
                    Select Case .Item(N)
                           Case W:          N = 0:  Exit For
                           Case Is > W:    .Add W, , N:  N = 0:  Exit For
                    End Select
                Next
                    If N Then .Add W
            Next
        Next
            ReDim S(1 To .Count, 0)
            For N = 1 To .Count:  S(N, 0) = .Item(N):  Next
             [H6].Resize(.Count).Value2 = S
    End With
End Sub

salim hasan · Sep 30, 2019

Try my Macro

Code:

Option Explicit
Sub Extract()
Rem====>>Created by Salim Hasan on 30/9/2019
Dim i%, m%, D As Object, col As Object
Dim ky, st
Dim k%: k = 1
Dim LRB%: LRB = Cells(Rows.Count, 2).End(3).Row

Union(Range("D6", Range("D5").End(4)), _
Range("F6", Range("F5").End(4))).ClearContents

Set D = CreateObject("Scripting.Dictionary")
Set col = CreateObject("System.Collections.ArrayList")
 For i = 6 To LRB
   For m = 0 To UBound(Split(Range("B" & i), ", "))
    st = Split(Range("B" & i), ", ")(m)
     D.Add k, st
         If Not col.Contains(st) Then
             col.Add st
         End If
     k = k + 1
    Next m
 Next i
 k = 6
 For Each ky In D.keys
   Cells(k, "D") = D.Item(ky): k = k + 1
 Next ky
 col.Sort
 Range("F6").Resize(col.Count) = _
 Application.Transpose(col.ToArray)
D.RemoveAll: Set D = Nothing
col.Clear: Set col = Nothing
End Sub

Sorting a list of comma-separated lists

Well-Known Member

Attachments

Excel Ninja

Active Member

Well-Known Member

Well-Known Member

Well-Known Member

Attachments

Well-Known Member

Well-Known Member

Active Member

Well-Known Member

Attachments

Well-Known Member

Attachments

Well-Known Member

Attachments

Active Member

Well-Known Member

Active Member

Well-Known Member

Well-Known Member

Attachments

Well-Known Member

Attachments

New Member

Well-Known Member

Excel Ninja

Well-Known Member

Excel Ninja

Member

Attachments