• Hi All

    Please note that at the Chandoo.org Forums there is Zero Tolerance to Spam

    Post Spam and you Will Be Deleted as a User

    Hui...

  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

Help with formula to catch a substring within a string

iferror

Member
Hi, i need your help in finding a smart way to solve a problem i'm having.

I have a table (CODE_LIST) with a list of item codes
ITEMCODENORMALIZEDCODE
ITEM-01ITEM01
ITEM-1ITEM01
ITEM 01ITEM01
ITEM 1ITEM01
ITEM-10ITEM10
ITEM-11ITEM11

And a table (MAIN_TABLE) where i have a text field "Description", in this column the user can type in multiple codes. As the filed is free text, i have no way to prevent the user to use creative variation of the code....hence the multiple entries in CODE_LIST[ITEMCODE].

In MAIN_TABLE i created a new column (column AA) whith this array formula
=SUM(IF(ISERROR(FIND(CODE_LIST[ITEMCODE];SUBSTITUTE(UPPER([@[Description]]);"_";"-")));0;1))
to count the codes in Description. (yes, there might be rows that have no code in description)

I use this column to populate the next columns 5 where i have this formula
Code:
=LET(
    itemnum; 0;
    indxs; IF(ISERROR(FIND(CODE_LIST[ITEMCODE];UPPER([@[Description]])));0;1)*(ROW(CODE_LIST[ITEMCODE])-1);
    Filtered; FILTER(indxs;indxs>0;0);
    IF(SUM(Filtered)>itemnum; INDEX(CODE_LIST[NORMALIZEDCODE];SMALL(Filtered;itemnum+1)); ""))

where itemnum =0 for column AB, 1 for column AC and so on.

It all works fine, but this approach has a flaw.
Suppose i have code ITEM-10 in Description. The count return 2 and the list of values in column AB,AC.. returns ITEM01 and ITEM10.

How'd you work around this problem?
 
Last edited:
Just to clarify a bit:

This is what i currently have
DescriptionCountColumn ABColumn ACColumn AD
blablabla Item-021ITEM02
blablabla Item 142ITEM01ITEM14
blablabla Item-02 more blablabla Item-113ITEM01ITEM02ITEM11
bla Item 01 bla Item 02 bla Item 123ITEM01ITEM02ITEM12

And this is what I want to obtain
DescriptionCountColumn ABColumn ACColumn AD
blablabla Item-021ITEM02
blablabla Item 141ITEM14
blablabla Item-02 more blablabla Item-112ITEM02ITEM11
bla Item 01 bla Item 02 bla Item 123ITEM01ITEM02ITEM12
 
Test for the count:
Code:
=LET(b,[@Description],a,IF(ISNUMBER(SEARCH(CODE_LIST[ITEMCODE]&" ",b & " ")),CODE_LIST[NORMALIZEDCODE]),COUNTA(UNIQUE(FILTER(a,a<>FALSE))))
and for the items:
Code:
=LET(itemnum,0,b,[@Description],a,IF(ISNUMBER(SEARCH(CODE_LIST[ITEMCODE]&" ",b & " ")),CODE_LIST[NORMALIZEDCODE]),c,UNIQUE(FILTER(a,a<>FALSE)),IFERROR(INDEX(c,itemnum+1),""))
Note that if any item appears more than once in the text string, only one item is counted/shown.
Also note that I didn't need to have the variable b in these formulae but it does make it easier to convert them later to a (named) lambda formula for ease of use.
 
Last edited:
For named lambdas, for the count, name this formula, say UniqueItemCount in Name Manager:
Code:
=LAMBDA(CodeListTbl,Descriptn,LET(c,TAKE(CodeListTbl,,1),d,TAKE(CodeListTbl,,-1),a,IF(ISNUMBER(SEARCH(c & " ",Descriptn & " ")),d),COUNTA(UNIQUE(FILTER(a,a<>FALSE)))))
then use it in the sheet thus (with hints as to what the arguments are:

1716839472338.png

For the items themselves, if you name another formula, eg. UniqueItems
Code:
=LAMBDA(Descriptn,CodeListTbl,itemnum,LET(a,IF(ISNUMBER(SEARCH(TAKE(CodeListTbl,,1)&" ",Descriptn & " ")),TAKE(CodeListTbl,,-1)),c,UNIQUE(FILTER(a,a<>FALSE)),IFERROR(INDEX(c,itemnum+1),"")))
you'll be able to use it thus in the sheet:

1716839737211.png
 
Brilliant, thanks a lot.
The count has a bug, when there's no code it returns 1....but no big deal as i'm only interested in count being >0 so my initial flaw doesn really matter.
The main issue is solved! Thanks again
 
There's another minor flaw... but i know that whenever you let users type in, you must be prepared to get exceptions. When the user does not leave a space after the code, the formula fails ==> Item01-Item02. The new regex function could help here
 
Quick fix for the count:
Code:
=LAMBDA(CodeListTbl,Descriptn,LET(c,TAKE(CodeListTbl,,1),d,TAKE(CodeListTbl,,-1),a,IF(ISNUMBER(SEARCH(c & " ",Descriptn & " ")),d),e,UNIQUE(FILTER(a,a<>FALSE,0)),IF(INDEX(e,1)=0,0,COUNTA(e))))

When the user does not leave a space
I'll have a think on that one.
The new regex function could help here
For yourself, maybe, but at the moment most others won't have it on their version (I don't and I'm on a Preview channel). Besides, now you'll have two problems (Jamie Zawinski).

There may be another way using one of the MAP/SCAN/REDUCE functions with the code list table sorted by decreasing length of ITEMCODE to avoid the problem you had before of it seeing Item1 even though it was part of Item12. I'd aim to find Item12 first and remove it from future searches. I've not thought about it more than that - I won't have time today but I'll mull it over.
 
I'll have a think on that one.
This is the modified version of the lambda, just to take into account some exceptions i found in the data

Code:
=LAMBDA(
    Descriptn,
    CodeListTbl,
    itemnum,
    LET
        (
        a,IF(IF(ISNUMBER(SEARCH(TAKE(CodeListTbl,,1)&"-",Descriptn & "-")),1,0)+
            IF(ISNUMBER(SEARCH(TAKE(CodeListTbl,,1)&";",Descriptn & ";")),1,0)+
            IF(ISNUMBER(SEARCH(TAKE(CodeListTbl,,1)&".",Descriptn & ".")),1,0)+
            IF(ISNUMBER(SEARCH(TAKE(CodeListTbl,,1)&" ",Descriptn & " ")),1,0)>0,TAKE(CodeListTbl,,-1)),
        c,UNIQUE(FILTER(a,a<>FALSE)),
        IFERROR(INDEX(c,itemnum+1),"")
        )
    )

For yourself, maybe, but at the moment most others won't have it on their version (I don't and I'm on a Preview channel). Besides, now you'll have two problems (Jamie Zawinski).
Me neither....to be true
 
Last edited:
Test this one for the UniqueItems named lambda:
Code:
=LAMBDA(Descriptn,CodeListTbl,itemNum,LET(a,SORTBY(CodeListTbl,LEN(TAKE(CodeListTbl,,1)),-1),b,VSTACK(Descriptn,SCAN(UPPER(Descriptn),UPPER(TAKE(a,,1)),LAMBDA(a,b,SUBSTITUTE(a,b,"")))),c,UNIQUE(FILTER(TAKE(a,,-1),DROP(b,1)<>DROP(b,-1))),IFERROR(INDEX(c,itemNum+1),"")))
It crashed my Excel when I tried to put it within a table, but it's OK outside a table. Maybe I have to split it into smaller named lambdas.
It shouldn't matter about spaces now.
This is what it does:
Takes the code list table and sorts it by the length of the first column (ITEMCODE), longest at the top.
This is so I can use SCAN which searches for each ITEMCODE (starting from the top, the longest) and if it finds it in the description, removes it from the description, then moves on to the next ITEMCODE. By removing any string found, when it later searches for shorter ITEMCODES it won't find them as part of a previously found longer ITEMCODE.

We need to adapt the COUNT but I'm out of time for now.

I've not used SCAN much before, so the way I've used it is very probably sub-optimal!
 
If you want to do it as an "exercise" to train SCAN, please go on
anyway, i'm happy with the solution i obtained based on the initial hint from you.

Moreover...consider that the main_table is 50k rows and the item list contains 200 rows, the function as it is already takes some time to refresh, so i am afraid that making the formula more complex can seriously reduce performances.

Thank you again
 
Just an afterthought: what if i add a column in my CODE_LIST where the lenght of the itemcode is calculated, and then sort the table by decreasing itemcode lenght? This way we could use min(index) to get the longest itemcode that matches.
Another option could be to use a recursive lambda:
UniqueItems =lambda(description, listofitems; IF item is found THEN UniqueItems (substitute(description,item,""), drop(item,itemlist) ELSE "")
 
If you want to do it as an "exercise" to train SCAN, please go on
That's gracious of you, but its only partly that. It was mainly because:
When the user does not leave a space after the code, the formula fails ==> Item01-Item02
The aim of the formula with SCAN is for any of the following not to matter: spaces, lack of or not, any punctuation separating/not separating Items. It is considerably shorter than your modified version in msg#8 and probably faster to calculate, although I haven't compared.

Re:
what if i add a column in my CODE_LIST where the lenght of the itemcode is calculated, and then sort the table by decreasing itemcode lenght?
Yes, this would mean you don't need to do it in the formula, but you'd have to remember manually to sort the table should you change the entries in it, also the formula may have to change a bit since that table would have 3 columns and the likes of TAKE in the formula might need to change (maybe not if you put the length formula in the middle column).

use a recursive lambda
I feel that would likely make for long calculation times depending on the depth of recursion. I wouldn't like to compose such a recursive lambda! SCAN does the looping anyway.

I doubt that calculation time is really a big problem for you since you won't be recalculating every 5 minutes. If it is, then consider Power Query which is designed for larger data sets, is more powerful than formulae and will only recalculate when you want it to.
 
That's gracious of you
Hahahahaha.... i meant to say that i'm ok with all the help received so far :)


Yes, this would mean you don't need to do it in the formula, but you'd have to remember manually to sort the table should you change the entries in it, also the formula may have to change a bit since that table would have 3 columns and the likes of TAKE in the formula might need to change (maybe not if you put the length formula in the middle column).
Sure, but the list is not going to change frequently, but i see your point.


I doubt that calculation time is really a big problem for you since you won't be recalculating every 5 minutes. If it is, then consider Power Query which is designed for larger data sets, is more powerful than formulae and will only recalculate when you want it to.
I probably should.... I'm already using powerquery to retrieve the 50k rows from a sql server.
 
Back
Top