
  • When starting a new post, to receive a quicker and more targeted answer, Please include a sample file in the initial post.

VBA - extract unique words from range and count - optimization?

cacos

Member
Hi community!

Some time ago I came across this code (from Ozgrid, I believe) and have been working on different approaches to make it faster (when there's a lot of text, it can take some time).

What it does is extract all unique words from a range, and count the number of times that word appears.

Hopefully someone out there can think of a better approach. Thank you!

Code:
Sub CountWords()
    Dim rngData As Range
    Dim rngCell As Range
    Dim colWords As Collection
    Dim vntWord As Variant
    Dim Counter As Integer
    Dim PctDone As Single
    Dim TotalCells As Integer

    ' Adding a duplicate key to the collection raises an error,
    ' which Resume Next swallows silently
    On Error Resume Next

    With Sheet5
        Set colWords = New Collection
        Set rngData = .Range("A2:A" & .Range("A65536").End(xlUp).Row)
        TotalCells = rngData.Cells.Count

        For Each rngCell In rngData.Cells
            For Each vntWord In Split(Replace(Replace(Replace(rngCell.Value, """", ""), "]", ""), "[", ""), " ")
                ' Fails silently if the word is already a key,
                ' so colWords(vntWord) keeps its first index
                colWords.Add colWords.Count + 1, vntWord
                With Sheet1.Cells(1 + colWords(vntWord), 3)
                    .Value = vntWord
                    .Offset(0, 1) = .Offset(0, 1) + 1
                End With
            Next
        Next
    End With
End Sub
 
a few small changes:
Code:
Sub blah()
Dim results(), rngData As Range, rngCell As Range, colWords As Collection, vntWord As Variant
Dim Counter As Integer, PctDone As Single, TotalCells As Integer, i As Long
ReDim results(1 To 2, 0 To 0)
On Error Resume Next
With Sheet5
  Set colWords = New Collection
  Set rngData = .Range("A2:A" & .Range("A65536").End(xlUp).Row)
  TotalCells = rngData.Cells.Count
  For Each rngCell In rngData.Cells
    For Each vntWord In Split(Replace(Replace(Replace(rngCell.Value, """", ""), "]", ""), "[", ""), " ")
      colWords.Add colWords.Count + 1, vntWord
      i = colWords(vntWord)
      If i > UBound(results, 2) Then
        ReDim Preserve results(1 To 2, 0 To i)
        results(1, i) = vntWord
        results(2, i) = 1
      Else
        results(2, i) = results(2, i) + 1
      End If
    Next
  Next
End With
Sheet1.Range("C1").Resize(UBound(results, 2) + 1, 2) = Application.Transpose(results)
End Sub
This is about 40 times faster than above, 20 times faster if Application.screenupdating=False is used on the original code.
 
Using a dictionary object takes 1.5 times as long as the solution in my last message, but the coding is more straightforward:
Code:
Sub blah3()
starttime = Timer
Set dict = CreateObject("Scripting.Dictionary")
dict.CompareMode = vbTextCompare
Dim rngData As Range, rngCell As Range, vntWord As Variant
With Sheet5
  Set rngData = .Range("A2:A" & .Range("A65536").End(xlUp).Row)
  TotalCells = rngData.Cells.Count
  For Each rngCell In rngData.Cells
    For Each vntWord In Split(Replace(Replace(Replace(rngCell.Value, """", ""), "]", ""), "[", ""), " ")
      If dict.exists(vntWord) Then
      dict.Item(vntWord) = dict.Item(vntWord) + 1
      Else
      dict.Add vntWord, 1
      End If
    Next
  Next
End With
Sheet1.Range("G2").Resize(dict.Count, 2) = Application.Transpose(Array(dict.keys, dict.items))
Debug.Print Timer - starttime
End Sub
 
4 rows of text copied 256 times, containing 59 unique words.
It took about 5.5 secs with the original code.
2.76 secs with the original code and screenupdating turned off.
0.15 secs for macro blah above (screenupdating on/off made no significant difference).
0.23 secs with blah3 (dictionary).

Excel 2003

You should remove the 2 timer lines from blah3.

What time improvement are you getting?
 
@p45cal
Hi!
Interesting. Would you mind testing both procedures with 10K rows and 100K rows? Just to check how ratios evolve. Thank you.
Regards!
 
Times in Seconds
16.4k rows:
original (result sheet active) 80.45
original (source sheet active) 45.48 = 1.8 times as fast
collection 1.30 = 62 times as fast
dictionary 2.24 = 36 times as fast

65.5k rows:
original (result sheet active) 322.03
original (source sheet active) 179.45 = 1.8 times as fast
collection 4.95 = 65 times as fast
dictionary 8.53 = 38 times as fast

(only 65k rows in excel 2003)

re:
What time improvement are you getting?
Sorry SirJB7, I thought your message was from the OP.
 
@p45cal
Hi!
Sorry I didn't read 2003 :( ... well, at least I didn't ask for a million rows! :p
Thank you very much. And new figures confirmed initial ratios. Very good job!
Regards!
 
Wow! What an odd way to use the Dictionary object!!
Sorry! 'cause I forgot to mention that fewer than 10 code lines are necessary
to extract unique words and count the number of times each word appears:
Code:
Sub DemoDictionary()
    Set Dict = CreateObject("Scripting.Dictionary")
          AR = Sheet5.Range("A2", Sheet5.Cells(Sheet5.Rows.Count, 1).End(xlUp)).Value
    For R& = 1 To UBound(AR):  Dict.Item(AR(R, 1)) = Dict.Item(AR(R, 1)) + 1:  Next
    Sheet1.[C2].Resize(Dict.Count, 2).Value = Application.Transpose(Array(Dict.Keys, Dict.Items))
    Set Dict = Nothing
End Sub
Pascal, if you could test it with your protocol, thanks …
My choice is the Dictionary object for its easy coding and more properties & methods …
In this demo the Dictionary is declared by Late Binding.
The code should be faster by declaring it by Early Binding
with the reference Microsoft Scripting Runtime
and this line at the beginning of the procedure: Dim Dict As New Dictionary
replacing the line Set Dict = …
Also, Text Compare mode is slower than Binary Compare mode!
A Dictionary object cannot have duplicate keys, so no need to use functions like Replace, Split, …
Just have a look at the Dictionary Item property's help, especially its remark …
A very interesting article: Using the Dictionary Class in VBA
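A minimal sketch (hypothetical sub name, not from the thread) of the early-binding declaration described above; it assumes the Microsoft Scripting Runtime reference is ticked under Tools > References:

Code:
' Requires: Tools > References > Microsoft Scripting Runtime
Sub EarlyBindingSketch()
    ' Early binding: resolved at compile time, with IntelliSense,
    ' and slightly faster member calls than CreateObject late binding
    Dim Dict As New Scripting.Dictionary
    Dict.CompareMode = BinaryCompare    ' the faster compare mode mentioned above
    Dict.Add "word", 1
    Debug.Print Dict.Exists("word")     ' True
End Sub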
 
Pascal if you could test it with your protocol, thanks …
This was indeed much faster taking only 0.45 seconds to process 65k rows!
However, there were a few snags.
My choice is Dictionary object for its easy coding, more properties & methods …
Agreed.
In this demo the Dictionary is declared by Late Binding.

The code should be faster by declaring it by Early Binding
with the reference Microsoft Scripting Runtime
and this line at the beginning of the procedure : Dim Dict As New Dictionary
replacing line Set Dict =
I didn't test this since this is only done once in the code, so I suspect time savings could only be negligible.

Also Text Compare mode is slower than Binary Compare mode!
Tests show a 1.9 times increase in speed, however, by using vbTextCompare I was duplicating/respecting the OP's results, which differentiate between upper & lower case.

Dictionary object cannot have duplicates, so no need to use functions like Replace, Split, …
This is the big one; your code looks at the entire contents of each cell, whereas the OP wants to count occurrences of words within each cell too. I haven't bothered trying to make this part of the OP's code speedier.

The techniques you used to speed things up:
Putting the source data into an array rather than referring to each cell in the sheet gave rise to a 1.9 times faster result.
Had I stuck with vbBinaryCompare, that would have given rise to a 1.24 times faster result, but not what the OP had.
Using the implicit method of adding to the dictionary, rather than using If Exists Then ... Else, gave rise to a 1.39 times increase in speed.
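The "implicit method of adding" can be sketched as follows (hypothetical sub name; a minimal illustration, not code from the thread):

Code:
Sub AddStylesSketch()
    Dim Dict As Object
    Set Dict = CreateObject("Scripting.Dictionary")

    ' Explicit: test first, then add or increment
    If Dict.Exists("word") Then
        Dict.Item("word") = Dict.Item("word") + 1
    Else
        Dict.Add "word", 1
    End If

    ' Implicit: reading a missing key creates it with the default Empty,
    ' and Empty + 1 = 1, so one line does both jobs
    Dict.Item("word2") = Dict.Item("word2") + 1

    Debug.Print Dict("word"), Dict("word2")    ' 1  1
End Sub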

But, contrary to my expectations, even using the dictionary with all the improvements except early binding and vbBinaryCompare, my original blah offering in msg#4 above, which uses a collection, is still 1.17 times faster than using the dictionary.

(Should I have been churlish and started this message with "Wow! What a way to use the Dictionary object to get the wrong results"?)
 
Thanks Pascal!
Sorry, but you're right, I didn't understand the OP like that; I thought it was one word per cell.
And without any sample file, as we asked for …
I recently read that the Collection object could be faster than the Dictionary object
when there are more unique values than duplicates …
 
@Marc L: you are right, and sorry if it's already too late. I'm attaching a sample of how the comments are set up. (It's basically 14 comments repeated many times for example purposes).
 

Attachments

  • text example.xlsx (423.2 KB)
So, when it is always the same comments, it is faster to count whole comments (one comment per cell) than words!
First, a non-Dictionary way using the MATCH worksheet function:
Code:
Sub MatchComments()
    Sheet2.UsedRange.Clear
    ReDim SC&(1 To 1), ST$(1 To 1)
      AR = Sheet1.Range("I2", Sheet1.[I2].End(xlDown)).Value
    SC(1) = 1
    ST(1) = Left$(AR(1, 1), 255)
 
    For R& = 2 To UBound(AR)
        T$ = Left$(AR(R, 1), 255)
         P = Application.Match(T, ST, 0)
 
        If IsError(P) Then
          P = UBound(SC) + 1
          ReDim Preserve SC(1 To P), ST(1 To P)
          ST(P) = T
        End If
 
        SC(P) = SC(P) + 1
    Next
 
    Sheet2.[C2].Resize(UBound(SC), 2).Value = Application.Transpose(Array(SC, ST))
    Erase AR, SC, ST
    Sheet2.Activate
End Sub
1.016 seconds with the 60K rows of the text example workbook …
Tests were run just after boot time, no data in cache!
The concern is that the MATCH function does not work with strings longer than 255 characters
(checked on Excel versions 2003 & 2007) …
Not a problem with this sample file, which has only one comment over the limit,
but it could be an issue in other cases …
So, to break this limit and to speed up the process, a Dictionary way:
Code:
Sub DictionaryComments()
    Dim Dict As New Dictionary
'    Set Dict = CreateObject("Scripting.Dictionary")
     Sheet2.UsedRange.Clear
    AR = Sheet1.Range("I2", Sheet1.[I2].End(xlDown)).Value
    For R& = 1 To UBound(AR):  Dict.Item(AR(R, 1)) = Dict.Item(AR(R, 1)) + 1:  Next
    Application.ScreenUpdating = False
    Sheet2.Activate
    [C2].Resize(Dict.Count).Value = Application.Transpose(Dict.Items)
    For R = 1 To Dict.Count:  Cells(R + 1, 4).Value = Dict.Keys(R - 1):  Next
    Application.ScreenUpdating = True
    Dict.RemoveAll
    Set Dict = Nothing
End Sub
0.143 seconds!
A little concern: the TRANSPOSE worksheet function also does not work
with strings longer than 255 characters, fixed here with a For … Next loop …
But it should be faster to use a control number (hash value) to compare comments:
Code:
Sub DictionaryHashComments()
    Dim Dict As New Dictionary
'    Set Dict = CreateObject("Scripting.Dictionary")
     Sheet2.UsedRange.Clear
    ReDim CT&(0)
    AR = Sheet1.Range("I2", Sheet1.[I2].End(xlDown)).Value
 
    For R& = 1 To UBound(AR)
        HV = Dict.HashVal(AR(R, 1))
 
        If Not Dict.Exists(HV) Then
            ReDim Preserve CT(UBound(CT) + 1)
            CT(UBound(CT)) = R
        End If
 
        Dict.Item(HV) = Dict.Item(HV) + 1
    Next
 
    Application.ScreenUpdating = False
    Sheet2.Activate
    [C2].Resize(Dict.Count).Value = Application.Transpose(Dict.Items)
    For R = 1 To UBound(CT):  Cells(R + 1, 4).Value = AR(CT(R), 1):  Next
    Application.ScreenUpdating = True
    Dict.RemoveAll
    Set Dict = Nothing
End Sub
0.125 seconds!
The Dictionary key is no longer the comment itself but its hash value:
fewer characters to compare, so a faster process indeed!
We just need the CT array variable to store each comment's index in the AR array …
Like it?
 
Thanks shrivallabha.
I just read that, but I never saw it in my VBA coding experience …
I rather prefer using Array & worksheet functions than Collection for simple needs,
and I jump directly to Dictionary for more specific ones …
 
I tried collection a couple of times [I think I had read about it on J-Walk's blog] and thenceforth I have used Dictionary. The feature that I liked in dictionary is that you can query a key using .Exists, whereas collection doesn't have it. And a dictionary is more like an associative array.
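The .Exists difference can be sketched like this (hypothetical sub name; the Collection has no membership test, so the usual workaround is to trap the error a missing key raises):

Code:
Sub KeyTestSketch()
    Dim Dict As Object, Col As New Collection
    Set Dict = CreateObject("Scripting.Dictionary")
    Dict.Add "word", 1
    Col.Add 1, "word"

    ' Dictionary: direct query
    Debug.Print Dict.Exists("word")    ' True

    ' Collection: no Exists method; must trap the error instead
    Dim v As Variant, found As Boolean
    On Error Resume Next
    v = Col("word")
    found = (Err.Number = 0)
    On Error GoTo 0
    Debug.Print found                  ' True
End Sub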

I'd guess that it was perhaps more popular in pre-2007 days for finding uniques in VBA, when one had to resort to either an advanced filter or a pivot in native Excel functionality. But that is just a plain guess and I do not have anything to back it up [as the dictionary existed then as well], and if I remember correctly that particular J-Walk blog article is also pre-2007.

You can also try out ArrayList if you want to. It has some good methods for carrying out sorting. Might be good if we want to sort in memory rather than via an Excel range sort.
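A hedged sketch of the ArrayList idea (hypothetical sub name; this creates the .NET System.Collections.ArrayList via COM, so it assumes the .NET Framework is installed on the machine):

Code:
Sub ArrayListSortSketch()
    Dim List As Object
    Set List = CreateObject("System.Collections.ArrayList")
    List.Add "pear"
    List.Add "apple"
    List.Add "mango"
    List.Sort                 ' in-memory sort, no worksheet range needed
    Dim Item As Variant
    For Each Item In List
        Debug.Print Item      ' apple, mango, pear
    Next
End Sub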
 
@Marc L, shrivallabha
Hi!
I join Marc L here:
I just read that, but I never saw it in my VBA coding experience …
I rather prefer using Array & worksheet functions than Collection for simple needs,
and I jump directly to Dictionary for more specific ones …

Regards!
 
@Pascal: just tried your first approach and it's WAY faster. It works perfectly.


This is about 40 times faster than above, 20 times faster if Application.ScreenUpdating = False is used on the original code.
Just 1 thing though, I can't get it to work right with my progress bar user form.

I'm attaching a new file with everything including the userform. It works by running the "Run" code. Maybe you can identify what's wrong.

Thanks!!
 

Attachments

  • text example.xlsm (480.2 KB)
It was not getting past 54%?
32767 / 60480 (Counter / TotalCells) = 0.5417
See that 32767? It's the maximum value of an Integer.

Change:
Dim Counter As Integer
to:
Dim Counter As Long
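A minimal sketch (hypothetical sub name) of why the counter stalls: with On Error Resume Next active, the increment past 32767 fails silently and Counter freezes at the Integer maximum:

Code:
Sub IntegerOverflowSketch()
    Dim Counter As Integer
    On Error Resume Next       ' as in the OP's code
    Counter = 32767            ' maximum value of an Integer
    Counter = Counter + 1      ' overflow: with Resume Next this fails silently
    Debug.Print Counter        ' still 32767, so the bar sticks at 32767/60480 = 54%
End Sub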

You don't need Application.Screenupdating = False/True; there is only one screen update per run of the macro.
But the progress bar is slowing everything down!

Consider having a variable:
Dim NextPctDone As Single

then adjust your code to:

Code:
    Next
    Counter = Counter + 1
    PctDone = Counter / TotalCells
    If PctDone >= NextPctDone Then
      UpdateProgressBar PctDone
      NextPctDone = NextPctDone + 0.05
    End If
  Next


where the .05 represents 5% jumps (adjust to suit).
 
@cacos: consider a procedure (extracting over 69,000 records from a 72,000-row external file)
which takes 48 seconds to execute with a progress bar userform, but needs just 18 seconds without!

I just tested p45cal's blah procedure with your sample file: 5.787 seconds …
As your original code should be corrected to remove commas, dots and other characters (linefeeds!)
to avoid duplicates (like the words ever & you, among others: up to 18) in the sorted result list, the time will increase.
So it depends whether you really need to split the same 14 comments into words and count them,
or just directly count the number of times each comment appears (0.2 seconds max) …
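The clean-up suggested above could be sketched like this (hypothetical helper function; the exact punctuation set is an assumption, as the thread does not specify it):

Code:
' Normalise a comment before splitting into words: strip punctuation
' and line breaks so "you," and "you" count as the same word.
Function CleanWords(ByVal Text As String) As Variant
    Dim Ch As Variant
    For Each Ch In Array(",", ".", ";", ":", "!", "?", "[", "]", """", vbLf, vbCr)
        Text = Replace(Text, Ch, " ")
    Next
    ' Collapse the double spaces left behind by the substitutions
    Do While InStr(Text, "  ") > 0
        Text = Replace(Text, "  ", " ")
    Loop
    CleanWords = Split(Trim$(Text), " ")
End Function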
 
HA! How did I miss that Integer! Thanks a lot!!!

Also good idea about jumping in 5% steps, I'll be doing that I think. Thanks again Pascal!


@Marc L: you're right, when counting whole comments the dictionary approach is awesome. I'm still going over it to fully understand everything as it's way past my VBA knowledge. I'm going word by word now, but it'll come in handy pretty soon.

@everyone: Thanks a lot for jumping in, I just can't express how much all these different coding styles and approaches contribute to my VBA education. You guys really do magic and make people passionate about Excel :)
 