Extracting numbers from text in excel [Case study]

Posted on June 19th, 2012 in Excel Howtos - 63 comments

Often we deal with data where numbers are buried inside text and we need to extract them. This morning I had such a task. As you know, we recently ran a survey asking how much salary you make. We have had 1800 responses to it so far. I took the data to Excel to analyze it. And surprise! The numbers are a mess. Here is a sample of the data.

[Image: Extract numbers from text in Excel - How to?]

Now, how do I extract the salary amounts from this without typing the values?

My first thought was to write a user-defined function (UDF) to extract the number from text. But I usually shy away from VBA. So I wanted to see if there is a formula-based approach to extract the number from text.

Using formulas to extract number from text

[Image: Extracting numbers from text using Excel formulas - process]

To extract a number from text, we need to know 2 things:

  1. Starting position of the number in text
  2. Length of the number

For example, in the text US $ 31330.00 the number starts at the 6th character and has a length of 8.

So, if we can write formulas to get 1 & 2, then we can combine them in a MID formula to extract the number from text!
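For readers who think more easily in code, here is a minimal Python sketch of what MID does with a starting position and a length. The helper name mid is made up for this illustration; it only mimics Excel's 1-based MID and is not part of the workbook.

```python
def mid(text: str, start: int, length: int) -> str:
    """Mimic Excel's MID: `start` is 1-based, return up to `length` characters."""
    return text[start - 1:start - 1 + length]


sample = "US $ 31330.00"
# The number starts at the 6th character and is 8 characters long.
print(mid(sample, 6, 8))  # 31330.00
```

The rest of the article is about working out the 6 and the 8 automatically.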

Finding the starting position of number in text

To find the starting position, we need to find the first character which is a digit (0 to 9). In other words, if we can find the positions of 0 to 9 inside the given text, then the minimum of all such positions would be the starting position.

Sounds complicated?!? Well, in that case look at the formula and then you will understand why this works.

Assuming the text is in A1 and the range lstNumbers contains the digits 0 to 9, the formula below finds the starting position:

=MIN(IFERROR(FIND(lstNumbers,A1),""))

You need to array-enter it (CTRL+SHIFT+Enter).

How does this formula work?

FIND(lstNumbers, A1) portion: This part finds where each of the digits 0 to 9 occurs in the text in A1. If a match is found, its position is returned. Otherwise we get an error. For US $ 31330.00 the values would be,

{10;7;#VALUE!;6;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!;#VALUE!}

Meaning, 0 first occurs at the 10th position, 1 at the 7th position, 3 at the 6th position, and the remaining digits (2, 4, 5, 6, 7, 8, 9) do not occur in the text.

IFERROR(…,"") portion: Then, we replace the errors with empty strings so that MIN can work its magic (MIN ignores text values inside an array).

At this stage, the result would be {10;7;"";6;"";"";"";"";"";""}

Related: IFERROR Formula – syntax & examples

{=MIN(…)} portion: This finds the minimum of {10;7;"";6;"";"";"";"";"";""}, which is 6: the starting position of the number inside the text.

Because we are finding multiple items, we need to array enter the formula to get correct result.
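The same logic can be sketched in a few lines of Python. This is only an illustration of the idea, not the workbook formula: the function name first_digit_position is invented, and it returns None when the text has no digits at all, where the Excel formula would behave differently.

```python
from typing import Optional

DIGITS = "0123456789"  # plays the role of the lstNumbers range


def first_digit_position(text: str) -> Optional[int]:
    """Return the 1-based position of the first digit in `text`, or None."""
    # str.find is 0-based, so add 1 to match Excel's FIND.
    # Digits that do not occur are skipped, like the errors IFERROR blanks out.
    positions = [text.find(d) + 1 for d in DIGITS if d in text]
    return min(positions) if positions else None


print(first_digit_position("US $ 31330.00"))  # 6
```

Here the list comprehension corresponds to FIND plus IFERROR, and min corresponds to MIN.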

Finding the length of number

Once we find the starting point, we next need to know the length of the number. There are many ways to do this. Depending on the variety in your input data, you can choose the technique that works best.

Approach 1 – counting number of digits in text

My first approach is to count the number of digits in the text and use it as the length. For this, we can break the text into individual characters and then check whether each of them is a number.

Assuming the text is in A1, the number of digits in it is,

=SUMPRODUCT(--ISNUMBER(MID(A1,ROW($A$1:$A$200),1)+0))

MID(A1,ROW($A$1:$A$200),1)+0 portion: This breaks the text in A1 into individual characters (assuming a maximum length of 200) and then adds 0 to each of them.

At this stage, you have 200 values: some are numbers, the others are errors.

ISNUMBER(…) portion: This checks each of the 200 values for being a number. After this, we have 200 TRUE or FALSE values.

--ISNUMBER(…) portion: This converts the TRUE and FALSE values to 1s and 0s (double negation makes Excel convert boolean values to their numeric equivalents).

SUMPRODUCT(…) portion: This finally sums up all 1s thus giving us the number of digits in the text.

Does it work?

While this approach works well for some numbers, it fails in other cases. For example, a text like US $ 31330.00 has a number portion of 8 characters (31330.00), whereas our formula would say the length is 7 (because the decimal point . is not a number and hence ISNUMBER() gives FALSE for it).

So I had to move on to next approach.
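A small Python sketch makes the shortfall of Approach 1 easy to see. The function name count_digits is invented for this illustration; it just counts the characters 0 to 9, which is what the SUMPRODUCT formula above ends up doing.

```python
def count_digits(text: str) -> int:
    """Count how many characters of `text` are one of the digits 0-9."""
    return sum(1 for ch in text if ch in "0123456789")


sample = "US $ 31330.00"
print(count_digits(sample))  # 7
print(len("31330.00"))       # 8
```

The count is one short because the decimal point is part of the number but is not a digit.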

Approach 2 – counting number of digits, commas & decimal points in text

The next approach is to count not only digits, but also commas & decimal points in the text. For this, I first placed all the digits (0 to 9), the comma and the decimal point in a range named lstDigits.

The formula below counts how many characters of the text in A1 are in lstDigits.

=SUMPRODUCT(COUNTIF(lstDigits,MID(A1,ROW($A$1:$A$200),1)))

COUNTIF(lstDigits, MID(…)) portion: This checks how many times each of the 200 characters appears in lstDigits.

The result is an array of counts. For example, {0;0;0;0;0;1;1;1;1;1;1;1;1;…} for US $ 31330.00, indicating that the first 5 characters are not in lstDigits and the next 8 are.

SUMPRODUCT(…) portion: This just sums all the counts, hence we get the length as 8.
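As a rough Python equivalent of Approach 2, the sketch below builds the same array of 0s and 1s and sums it. The names ALLOWED and count_number_chars are made up for this example; ALLOWED stands in for the lstDigits range.

```python
ALLOWED = set("0123456789,.")  # stands in for the lstDigits range


def count_number_chars(text: str) -> int:
    """Count characters that are digits, commas or decimal points."""
    # One flag per character: 1 if it is in ALLOWED, else 0 (the COUNTIF step).
    flags = [1 if ch in ALLOWED else 0 for ch in text]
    # Summing the flags is the SUMPRODUCT step.
    return sum(flags)


print(count_number_chars("US $ 31330.00"))  # 8
```

Note that this counts every comma and dot in the text, including ones that are not part of the number, which is one reason the method is not foolproof.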

Related: SUMPRODUCT Formula – examples & explanation

[Image: Extract numbers from text in Excel - results explained]

Extracting numbers from text

Once we have the starting position of the number & its length, we can combine them in a MID formula to extract the number. Here is the result for our sample data set.

As you can see, this method works well, but fails in some cases like,

  • European number formats (, for decimal point and . for thousands)
  • Text with multiple numbers

Fortunately, in my data set, we had only a few incidents like these. So I decided to adjust them manually rather than work out an even more complicated formula.
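To see both the success and the failure cases in one place, here is a hedged Python sketch that combines the starting position and the Approach 2 length. The function name extract_number_text is invented, and the two failing inputs (20 to 30 USD and EUR 31.330,00) are made-up examples, not rows from the survey data.

```python
from typing import Optional

DIGITS = "0123456789"
NUMBER_CHARS = set(DIGITS + ",.")


def extract_number_text(text: str) -> Optional[str]:
    """Start at the first digit and take as many characters as there are
    digits, commas and dots in the whole text (the MID step)."""
    positions = [text.find(d) for d in DIGITS if d in text]
    if not positions:
        return None  # no digit anywhere in the text
    start = min(positions)  # 0-based starting position
    length = sum(1 for ch in text if ch in NUMBER_CHARS)
    return text[start:start + length]


print(extract_number_text("US $ 31330.00"))  # 31330.00
print(extract_number_text("20 to 30 USD"))   # 20 t
print(extract_number_text("EUR 31.330,00"))  # 31.330,00
```

The second result is wrong because the length counts digits from both numbers. The third is extracted as text, but it still needs its separators swapped before it can be treated as a number in a dot-decimal locale.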

Using Macros to extract numbers from text

As you can guess, we can use a simple macro (or UDF) to extract numbers from a given text. We will learn how to do this next week.

Download Example Workbook

Click here to download example workbook with all these formulas. Examine the formulas to understand how you can extract numbers from text in Excel.

How do you Extract numbers from Text?

Often I deal with data like this. I use a mix of techniques. Apart from the one mentioned above I also use,

  • getNumber() UDF to extract numbers from text (more on this next week)
  • Use SUBSTITUTE to clear formatting (replace dots with empty strings and commas with dots to convert from European format to standard format)
  • Use VALUE to extract the number (works when number is shown as text)
  • Use +0 to force convert numbers from text (works when number is shown as text)
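The SUBSTITUTE idea from the list above can be sketched in Python as two replacements followed by a conversion. The function name european_to_standard is invented for this illustration, and it assumes the input really is in European format (dots for thousands, comma for decimals); applied to a standard-format number it would give a wrong answer.

```python
def european_to_standard(number_text: str) -> float:
    """Convert text like '31.330,00' to the number 31330.0."""
    # Drop the thousands separators first, then turn the decimal comma into a dot.
    cleaned = number_text.replace(".", "").replace(",", ".")
    return float(cleaned)


print(european_to_standard("31.330,00"))  # 31330.0
```

The order of the two replacements matters: swapping commas to dots first would make the decimal separator indistinguishable from the thousands separators.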

What about you? How do you extract numbers from text? What are your favorite techniques? Please share using comments.

Tips on cleaning data using Excel

If you use Excel to clean data, go through these articles to learn some powerful techniques.

Written by Chandoo

63 Responses to “Extracting numbers from text in excel [Case study]”

  1. Gregor Erbach says:

    I have learned one important lesson in my career: don't use Excel as a word processor!
    Excel is very ill-suited to handling tasks like pattern matching in text (regular expressions).
    For example, one could use a Perl script (from the CMD command line) to extract numbers from the text, something like
    perl -n -e "m/([\d\,\.]+)/; print \"$1\n\"" IN > OUT
    The point I am trying to make is, if text processing is a substantial part of your tasks, it is worth the effort learning a tool that is more suited to the task.

    • Hui... says:

      I agree with what your saying about suitability of tools to task,

      but seriously perl -n -e “m/([\d\,\.]*)/; print “$1\n” IN > OUT

      Doesn't just roll off your tongue 

      and very few PC's would even have Perl installed

    • Chandoo says:

      Interesting point. I would have used VBA or some variation of that to clean data. But as Hui says Perl is not something many (including me) have on our computers. Plus I suck at regular expressions. I have tried long and hard to learn them, but I guess my mind is not wired to understand how they work 🙁

    • Jim says:

      The VBA engine for regular expressions works fine in Excel and Access. I agree if you are doing a substantial amount of text cleaning then doing so exclusively in Excel is probably not the best course, but using regular expressions in Perl or any other flavor doesn't really provide any advantage over using regex in Excel. At the end of the day, regex is regex (with the exception of minor syntactic differences).

      Chandoo, a lot of people are intimidated by regex just like people are intimidated by formulas or macros. Once you learn a handful of rules and concepts, you can really do some damage with regex, and not just for data cleaning. You can validate input, parse,  transform (e.g. firstname lastname --> lastname, firstname), etc. Once you pick it up you'll see many opportunities where it's useful and sometimes superior to another approach (and sometimes not). My brain is not wired to understand multiple nested formulas or complicated array formulas so I guess regex is just easier to me. I would add that the regular expression approach not only cleans this data, but is flexible enough to account for other variations...try it on the other records you have and see how well it performs. Either way, it's great to know there are multiple approaches to solving the problem!

  2. Bhavani Seetal Lal says:

    How do I sign in to chandoo.org???

  3. Vijay Kumar says:

    Dear Chandoo...
    Awesome, you are such a genius, how can you think like "what the excel will think"? Such ideas implementation...Very great . But lastly, i just wanted to know that in your formula =SUMPRODUCT(COUNTIF(lstDigits,MID(B4,ROW($A$1:$A$200),1))), why we can't use 1 on behalf of ROW($A$1:$A$200), obviously if we will input this formula somewhere in excel, it results 1. So what is the logic to use row function instead of simple 1.
    Please explain, i am eager to learn the logic of inputting or building a formulas. Can you suggest me the source like web url or books to build better formulas in our excel work

    • Luke M says:

      Within SUMPRODUCT, that bit of formula is actually producing an array of numbers from 1 to 200, e.g. {1;2;3;...;200}
      To see this in the formula, highlight that portion within the SUMPRODUCT and hit F9. 

  4. Hui... says:

    In the first example above, I would store lstNumbers as a Named Formula:

    lstNumbers ={0;1;2;3;4;5;6;7;8;9}

  5. I do not know why you would not want to use a UDF - Here is the code for one I wrote that extracts Numbers, Letters, Commas and spaces to clean up data  - amend as you wish for ANY ASCII characters
    Function NumbersAndLettersOnly(instring) As String
        Dim StringLength As Integer 'to hold string length
        Dim i As Integer 'counter for loop
        Dim AsciiVal 'to hold working character
        Dim WorkingString As String 'to build output string
        instring = Trim(instring) 'Drop leading & trailing spaces
        StringLength = Len(instring) 'Count number of characters in string
        For i = 1 To StringLength 'Loop thru each character in the string
            AsciiVal = Asc(Mid(instring, i, 1))
            If AsciiVal >= 48 And AsciiVal <= 57 Then 'Numbers 0-9
                WorkingString = WorkingString & Chr(AsciiVal)
            ElseIf AsciiVal >= 65 And AsciiVal <= 90 Then 'A-Z
                WorkingString = WorkingString & Chr(AsciiVal)
            ElseIf AsciiVal >= 97 And AsciiVal <= 122 Then 'a-z
                WorkingString = WorkingString & Chr(AsciiVal)
            ElseIf AsciiVal = 46 Then '.
                WorkingString = WorkingString & Chr(AsciiVal)
            ElseIf AsciiVal = 32 Then '{space}
                WorkingString = WorkingString & Chr(AsciiVal)
            End If
        Next i
        NumbersAndLettersOnly = WorkingString 'Return output to function
    End Function

    • @Cliff,

      In case you might be interested, here is a shorter UDF which does the same thing as the one you posted...

      Function AlphaNumerics(ByVal InString) As String
        Dim X As Integer
        For X = 1 To Len(InString)
          If Mid(InString, X, 1) Like "[!A-Za-z0-9. ]" Then Mid(InString, X, 1) = Chr$(1)
        Next
        AlphaNumerics = Replace(InString, Chr$(1), "")
      End Function

      • Kevin says:

        I used this one, because it was the smallest but most readable (to me).

        The quotes didn't translate correctly into VBA when I cut and pasted, so I just deleted them and typed them in again. Also, this function does the opposite: it finds the search characters in the 1st set of quotes, and then substitutes a null character. In my case, I wanted the opposite, so I just added a NOT in front of Mid. Thanks Rick. 🙂

  6. Hi Purna,
    I had a similar request earlier this week. Here's the 'quick & dirty' solution I used. Keep in mind that, if there are a lot of different characters and text you need to remove from your text, this can get a little cumbersome. But  for removing  a few different text strings and characters from the same cell, this is pretty simple to understand.

    Using your example, to remove the space character , INR, and Rs, I would use this formula...


    You start with =SUBSTITUTE(B4," ","") and then wrap it in another SUBSTITUTE  for each text string you want to remove.

  7. Elias says:

    One more option that works with all Excel versions.

    FIND(" ",A1&" ",MIN(FIND({0,1,2,3,4,5,6,7,8,9},A1&"0123456789")))

    Confirm with Ctrl+Shift+Enter


  8. Elias says:

    One more with a different approach.




  9. Okay, one more method. The following was posted some time ago by Lars-Åke Aspelin in some forum or newsgroup...
    This is an array formula and has to be confirmed with CTRL+SHIFT+ENTER rather than just ENTER.
    It has the following (known) limitations:
    - The input string in cell A1 must be shorter than 300 characters
    - There must be at most 14 digits in the input string.
       (Following digits will be shown as zeroes.)
    Maybe of no practical use, but it will also handle the following two cases correctly:
    - a "0" as the first digit in the input will be shown correctly in the output
    - an input without any digits at all will give the empty string as output (rather than 0).

  10. Jim says:

    I would probably use regular expressions in this case. The functions are:


    Function RegExReplace(ReplaceIn, _
            ReplaceWhat As String, ReplaceWith As String, Optional IgnoreCase As Boolean = False)
        Dim RE As Object
        Set RE = CreateObject("vbscript.regexp")
        RE.IgnoreCase = IgnoreCase
        RE.Pattern = ReplaceWhat
        RE.Global = True
        RegExReplace = RE.Replace(ReplaceIn, ReplaceWith)
    End Function

    Function RegExFind(FindIn, FindWhat As String, _
            Optional IgnoreCase As Boolean = False)
        Dim i As Long
        Dim matchCount As Integer
        Dim RE As Object, allMatches As Object, aMatch As Object
        Set RE = CreateObject("vbscript.regexp")
        RE.Pattern = FindWhat
        RE.IgnoreCase = IgnoreCase
        RE.Global = True
        Set allMatches = RE.Execute(FindIn)
        matchCount = allMatches.Count
        If matchCount >= 1 Then
            ReDim rslt(0 To allMatches.Count - 1)
            For i = 0 To allMatches.Count - 1
                rslt(i) = allMatches(i).Value
            Next i
            RegExFind = rslt
        Else
            RegExFind = ""
        End If
    End Function


    and the usage is:


    The expression \d{2,7}(\.)?\d{2,7}(\.\d{2}|,0{1,2})? is a little twisted, but these data are pretty dirty, too! Just using this expression: \d+ would match all but two of the sample records.


  11. Godsbod says:

    The obvious solution would have been to make a more structured questionnaire, with a separate value field for currency and amount...

    but the challenge now arising is almost a worthy by-product...

  12. Juan Fernandez says:

    This formula can extract multiple numbers....


    Best Regards

  13. Carl says:

    I think the bigger problem is the one created by separating the numbers from the letters.  The goal was to figure out what people are making in their excel jobs, and now we have numbers but no reference to the currency, so you can't average these numbers to figure out how much the average excel-wise employee is making.  We'd have to go back and determine the currency and calculate the exchange rate.  I think @Godsbod had it right.  The more important lesson is to think through the questionnaire in order to get more manageable results, which is what we really want.

  14. [...] week we discussed how to extract numbers from text in Excel using formulas. In comments, quite a few people suggested that using VBA (Macros) to [...]

  15. r says:

    your formulas are very beautiful ... But I do not understand why to use a matrix of constants!
    use ROW(1:10)-1 and 1234567890 ... better!
    Also your formulas do not work in cases similar to this ABCD 1234 EFGH ... You need to make small changes ... Meanwhile I propose this is shorter and also works in these cases:
    =LEFT(REPLACE(A1,1,MIN(FIND(ROW(1:10)-1,A1&57321^2))-1,),FIND(" ",REPLACE(A1,1,MIN(FIND(ROW(1:10)-1,A1&57321^2))-1,)&" ")-1)
    ah ... 57321 is a pandigital number 🙂

    • Elias says:

      @r nice to see you around.
      Other than the number of characters I don’t see any advantages using the row option over the array of constants. However, using constants avoid the use of array enter and the formulas don’t get mess if you insert or delete rows.

  16. r says:

    you're right but I can not look at the formula
    is only an aesthetic factor ... impossible to say "beautiful" by a matrix of constants

  17. r says:

    i have changed your formula ... so it work with ABCD 1234 EFGH:
    =--MID(LEFT(A1,FIND(" ",A1&" ",MIN(FIND(ROW(1:10)-1,A1&57321^2)))-1),MIN(FIND(ROW(1:10)-1,A1&57321^2)),20)
    what do you think?

    • Elias says:


      I like that one, but it doesn't work with cases like this ABCD12345EFGH. So, what about this one?



  18. r says:

    oh @elias ... fantastic! i like it!
    and, why not so:

  19. r says:

    ufff ... so:

    • Elias says:

      Yea!!! I like the use of negative sign in 2 different places.


    • bashful says:

      Great! I was able to use this formula to extract the first number from HGVS (human genome variation society) nomenclature:

      c.247_248delCG 247

      p.Glu789_Arg790del 789

      The c. or p. are sometimes omitted and the numbers can be of variable length (1-12 digits).

      I would like to understand how this formula works. I basically know lookup, left, mid, right and find but this is over my head!

      Thank you!

  20. joel says:

    Mr. Chandoo,
    please ask you a help
    I want to have a code for Excel 2010, to list on a sheet all macros from a workbook excel
    please, help me in this

    thank you very much

  21. Raj says:

    I want to separate in each column like 25 PF 105 E01 XXXX , from below examples , please help or write srinivasmr@rediffmail.com



  22. Jim says:

    Hi Raj,

    Use the RegExReplace function I posted above and then use this:

    =RegExReplace(A1,"(^\d{2,3})(\w{2})(\d{3})(\w{3}(\w{4}))","$1 $2 $3 $5")

    where the text you want to parse is in cell A1. 

    And the result is:

    25 PF 105 XXXX
    100 SP 105 XXXX

    It makes the following assumptions:

    - The first part is at least 2 but no more than 3 digits
    - The second part is 2 characters
    - The third part consists of three digits
    - The fourth part consists of 4 alpha-numeric characters


  23. Rahim Zulfiqar Ali says:


  24. Santosh says:


    I have an issue where I have a strings of number and letter that are separted by comma. (Q123,125-129,QA123,127-129).

    I would like extend the numbers that are inside the hyphen so that it becomes

    Can this be completed using just the formulas.

  25. Vijay Verma says:

    Check this out...This seems to be the right thing to do..
    This says following -
    Try this

    A1: abc123def456ghi789

    First, create a Named Formula
    Names in Workbook: Seq
    Refers to: =ROW(INDEX($1:$65536,1,1):INDEX($1:$65536,255,1))

    This ARRAY FORMULA removes ALL non-numerics from a string
    B1: =SUM(IF(ISNUMBER(1/(MID(A1,seq,1)+1)),MID(A1,seq,1)*10^MMULT(-(seq<TRANSPOSE(seq)),-ISNUMBER(1/(MID(A1,seq,1)+1)))))

    In the example, the formula returns: 123456789

  26. antonio says:

    please, help
    I need code for Excel 2010 for list all the macros of book excel, in one sheet. name of the macro and sheet of the macro
    many thanks

    • Hui... says:


      Try the following:

      Sub ListMacros()
      ' Code thanx to Bob Philips
      Const vbext_pk_Proc = 0
      Dim VBComp As Object
      Dim VBCodeMod As Object
      Dim oListsheet As Object
      Dim StartLine As Long
      Dim ProcName As String
      Dim iCount As Integer

      Application.ScreenUpdating = False
      On Error Resume Next
      Set oListsheet = ActiveWorkbook.Worksheets.Add
      iCount = 1
      oListsheet.Range("A1").Value = "Macro"

      For Each VBComp In ThisWorkbook.VBProject.VBComponents
      Set VBCodeMod = ThisWorkbook.VBProject.VBComponents(VBComp.Name).CodeModule
      With VBCodeMod
      StartLine = .CountOfDeclarationLines + 1
      Do Until StartLine >= .CountOfLines
      oListsheet.[a1].Offset(iCount, 0).Value = .ProcOfLine(StartLine, vbext_pk_Proc)
      iCount = iCount + 1
      StartLine = StartLine + .ProcCountLines(.ProcOfLine(StartLine, vbext_pk_Proc), vbext_pk_Proc)
      Loop
      End With
      Set VBCodeMod = Nothing
      Next VBComp

      Application.ScreenUpdating = True
      End Sub

      or have a look at: http://msdn.microsoft.com/en-us/library/office/dd890502%28v=office.11%29.aspx

  27. Justin says:

    This formula works perfectly with European number formats, two numbers, no numbers, and large numbers:


    read more: http://www.excelhowto.com/how-to-find-the-first-number-in-a-text-string-use-excel-formula/

  28. Beth says:

    With the data I had, I was able to use your first approach with great success. In my case I did not have any commas or decimals to worry about; I was simply trying to extract a simple account number from a string of letter and numbers. So, after creating the named range for lstNumbers, I was able to use the starting position formula, the number of characters formula, and then used a MID function to get the values I needed.

    Thank you, thank you, thank you! 🙂

  29. Jimmy Luong says:

    This is great. Thanks a lot

  30. MF says:

    I encountered exactly the same situation recently, here's my approach to solve it.
    Ctrl Shift Enter

    Not working for all cases though, e.g. with more than 1 set of numbers

    Here's my post for sharing:

  31. Umer says:

    i am using formula
    to extract 61.49 from 61.49 mg but it is giving

    please guide what is wrong

  32. jpcpa says:

    Hi Sir. How can I separate the name and the tax information number into two columns, given the data below?

    "SAVER'S DIGITAL HUB APPLINCE DEPOT (1 Condura ) VAT 215-003-620
    ""DIGITEL MOBILE PHIL. VMH SUN-0922-877-7139 VAT 215-398-626
    ""DIGITEL MOBILE PHIL. VMH SUN-0922-877-7139 VAT 215-398-626
    ""DAPO REST AND BAR VAT 217-078-338
    ""WILCON BUILDER'S DEPOT INC. (1 pc. Vinyl Adhesive WaterBase 1 Gallon &
    110 pc. Kent Vinyl WP Deluxe 6""X36"" Natural Oak-YHT) VAT 221-252-819
    ""AVEXSON CORPORATION VAT 221-764-997 "
    "AVEXSON CORPORATION (for ZDJ 514) VAT 221-764-997
    ""TRG AUTO GEM PARTS SUPPLY NV234-220-733"

    Thanks Sir.

  33. Martin says:

    Hi you commented that:
    As you can see, this method works well, but fails in some cases like,
    European number formats (, for decimal point and . for thousands)
    Text with multiple numbers

    Fortunately, in my data set, we had only a few incidents like these. So I have decided to manually adjust them than work out even more complicated formula.

    This is my problem, almost all my data is in this format and I need a formula to sort out this very issue...

  34. kimdom says:

    I just cant get myself to think excel logic. Somebody help reading the number of days from a list like:

    Nov15 PUTS (23 days)
    March15 TIKS (3 days)
    March1 TIKS (25 days)
    June11 TIKS (10 days)

    So the reader should return the following:


  35. Shlomi says:

    Good Day all,

    could not find a solution's on all above comment....maybe someone can help me with this ?

    i have in a cell string such as and i want to find if this string contains a number appear in other cell, lets say cell $A$1 = 3.
    I've able to do it, BUT
    when i have string such as: , it looks on the 3 (on $A$1) and returned as if this number is part of the string.
    But i want to check if it contain 13 NOT 3.

    I being thinking that maybe I should check first the Length on the A1 cell contain the number (3) and then check the cell with the string (
    could not figure out how to do it....

    Try formula SEARCH or even by VBA , Instr(A1,""), but with no success. 🙁

    any help will be appreciated!!!


  36. TMR says:

    Thank you very much for your wonderful lessons for excel beginners like me.
    I have a question, if you have time, can you take a look and help me.
    Everyday I have to download a list of products which are sold in my site.
    Unfortunately, the site keep adding number to my product number to make it unique value to the system.
    For example, my product number is A1.
    In the first day, if 2 items of A1 were sold, then the system will shows A1-1, A1-2.
    And sometimes, if I mistakenly use the description of the listing for A1-1 as a sample to create a new listing, then the model number will become A1-1-2

    So I will have a list of A1, A1-1, A1-2, A1-1-2, A1-1-2-1... for example.
    I want to delete all the added parts after A1.
    Can you help me with this problem?

    Thank you very much for your time

    • Hui... says:

      I'll assume the A1-?? are in Column A
      Use a helper Column =Left(A1,2) then copy that down
      If The A1 can be A11 etc use =Left(A1,Find("-",A1)-1)
      then copy that down

      • TMR says:

        Thank you Hui for your reply.
        I see your answer and it really works with Product which has only 2 character like A1.
        But I also have some other products which has number: FA-3MF-BK
        or GFT-HJKU-25-BKJ
        So when I download the file from the system it always like this:

        In this case, what should I do?
