Import text from huge text file into sheet columns

I have a text file around 2 GB in size, containing around 1,05,00,000 (roughly 10.5 million) lines. I want to import all the data into an Excel sheet and split it into columns (since it won't fit in a single column). Given the above data, I estimate it will take around 10 columns. (I need to process the data further to remove duplicate rows.)

So I just want to import that data into sheet columns using a macro.
 
Do you really want to import all the rows into Excel? It is not designed to store such a large dataset.

My recommendation:
Either use Access or other software to filter the data in smaller parts.
Yes, after some operations, like replace and remove duplicates, I expect the line count will come down to around 60K. Only at the end will I save the spreadsheet.
 
I am trying this. Thank you!
It hung somewhere in between. I also tried to import the data manually using the Data tab -> From Text. It said the file was too large, so it imported only the data that fits in the available rows. And the next time, when I start the import from (10 lakh rows + 1), it imports only a single row instead of all the data that would fit into that column.
 
Could you post a sample text file containing around 200 rows of data? Also identify the basis for working out duplicates. Make sure the sample file contains a few such cases.
 
Hmm, which version of Excel?

When dealing with large data (from a DB or text files), there are a few things you can try.

1. If you have Excel 2010 or later with Power Query/Get & Transform and an x64 installation of Excel, use PQ to import the data, do all the transformation in the editor, and only load the data once you have the subset you want to keep. You can either keep the connection or kill it to save file size (if you kill the connection, I'd suggest keeping a template copy where the data is kept as a connection only, so you can easily replicate the steps).

2. Download the free Power BI Desktop, do the clean-up in it, and copy-paste the result (too bad they've removed export to Excel from Desktop; you need a Power BI Pro subscription for that now...).
https://www.microsoft.com/en-us/download/details.aspx?id=45331

3. Use the command line, PowerShell, etc. to split the text file into multiple files of smaller size. See the links below for some standard batch file code and a PowerShell script.
https://stackoverflow.com/questions...smaller-multiple-text-file-using-command-line
https://gallery.technet.microsoft.com/scriptcenter/PowerShell-Split-large-log-6f2c4da0

You can use one of the open-source or commercial tools to split them as well; Google and you should find plenty out there.
Then import and clean each file separately, combine, and do a final clean-up at the end.

4. Load the data into a DB, do the clean-up there, then read from the DB into Excel.
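As a rough sketch of option 4 (my own illustration, not something from this thread), a file-backed database such as SQLite can deduplicate values through a PRIMARY KEY constraint, so memory use stays low even for millions of lines. The function name and table layout here are assumptions:

```python
import sqlite3

def dedupe_to_db(lines, db_path=":memory:"):
    """Load lines into SQLite, keeping only unique values via INSERT OR IGNORE."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS t (v TEXT PRIMARY KEY)")
    # Duplicate keys are silently skipped; the index does the work on disk.
    con.executemany("INSERT OR IGNORE INTO t (v) VALUES (?)",
                    ((ln,) for ln in lines))
    con.commit()
    return [row[0] for row in con.execute("SELECT v FROM t ORDER BY v")]
```

For a 2 GB file you would pass a real db_path on disk and stream the file line by line instead of building a list in memory.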
 
I am using Excel 2013. I tried to install the Power Pivot add-in but could not see that tab.
Yeah, I got a batch file which splits the file into multiple parts with 10 lakh lines in each. Then I'm using replace-all and remove-duplicates for each column using VBA. Thank you, I will try the rest of the options.
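A line-count split like the batch file described above can also be sketched in Python (an illustrative stand-in, not the actual batch file; the `.partN.txt` output naming is my assumption):

```python
def split_file(path, max_lines=1_000_000):
    """Split a text file into chunks of at most max_lines lines each.

    Returns the number of chunk files created (path.part1.txt, path.part2.txt, ...).
    """
    part, out, written = 0, None, 0
    with open(path, encoding="utf-8") as src:
        for line in src:
            # Open a new chunk file when none is open or the current one is full
            if out is None or written >= max_lines:
                if out:
                    out.close()
                part += 1
                out = open(f"{path}.part{part}.txt", "w", encoding="utf-8")
                written = 0
            out.write(line)
            written += 1
    if out:
        out.close()
    return part
```

With max_lines kept under Excel's 1,048,576-row sheet limit, each chunk can then be imported on its own sheet.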
 
I want to remove all text before the string test: (including the string itself). I also want to remove the text after the space.

Some content from file:

file1:test:src/draw/no/ind/erefe.c/gegegewg 1
filesdsd:test:src/draw/no/ind/reryeryrey.c/jhrtjtrj 1
fileskdk:sdsD:test:abc/disp/no/ind/ryretyrey.txt/Display::Restrict 20

Desired output:

src/draw/no/ind/erefe.c/gegegewg
src/draw/no/ind/reryeryrey.c/jhrtjtrj
abc/disp/no/ind/ryretyrey.txt/Display::Rest

The text between the above two delimiters can be duplicated within the same file. For 10 lakh lines, about 30K are unique, and even for 140 lakh lines the unique count is only around 60K.
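The stated rule (keep everything after the last occurrence of the delimiter, up to the first space) can be sketched like this; `extract_key` is a name I made up, and unlike a vbTextCompare search this illustration is case-sensitive:

```python
def extract_key(line, delim="test:"):
    """Keep text after the last occurrence of delim, up to the first space."""
    after = line[line.rfind(delim) + len(delim):]
    return after.split(" ")[0].strip()

# From the posted sample:
# extract_key("file1:test:src/draw/no/ind/erefe.c/gegegewg 1")
#   -> "src/draw/no/ind/erefe.c/gegegewg"
```

Feeding every line through this and collecting the results in a set would then give the unique entries directly.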
 
Test this code, which reads the data into memory and pastes it at the end. Test it first with data where you know only 30K unique entries exist.
Code:
Public Sub ReadTextFileData()
    Dim intFF As Integer
    Dim strInput As String
    Dim objDict As Object

    intFF = FreeFile
    '\\ Change text file path here
    Open "C:\TEMP\New Text Document.txt" For Input As #intFF
    Set objDict = CreateObject("Scripting.Dictionary")
    objDict.CompareMode = vbTextCompare
    Do While Not EOF(intFF)
        Line Input #intFF, strInput
        '\\ Keep text after the last "test:" and before the first space
        strInput = Trim(Split(Mid(strInput, InStrRev(strInput, "test:", , vbTextCompare) + 5, Len(strInput)), " ")(0))
        If Not objDict.exists(strInput) Then
            objDict.Add strInput, strInput
        End If
    Loop
    Close #intFF
    '\\ Last line will error if Keys count is greater than 64K
    Range("A1").Resize(objDict.Count, 1) = Application.Transpose(objDict.Keys)
End Sub
 

Will you please elaborate on the last comment, about the keys count?
Also, is there any limit on the number of lines in the input file for this VBA?
 
Yes, the limit is not on the input lines but on the "Application.Transpose" part. See the discussion below:
https://stackoverflow.com/questions/39736261/excel-vba-array-transpose-size-limit

But if, in principle, the code suits you, then the last part can be tweaked to handle larger data. So verify that it works as you need, and then we can look ahead.
 

The VBA is working fine, and fast too; it saved me lots of time.
But it's printing the output like this:
Code:
1:src/draw/no/ind/erefe.c/gegegewg
1:src/draw/no/ind/reryeryrey.c/jhrtjtrj
1:abc/disp/no/ind/ryretyrey.txt/Display::Rest
I can remove it using replace-all, though. Thanks a lot again!
And I hope the unique count will be less than 64K.
 
Sorry, I can't reproduce the result you are getting with the data posted in post 11:
For
Code:
file1:test:src/draw/no/ind/erefe.c/gegegewg 1
filesdsd:test:src/draw/no/ind/reryeryrey.c/jhrtjtrj 1
fileskdk:sdsD:test:abc/disp/no/ind/ryretyrey.txt/Display::Restrict 20
I get
Code:
src/draw/no/ind/erefe.c/gegegewg
src/draw/no/ind/reryeryrey.c/jhrtjtrj
abc/disp/no/ind/ryretyrey.txt/Display::Restrict
Have you edited the macro at your end?
 
Yes, I have changed the text file name and the search string from test: to MN1201:
 
Then you need to change this part:
Code:
strInput = Trim(Split(Mid(strInput, InStrRev(strInput, "test:", , vbTextCompare) + 5, Len(strInput)), " ")(0))
The 5 corresponds to the length of test:, so for "MN1201:" it is 7 characters:
Code:
strInput = Trim(Split(Mid(strInput, InStrRev(strInput, "MN1201:", , vbTextCompare) + 7, Len(strInput)), " ")(0))
 
I see, ok, thanks again!
 
Here's a revised version without Transpose, just in case you encounter more than 65,536 items. I am not sure if it will be slower or faster, but either way there should not be a big performance hit, since we still do most of the work in memory.
Code:
Public Sub ReadTextFileData()
    Dim intFF As Integer
    Dim strInput As String
    Dim objDict As Object
    Dim varRng As Variant, k As Variant
    Dim i As Long

    intFF = FreeFile
    '\\ Change text file path here
    Open "C:\TEMP\New Text Document.txt" For Input As #intFF
    Set objDict = CreateObject("Scripting.Dictionary")
    objDict.CompareMode = vbTextCompare
    Do While Not EOF(intFF)
        Line Input #intFF, strInput
        '\\ Keep text after the last "test:" and before the first space
        strInput = Trim(Split(Mid(strInput, InStrRev(strInput, "test:", , vbTextCompare) + 5, Len(strInput)), " ")(0))
        If Not objDict.exists(strInput) Then
            objDict.Add strInput, strInput
        End If
    Loop
    Close #intFF
    '\\ Prepare an array sized to the dictionary keys
    ReDim varRng(1 To objDict.Count, 1 To 1)
    i = 1
    For Each k In objDict.Keys
        varRng(i, 1) = k
        i = i + 1
    Next k
    Range("A1").Resize(objDict.Count, 1).Value = varRng
    Set objDict = Nothing
End Sub
 
Thank you so much again!
Just one query: will I come to know if the unique count is greater than 65K? I guess I will get an error when it exceeds the limit, and in that case (when I realize the issue) I will use your second macro.
 