Reading and parsing large delimited text files in VB.net
07-03-2021
Question
I'm busy with an application which reads space-delimited log files ranging from 5 MB to over 1 GB in size, then stores this information in a MySQL database for later use when printing reports based on the information contained in the files. The methods I've tried/found work, but they are very slow.
Am I doing something wrong, or is there a better way to handle very large text files?
I've tried using TextFieldParser as follows:
Using parser As New TextFieldParser("C:\logfiles\testfile.txt")
    parser.TextFieldType = FieldType.Delimited
    parser.CommentTokens = New String() {"#"}
    parser.Delimiters = New String() {" "}
    parser.HasFieldsEnclosedInQuotes = False
    parser.TrimWhiteSpace = True

    While Not parser.EndOfData
        Dim input As String() = parser.ReadFields()
        If input.Length = 10 Then
            'add this to a datatable
        End If
    End While
End Using
This works but is very slow for the larger files.
I then tried using an OleDB connection to the text file, as per the following function, in conjunction with a schema.ini file that I write to the directory beforehand:
Function GetSquidData(ByVal logfile_path As String) As System.Data.DataTable
    Dim myData As New DataSet
    Dim strFilePath As String = ""

    If logfile_path.EndsWith("\") Then
        strFilePath = logfile_path
    Else
        strFilePath = logfile_path & "\"
    End If

    Dim mySelectQry As String = "SELECT * FROM testfile.txt WHERE Client_IP <> """""
    Dim myConnection As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & strFilePath & ";Extended Properties=""text;HDR=NO;""")
    Dim dsCmd As New System.Data.OleDb.OleDbDataAdapter(mySelectQry, myConnection)

    dsCmd.Fill(myData, "logdata")

    If Not myConnection.State = ConnectionState.Closed Then
        myConnection.Close()
    End If

    Return myData.Tables("logdata")
End Function
The schema.ini file:
[testfile.txt]
Format=Delimited( )
ColNameHeader=False
Col1=Timestamp text
Col2=Elapsed text
Col3=Client_IP text
Col4=Action_Code text
Col5=Size double
Col6=Method text
Col7=URI text
Col8=Ident text
Col9=Hierarchy_From text
Col10=Content text
Anyone have any ideas how to read these files faster?
-edit- Corrected a typo in the code above
Solution
There are two potentially slow operations there:
- File reading
- Inserting lots of data into the db
Separate them and test which is taking the most time: write one test program that simply reads the file, and another test program that just inserts loads of records, and see which one is slower.
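For example, a rough timing harness along these lines can show where the time goes (a minimal sketch only; InsertDummyRows is a hypothetical stand-in for your MySQL insert code, and the file path is taken from your question):
Imports System
Imports System.Diagnostics
Imports System.IO

Module TimingTest
    Sub Main()
        ' Pass 1: read the file only, no parsing and no database work.
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim lineCount As Long = 0
        Using sr As New StreamReader("C:\logfiles\testfile.txt")
            While sr.ReadLine() IsNot Nothing
                lineCount += 1
            End While
        End Using
        sw.Stop()
        Console.WriteLine("Read only: {0} lines in {1} ms", lineCount, sw.ElapsedMilliseconds)

        ' Pass 2: insert a comparable number of dummy rows into MySQL.
        sw = Stopwatch.StartNew()
        'InsertDummyRows(lineCount)  ' hypothetical: your insert logic fed with generated data
        sw.Stop()
        Console.WriteLine("Insert only: {0} ms", sw.ElapsedMilliseconds)
    End Sub
End Module
If the read pass alone is already slow, the bottleneck is I/O or parsing; if it is fast, look at batching the database inserts instead.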
One problem could be that you are reading the whole file into memory.
Try reading it line by line with a StreamReader. Here is a code example copied from MSDN:
Imports System
Imports System.IO

Class Test
    Public Shared Sub Main()
        Try
            ' Create an instance of StreamReader to read from a file.
            ' The Using statement also closes the StreamReader.
            Using sr As New StreamReader("TestFile.txt")
                Dim line As String
                ' Read and display lines from the file until the end of
                ' the file is reached.
                Do
                    line = sr.ReadLine()
                    If Not (line Is Nothing) Then
                        Console.WriteLine(line)
                    End If
                Loop Until line Is Nothing
            End Using
        Catch e As Exception
            ' Let the user know what went wrong.
            Console.WriteLine("The file could not be read:")
            Console.WriteLine(e.Message)
        End Try
    End Sub
End Class
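Since your log lines are space delimited with a fixed field count, the same line-by-line approach can feed a plain String.Split instead of TextFieldParser, which tends to be lighter. A minimal sketch, assuming the ten-column layout from your schema.ini and the path from your question:
Imports System
Imports System.IO

Module SplitParseExample
    Sub Main()
        Using sr As New StreamReader("C:\logfiles\testfile.txt")
            Dim line As String = sr.ReadLine()
            While line IsNot Nothing
                ' Skip comment lines, mirroring the CommentTokens setting in the question.
                If Not line.StartsWith("#") Then
                    ' Split on spaces; RemoveEmptyEntries collapses repeated delimiters.
                    Dim fields As String() = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
                    If fields.Length = 10 Then
                        ' Add the fields to a DataTable or an insert batch here.
                    End If
                End If
                line = sr.ReadLine()
            End While
        End Using
    End Sub
End Module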
OTHER TIPS
Off the top of my head, I'd say try to implement some kind of threading to spread the workload.
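For instance, one thread could read and parse the file while another thread does the database inserts, so neither side sits idle. A rough producer/consumer sketch using BlockingCollection (InsertRow is a hypothetical placeholder for your MySQL insert or batching logic):
Imports System
Imports System.Collections.Concurrent
Imports System.IO
Imports System.Threading.Tasks

Module ThreadedLoader
    Sub Main()
        ' Bounded queue so the reader cannot run far ahead of the inserter.
        Dim queue As New BlockingCollection(Of String())(10000)

        ' Producer: read and parse lines on one thread.
        Dim reader As Task = Task.Run(
            Sub()
                Using sr As New StreamReader("C:\logfiles\testfile.txt")
                    Dim line As String = sr.ReadLine()
                    While line IsNot Nothing
                        Dim fields As String() = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
                        If fields.Length = 10 Then
                            queue.Add(fields)
                        End If
                        line = sr.ReadLine()
                    End While
                End Using
                queue.CompleteAdding()
            End Sub)

        ' Consumer: take parsed rows off the queue and insert them.
        Dim writer As Task = Task.Run(
            Sub()
                For Each fields As String() In queue.GetConsumingEnumerable()
                    'InsertRow(fields)  ' hypothetical: your MySQL insert / batch logic
                Next
            End Sub)

        Task.WaitAll(reader, writer)
    End Sub
End Module
Whether this helps depends on where the bottleneck is; if the database inserts dominate, batching them (or using MySQL's bulk-load facilities) will matter more than extra threads.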