Reading and parsing large delimited text files in VB.net
07-03-2021
Question
I'm busy with an application which reads space-delimited log files ranging from 5 MB to over 1 GB in size, then stores this information in a MySQL database for later use when printing reports based on the information contained in the files. The methods I've tried/found work, but they are very slow.
Am I doing something wrong, or is there a better way to handle very large text files?
I've tried using TextFieldParser as follows:
Using parser As New TextFieldParser("C:\logfiles\testfile.txt")
    parser.TextFieldType = FieldType.Delimited
    parser.CommentTokens = New String() {"#"}
    parser.Delimiters = New String() {" "}
    parser.HasFieldsEnclosedInQuotes = False
    parser.TrimWhiteSpace = True

    While Not parser.EndOfData
        Dim input As String() = parser.ReadFields()
        If input.Length = 10 Then
            'add this to a datatable
        End If
    End While
End Using
This works but is very slow for the larger files.
I then tried using an OleDB connection to the text file, as per the following function, in conjunction with a schema.ini file that I write to the directory beforehand:
Function GetSquidData(ByVal logfile_path As String) As System.Data.DataTable
    Dim myData As New DataSet
    Dim strFilePath As String = ""

    If logfile_path.EndsWith("\") Then
        strFilePath = logfile_path
    Else
        strFilePath = logfile_path & "\"
    End If

    Dim mySelectQry As String = "SELECT * FROM testfile.txt WHERE Client_IP <> """""
    Dim myConnection As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & strFilePath & ";Extended Properties=""text;HDR=NO;""")
    Dim dsCmd As New System.Data.OleDb.OleDbDataAdapter(mySelectQry, myConnection)

    dsCmd.Fill(myData, "logdata")

    If Not myConnection.State = ConnectionState.Closed Then
        myConnection.Close()
    End If

    Return myData.Tables("logdata")
End Function
The schema.ini file:
[testfile.txt]
Format=Delimited( )
ColNameHeader=False
Col1=Timestamp text
Col2=Elapsed text
Col3=Client_IP text
Col4=Action_Code text
Col5=Size double
Col6=Method text
Col7=URI text
Col8=Ident text
Col9=Hierarchy_From text
Col10=Content text
Anyone have any ideas how to read these files faster?
-edit- Corrected a typo in the code above
Solution
There are two potentially slow operations there:
- File reading
- Inserting lots of data into the db
Separate them and test which is taking the most time: write one test program that simply reads the file, and another test program that just inserts loads of records, and see which one is slower.
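For example, a rough timing harness along these lines can show where the time goes (a minimal sketch only; InsertDummyRows is a hypothetical stand-in for your MySQL insert code, and the file path is taken from your question):
Imports System
Imports System.Diagnostics
Imports System.IO

Module TimingTest
    Sub Main()
        ' Pass 1: read the file only, no parsing and no database work.
        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim lineCount As Long = 0
        Using sr As New StreamReader("C:\logfiles\testfile.txt")
            While sr.ReadLine() IsNot Nothing
                lineCount += 1
            End While
        End Using
        sw.Stop()
        Console.WriteLine("Read only: {0} lines in {1} ms", lineCount, sw.ElapsedMilliseconds)

        ' Pass 2: insert a comparable number of dummy rows into MySQL.
        sw = Stopwatch.StartNew()
        'InsertDummyRows(lineCount)  ' hypothetical: your insert logic fed with generated data
        sw.Stop()
        Console.WriteLine("Insert only: {0} ms", sw.ElapsedMilliseconds)
    End Sub
End Module
If the read pass alone is already slow, the bottleneck is I/O or parsing; if it is fast, look at batching the database inserts instead.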
One problem could be that you are reading the whole file into memory.
Try reading it line by line with a StreamReader. Here is a code example copied from MSDN:
Imports System
Imports System.IO

Class Test
    Public Shared Sub Main()
        Try
            ' Create an instance of StreamReader to read from a file.
            ' The Using statement also closes the StreamReader.
            Using sr As New StreamReader("TestFile.txt")
                Dim line As String
                ' Read and display lines from the file until the end of
                ' the file is reached.
                Do
                    line = sr.ReadLine()
                    If Not (line Is Nothing) Then
                        Console.WriteLine(line)
                    End If
                Loop Until line Is Nothing
            End Using
        Catch e As Exception
            ' Let the user know what went wrong.
            Console.WriteLine("The file could not be read:")
            Console.WriteLine(e.Message)
        End Try
    End Sub
End Class
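Since your log lines are space delimited with a fixed field count, the same line-by-line approach can feed a plain String.Split instead of TextFieldParser, which tends to be lighter. A minimal sketch, assuming the ten-column layout from your schema.ini and the path from your question:
Imports System
Imports System.IO

Module SplitParseExample
    Sub Main()
        Using sr As New StreamReader("C:\logfiles\testfile.txt")
            Dim line As String = sr.ReadLine()
            While line IsNot Nothing
                ' Skip comment lines, mirroring the CommentTokens setting in the question.
                If Not line.StartsWith("#") Then
                    ' Split on spaces; RemoveEmptyEntries collapses repeated delimiters.
                    Dim fields As String() = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
                    If fields.Length = 10 Then
                        ' Add the fields to a DataTable or an insert batch here.
                    End If
                End If
                line = sr.ReadLine()
            End While
        End Using
    End Sub
End Module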
OTHER TIPS
Off the top of my head, I'd say try to implement some kind of threading to spread the workload.
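For instance, one thread could read and parse the file while another thread does the database inserts, so neither side sits idle. A rough producer/consumer sketch using BlockingCollection (InsertRow is a hypothetical placeholder for your MySQL insert or batching logic):
Imports System
Imports System.Collections.Concurrent
Imports System.IO
Imports System.Threading.Tasks

Module ThreadedLoader
    Sub Main()
        ' Bounded queue so the reader cannot run far ahead of the inserter.
        Dim queue As New BlockingCollection(Of String())(10000)

        ' Producer: read and parse lines on one thread.
        Dim reader As Task = Task.Run(
            Sub()
                Using sr As New StreamReader("C:\logfiles\testfile.txt")
                    Dim line As String = sr.ReadLine()
                    While line IsNot Nothing
                        Dim fields As String() = line.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
                        If fields.Length = 10 Then
                            queue.Add(fields)
                        End If
                        line = sr.ReadLine()
                    End While
                End Using
                queue.CompleteAdding()
            End Sub)

        ' Consumer: take parsed rows off the queue and insert them.
        Dim writer As Task = Task.Run(
            Sub()
                For Each fields As String() In queue.GetConsumingEnumerable()
                    'InsertRow(fields)  ' hypothetical: your MySQL insert / batch logic
                Next
            End Sub)

        Task.WaitAll(reader, writer)
    End Sub
End Module
Whether this helps depends on where the bottleneck is; if the database inserts dominate, batching them (or using MySQL's bulk-load facilities) will matter more than extra threads.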