Question

I need to read in two large files (over 125 MB). Each file contains records with similar data. I need to find the records that appear in both files and, where the fields of matching records don't agree, overwrite the fields in file two with the values from file one.

For example the first file has the following fields:

ID, ACCT, Bal, Int, Rate 

The second file has the following fields:

TYPE, ID, ACCT, Bal, Int, Rate.  

So if a record in file 1 has the same ACCT number as a record in file 2, then the Bal, Int, and Rate in file 2 need to be overwritten with the values of Bal, Int, and Rate from file 1.

Some records won't be in both files. The output file I need to create contains all the records from file two: a record that is not also in file one is written as is, and a record that is in both is written with the updated values.

I have tried many different options but most are not efficient enough to deal with the large files. What is the proper direction to take with this problem? Thanks in advance for any help.

Solution

Load all records from file 1 into a hash table with ACCT as key.
Loop over all records in file 2 and update if needed.

Complexity: O(n)
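
The two steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes comma-separated lines with no header row and no quoted commas, and the class name AccountMerge and method name Merge are made up for the example. Only file 1 is held in memory (as a dictionary keyed by ACCT); file 2 is streamed line by line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class AccountMerge
{
    // Assumed layouts (comma-separated, no header row, no quoted commas):
    //   file 1: ID, ACCT, Bal, Int, Rate
    //   file 2: TYPE, ID, ACCT, Bal, Int, Rate
    public static void Merge(TextReader file1, TextReader file2, TextWriter output)
    {
        // Pass 1: index file 1 by ACCT, keeping only the three fields to copy.
        var byAcct = new Dictionary<string, string[]>();
        string line;
        while ((line = file1.ReadLine()) != null)
        {
            if (line.Trim().Length == 0) continue;   // skip blank lines
            var f = line.Split(',');
            byAcct[f[1].Trim()] = new[] { f[2].Trim(), f[3].Trim(), f[4].Trim() };
        }

        // Pass 2: stream file 2 and write every record to the output.
        while ((line = file2.ReadLine()) != null)
        {
            if (line.Trim().Length == 0) continue;   // skip blank lines
            var f = line.Split(',');
            if (byAcct.TryGetValue(f[2].Trim(), out var src))
            {
                // ACCT found in file 1: overwrite Bal, Int, Rate.
                f[3] = src[0];
                f[4] = src[1];
                f[5] = src[2];
                output.WriteLine(string.Join(",", f));
            }
            else
            {
                // Not in file 1: write the record unchanged.
                output.WriteLine(line);
            }
        }
    }
}
```

With real files, the three arguments could be a StreamReader for each input and a StreamWriter for the output. If an ACCT appears more than once in file 1, this sketch keeps the last occurrence; whether that is the right rule depends on the data.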

HTH

OTHER TIPS

Define two type specific classes, one for each file.

class FileOne
{
    public int LineNumber { get; set; }
    public int Id { get; set; }
    public double Bal { get; set; }
...
}

class FileTwo
{
    public int LineNumber { get; set; }
    public string TranType { get; set; }  // the TYPE column, renamed to avoid confusion with System.Type
    public int Id { get; set; }
    public double Bal { get; set; }
...
}

Load each file into an IList<T>, so you have IList<FileOne> myFileOne and IList<FileTwo> myFileTwo, and capture the line number of each entry so you know where it appears in the file.
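
A loader for the first file might look something like the sketch below. The names FileOneRow and FileOneLoader are hypothetical stand-ins (the elided fields are filled in from the layout given in the question), the input is assumed to be comma-separated with an integer ID, and decimal is used instead of double so that equality comparisons on money values are exact. That last point is a choice of this sketch, not something the original classes specify.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical stand-in for the FileOne class, with fields taken from
// the question's layout: ID, ACCT, Bal, Int, Rate.
public class FileOneRow
{
    public int LineNumber { get; set; }
    public int Id { get; set; }
    public string Acct { get; set; }
    public decimal Bal { get; set; }
    public decimal Int { get; set; }
    public decimal Rate { get; set; }
}

public static class FileOneLoader
{
    public static IList<FileOneRow> Load(TextReader reader)
    {
        var rows = new List<FileOneRow>();
        var inv = CultureInfo.InvariantCulture;   // parse "10.50" the same on every machine
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;                          // 1-based; blank lines are counted too
            if (line.Trim().Length == 0) continue; // but not loaded
            var f = line.Split(',');
            rows.Add(new FileOneRow
            {
                LineNumber = lineNumber,
                Id = int.Parse(f[0].Trim(), inv),
                Acct = f[1].Trim(),
                Bal = decimal.Parse(f[2].Trim(), inv),
                Int = decimal.Parse(f[3].Trim(), inv),
                Rate = decimal.Parse(f[4].Trim(), inv)
            });
        }
        return rows;
    }
}
```

A loader for the second file would follow the same pattern with every column index shifted by one to make room for the leading TYPE field. Note that this holds every record of the file in memory as an object, which costs noticeably more than the raw file size.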

Now use LINQ to query the differences between the two:

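var diffs = from f1 in myFileOne
            join f2 in myFileTwo on f1.Id equals f2.Id // or join on ACCT if that is the matching key
            where f1.Bal != f2.Bal // add whatever conditions you need here
            select new {
                f2.Id, f1.Bal, f1.Int, f1.Rate, f2.LineNumber
            };

diffs will be an enumerable collection of the five fields in the select. Since the goal is to overwrite file two with the values from file one, the select takes Bal, Int, and Rate from f1 and the line number from f2. Now you can iterate through the results and use LineNumber to find the matching entry in myFileTwo and update it with the values from file one.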

Does that help or were you more interested in how to access the file itself?

Licensed under: CC-BY-SA with attribution