Question

Essentially, I have to get a flat file into a database. The flat files come in with the first two characters on each line indicating which type of record it is.

Do I create a class for each record type with properties matching the fields in the record? Should I just use arrays?

I want to load the data into some sort of data structure before saving it in the database so that I can use unit tests to verify that the data was loaded correctly.

Here's a sample of what I have to work with (BAI2 bank statements):

01,121000358,CLIENT,050312,0213,1,80,1,2/
02,CLIENT-STANDARD,BOFAGB22,1,050311,2359,,/
03,600812345678,GBP,fab1,111319005,,V,050314,0000/
88,fab2,113781251,,V,050315,0000,fab3,113781251,,V,050316,0000/
88,fab4,113781251,,V,050317,0000,fab5,113781251,,V,050318,0000/
88,010,0,,,015,0,,,045,0,,,100,302982205,,,400,302982205,,/
16,169,57626223,V,050311,0000,102 0101857345,/
88,LLOYDS TSB BANK PL 779300 99129797
88,TRF/REF 6008ABS12300015439
88,102 0101857345 K BANK GIRO CREDIT
88,/IVD-11 MAR
49,1778372829,90/
98,1778372839,1,91/
99,1778372839,1,92

Solution

I'd recommend creating classes (or structs, or whatever value type your language supports), as record.ClientReference is so much more descriptive than record[0], and, if you're using the (wonderful!) FileHelpers Library, then your terms are pretty much dictated for you.
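
For instance, here's a rough Java sketch of such a class for the "02" records in the sample above; the field names are guesses for illustration, not official BAI2 terms:

    // Hypothetical class for one record type; field names are illustrative.
    public class GroupHeaderRecord
    {
        private String clientReference;  // e.g. "CLIENT-STANDARD"
        private String bankReference;    // e.g. "BOFAGB22"

        public String getClientReference() { return clientReference; }
        public String getBankReference()   { return bankReference; }

        // Populate named fields from one raw comma-separated line.
        public static GroupHeaderRecord parse(String line)
        {
            String[] fields = line.split(",");
            GroupHeaderRecord r = new GroupHeaderRecord();
            r.clientReference = fields[1];
            r.bankReference = fields[2];
            return r;
        }
    }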

OTHER TIPS

Validation logic usually has at least two levels: the coarser level checks that a record is well-formed, and the finer level checks that the data in it is correct.

There are a few separate problems here. One is simply verifying the data, i.e. writing tests to make sure that your parsing is accurate. A simple way to do this is to parse into a class that accepts only a given range of values and throws an appropriate error otherwise, e.g.

public void setField1(int i) { if (i > 100) throw new InvalidDataException("field1 out of range: " + i); }

Creating different classes for each record type is something you might want to do if the parsing logic differs significantly between codes, so that you don't end up with conditional logic like

   public void setField2(String s)
   {
       if (field1 == 88 && s.equals(...)) {
           // one record type's rules
       }
       else if (field1 == 22 && s.startsWith(...)) {
           // another record type's rules
       }
   }

yechh.
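
Instead you can give each record type its own class and dispatch once on the two-character code. A rough Java sketch, where every class name is invented for illustration:

    // Hypothetical factory: the leading two-character code selects the class
    // that knows how to parse that record type (all names invented).
    public final class RecordFactory
    {
        public static BaiRecord parse(String line)
        {
            String code = line.substring(0, 2);
            switch (code)
            {
                case "01": return new FileHeaderRecord(line);
                case "03": return new AccountRecord(line);
                case "16": return new TransactionRecord(line);
                case "88": return new ContinuationRecord(line);
                default:
                    throw new IllegalArgumentException("unknown record code: " + code);
            }
        }
    }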

When I have had to load this kind of data in the past, I have put it all into a work table, with the first two characters in one field and the rest of the line in another. From there I parsed it out into the appropriate per-record-type work tables based on those first two characters, and then did any cleanup and validation before inserting the data from that second set of work tables into the database.
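
If you go this route, the first step can be automated with a few lines of JDBC; here's a rough sketch, where the work table raw_lines(record_code, record_body), the JDBC URL, and the file name are all assumptions for illustration:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class WorkTableLoader
    {
        // Stage each line as (two-character code, rest of line).
        public static void load(String jdbcUrl) throws Exception
        {
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO raw_lines (record_code, record_body) VALUES (?, ?)"))
            {
                for (String line : Files.readAllLines(Paths.get("statement.bai")))
                {
                    ps.setString(1, line.substring(0, 2));  // record type code
                    ps.setString(2, line.substring(3));     // rest, after the comma
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }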

In SQL Server you can do this through a DTS (2000) or SSIS package. With SSIS you may even be able to process the data on the fly, without storing it in work tables first, but the process is similar: use the first two characters to determine the data-flow branch, parse the rest of the record into some type of holding mechanism, and then clean up and validate before inserting. I'm sure other databases also have some type of mechanism for importing data and would use a similar process.

I agree that if your data format has any sort of complexity you should create a set of custom classes to parse and hold the data, perform validation, and do any other appropriate model tasks (for instance, return a human-readable description, though some would argue that belongs in a separate view class). This is probably a good situation for inheritance: a parent class (possibly abstract) defines the properties and methods common to all record types, and each child class overrides them to provide its own parsing and validation, or adds its own properties and methods.
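
A rough sketch of what that parent class could look like, with invented names:

    // Hypothetical base class: fields and hooks shared by all record types.
    public abstract class BaiRecord
    {
        protected final String recordCode;  // leading two-character type code

        protected BaiRecord(String line)
        {
            this.recordCode = line.substring(0, 2);
        }

        // Each subclass parses its own fields from the comma-separated body.
        protected abstract void parseBody(String body);

        // Finer-grained "correct data" checks; override where needed.
        public void validate() { }

        // Human-readable description, overridable per record type.
        public String describe()
        {
            return "record type " + recordCode;
        }
    }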

Creating a class for each type of row would be a better solution than using arrays.

That said, in the past I have used ArrayLists of Hashtables to accomplish the same thing. Each item in the ArrayList is a row, and each entry in the Hashtable is a key/value pair representing column name and cell value.
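
In Java that might look like the following, with HashMap standing in for the older Hashtable (the column names are invented):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RowListDemo
    {
        public static void main(String[] args)
        {
            // Each row is a map of column name -> cell value.
            List<Map<String, String>> rows = new ArrayList<>();

            Map<String, String> row = new HashMap<>();
            row.put("RecordCode", "03");
            row.put("AccountNumber", "600812345678");
            row.put("CurrencyCode", "GBP");
            rows.add(row);

            // Fields are looked up by name rather than by position.
            System.out.println(rows.get(0).get("CurrencyCode"));  // GBP
        }
    }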

Why not start by designing the database that will hold the data? Then you can use the Entity Framework to generate the classes for you.

Here's a wacky idea:

If you were working in Perl, you could use DBD::CSV to read data from your flat file, provided you gave it the correct values for the separator and EOL characters. You'd then read rows from the flat file by means of SQL statements; DBI will turn them into standard Perl data structures for you, and you can run whatever validation logic you like. Once each row passes all the validation tests, you'd be able to write it into the destination database using DBD::whatever.

-steve

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow