Question

We will have about 200 files (CSV, Excel, PDF, screen scrapes) that all need to go into a SQL database, so there will be a unique procedure for most data sources.

The only two ideas we have so far are:

  1. Write code to programmatically load each data source and insert it as needed (this is the simple way, but probably the most time-consuming).

  2. Write an XML map for each file that maps columns from the source to columns/tables of the destination SQL DB. But then, writing code to interpret this custom XML mapping file could get complex?

Are there any other tools or methods we should consider? I thought maybe SSIS could help somehow? This seems like the type of project BizTalk was made for, right? But that is too expensive.

Solution

As Pondlife mentioned, programmatic solutions in the real world usually become more and more difficult to maintain and support as the full complexity of the requirements is uncovered. This is often not obvious up front.

I would choose a good ETL tool; SSIS is usually the best choice at present on the balance of typical criteria. Then you need to budget a number of man-days to work through each input. The quickest you will likely achieve is 0.5 man-days per file (including design, build, and unit testing) for a very simple input.

You can save some time by copying your first package as a starting point for the others.

With "raw" inputs like this I typically start each package by just loading the data unaltered into a Staging table. At this point I load every column as unicode text. Then subsequent Data Flows or packages can pick that data up and deliver it. This approach really speeds debugging, testing and auditing - once you trust your file load you can query the Staging table with SQL.
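The answer above does this inside an SSIS Data Flow; purely to illustrate the staging idea in script form, here is a minimal Perl sketch that loads a CSV unaltered into an all-text staging table. The DSN, file name, and table name are invented.

    #!/usr/bin/perl
    # Minimal sketch: stage a raw CSV, every column as unicode text, no cleanup.
    use strict;
    use warnings;
    use DBI;
    use Text::CSV;

    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    my $dbh = DBI->connect('dbi:ODBC:StagingDB', '', '', { RaiseError => 1 });

    open my $fh, '<:encoding(utf8)', 'inputs/customers.csv' or die $!;
    my $header = $csv->getline($fh);    # first row = column names

    # Create the staging table with every column as NVARCHAR - nothing is typed yet.
    my $cols = join ', ', map { "[$_] NVARCHAR(MAX)" } @$header;
    $dbh->do("IF OBJECT_ID('dbo.Staging_Customers') IS NULL
              CREATE TABLE dbo.Staging_Customers ($cols)");

    my $placeholders = join ', ', ('?') x @$header;
    my $sth = $dbh->prepare("INSERT INTO dbo.Staging_Customers VALUES ($placeholders)");

    while (my $row = $csv->getline($fh)) {
        $sth->execute(@$row);           # raw values, exactly as found in the file
    }
    $dbh->disconnect;

Once the data is in the staging table, you can debug and audit it with ordinary SQL, exactly as described above.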

BTW the SSIS package is in fact an XML document that describes the input, transformation and output requirements - similar to your point 2.

Other suggestions

There is no universally correct way; it only matters what's easier for your specific situation. I'd go with the path of least resistance here. This means if some files are easier to map with XML (probably CSV, Excel, and the like), I'd use XML mapping for those. For the others, where XML mapping doesn't work, I'd go with something else.

The reality is that some methods work better with one type of data source and other methods work better with another.
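For what it's worth, interpreting a per-file XML mapping (point 2 in the question) does not have to get very complex. Here is a rough sketch in Perl, with an invented mapping format and invented file, table, and column names:

    use strict;
    use warnings;
    use XML::LibXML;

    # A made-up mapping format: one source file, one target table, column pairs.
    my $doc = XML::LibXML->load_xml(string => q{
        <mapping source="inputs/orders.csv" table="dbo.Orders">
          <column from="Order No" to="OrderNumber"/>
          <column from="Cust"     to="CustomerId"/>
          <column from="Total"    to="TotalAmount"/>
        </mapping>
    });

    my ($map) = $doc->findnodes('/mapping');
    my @pairs = map { [ $_->getAttribute('from'), $_->getAttribute('to') ] }
                $map->findnodes('column');

    printf "load %s into %s\n", $map->getAttribute('source'), $map->getAttribute('table');
    printf "  %-10s -> %s\n", @$_ for @pairs;

From those pairs, one generic loader can build its column lists for any file that has a mapping, which is where the approach starts to pay off.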

Perl. Just hack at each file type to produce a delimited file suitable for bcp to load into the database. As often as not you can use regular expressions in Perl to grab data even out of XML files, but if you know XML and the inputs really are well formed, Perl has plenty of parsers to turn proper XML into proper data. ;-)
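For example (the file names, the line format, and the regex here are all hypothetical), the hack-and-load step can be as small as this:

    use strict;
    use warnings;

    open my $in,  '<', 'inputs/vendor_report.txt'  or die $!;
    open my $out, '>', 'outputs/vendor_report.bcp' or die $!;

    while (my $line = <$in>) {
        # e.g. "Invoice 10423 issued 2013-02-01 for $1,204.50"
        next unless $line =~ /Invoice\s+(\d+)\s+issued\s+(\d{4}-\d{2}-\d{2})\s+for\s+\$([\d,.]+)/;
        my ($invoice, $date, $amount) = ($1, $2, $3);
        $amount =~ tr/,//d;                        # strip thousands separators
        print {$out} join('|', $invoice, $date, $amount), "\n";
    }
    close $out;

    # The result can then be loaded with SQL Server's bcp utility, e.g.
    # (database and table names are placeholders):
    #   bcp MyDb.dbo.Staging_Invoices in outputs/vendor_report.bcp -c -t"|" -S myserver -T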

Perl on Windows will automate Excel through OLE, too. Been there, done that; it works about as well as can be expected. Save the file as text, maybe iterate over it to fix it up, and repeat as necessary.
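A sketch of that OLE automation, assuming Excel is installed and using placeholder paths (Win32::OLE and Win32::OLE::Const are CPAN modules):

    use strict;
    use warnings;
    use Win32::OLE;
    use Win32::OLE::Const 'Microsoft Excel';    # imports constants such as xlCSV

    my $xl = Win32::OLE->new('Excel.Application')
        or die "Could not start Excel: ", Win32::OLE->LastError;
    $xl->{DisplayAlerts} = 0;                   # no "overwrite existing file?" prompts

    my $wb = $xl->Workbooks->Open('C:\\etl\\inputs\\budget.xlsx');
    $wb->SaveAs('C:\\etl\\work\\budget.csv', xlCSV);
    $wb->Close(0);                              # close without saving again
    $xl->Quit;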

I don't agree that this sort of thing isn't amenable to programming, as someone else suggested. It's not perfectable, but errors can be reduced asymptotically, which isn't the case with a manual process.

Keep your scripts, inputs, and outputs all in different directories. That way you can use Perl (or whatever) to count the files and verify the transformations. If you're careful with your names and extensions, it will be easy to see what remains to be done. Make your scripts do everything, soup to nuts, including loading the database, so that you can re-run them whenever you want, idempotently. It's very satisfying after you notice a problem in the data in the database that can only be fixed by tweaking the parser.
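A sketch of the count-and-verify idea, assuming an inputs/ and outputs/ layout and a .bcp extension for finished outputs (both conventions invented here):

    use strict;
    use warnings;
    use File::Basename;

    my @inputs  = glob 'inputs/*';
    my @outputs = glob 'outputs/*.bcp';

    # An input counts as done when an output with the same base name exists.
    my %done = map { basename($_, '.bcp') => 1 } @outputs;

    my @todo;
    for my $in (@inputs) {
        my $name = basename($in);
        $name =~ s/\.[^.]+$//;                  # strip the input's own extension
        push @todo, $in unless $done{$name};
    }

    printf "%d inputs, %d outputs, %d still to process\n",
        scalar @inputs, scalar @outputs, scalar @todo;
    print "  $_\n" for @todo;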

Happy hacking.
