Question

I have come across a problem and am not sure which technology would be best suited to implement it. I would be obliged if you could suggest some options based on your experience. I want to load data from 10-15 CSV files, each fairly large at 5-10 GB. By "load data" I mean convert the CSV files to XML and then populate around 6-7 staging tables in Oracle using this XML. The data needs to be populated such that the elements of the XML, and eventually the rows of the tables, come from multiple CSV files. For example, an element A would have sub-elements whose data comes from CSV file 1, file 2, file 3, and so on.

I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server. The options I am considering are:

  1. Smooks - The problem here is that Smooks serializes one CSV at a time, and I can't afford to hold on to half-baked Java beans until the other CSV files are read, since I run the risk of running out of memory given the sheer number of beans I would need to create and hold before they are fully populated and written to disk as XML.
  2. SQL*Loader - I could skip the XML creation altogether and load the CSVs directly into the staging tables using SQL*Loader. But I am not sure whether I can (a) load multiple CSV files with SQL*Loader into the same tables, updating the records after the first file, and (b) apply some translation rules while loading the staging tables (see the control-file sketch after this list).
  3. A Python script to convert the CSVs to XML.
  4. SQL*Loader to load a separate set of staging tables corresponding to the raw CSV data, and then a stored procedure to load the actual staging tables from this new set (a path I want to avoid, given the amount of change it would require in my existing framework).
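For option 2, here is a minimal SQL*Loader control-file sketch; every table, column, and file name in it (raw_stage, cust_name, file1.csv, ...) is hypothetical:

    -- load_file1.ctl: one control file (or one INFILE clause) per CSV
    LOAD DATA
    INFILE 'file1.csv'
    APPEND                                  -- later runs add rows to the same table
    INTO TABLE raw_stage
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    TRAILING NULLCOLS
    (
      cust_id,
      cust_name   "UPPER(:cust_name)",      -- simple translation rule in SQL
      created_dt  DATE "YYYY-MM-DD",
      amount      "TO_NUMBER(:amount, '999999.99')"
    )

This answers (a) only partially: APPEND lets successive loads accumulate rows in the same table, but SQL*Loader only inserts rows and cannot update ones loaded from an earlier file, so any cross-file merge has to happen afterwards in SQL. Point (b) is covered by the quoted SQL expressions in the column list.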

Thanks in advance. If someone can point me in the right direction or share some insights from personal experience, it will help me make an informed decision.

regards, -v-

PS: The CSV files are fairly simple, with around 40 columns each. The depth of the objects, i.e. the relationships between the files, is around 2 to 3 levels.

Solution

Unless you can use a full-blown ETL tool (e.g. Informatica PowerCenter or Pentaho Data Integration), I suggest the 4th option: it is straightforward, and performance should be good, since Oracle will handle the most complicated part of the task.
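As a rough illustration of option 4, assuming SQL*Loader has filled one raw table per CSV file (raw_file1, raw_file2, and the target stage_a are hypothetical names), the cross-file assembly becomes a single set-based statement that Oracle can execute efficiently:

    -- stored-procedure body, sketched: combine rows from two raw tables
    INSERT INTO stage_a (cust_id, cust_name, order_total)
    SELECT f1.cust_id,
           f1.cust_name,
           SUM(f2.amount)                   -- translation/aggregation in plain SQL
    FROM   raw_file1 f1
    JOIN   raw_file2 f2 ON f2.cust_id = f1.cust_id
    GROUP  BY f1.cust_id, f1.cust_name;

A MERGE statement works the same way when existing staging rows need to be updated rather than inserted.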

OTHER TIPS

In Informatica PowerCenter you can import/export XML files larger than 5 GB. As Marek's answer suggests, try it; it works pretty fast. Here is a brief introduction if you are unfamiliar with the tool.

Create a process/script that calls a procedure to load the CSV files into external Oracle tables, and another script to load the destination tables from them.
You can also add cron jobs that watch the incoming directory for new CSV files, process them, and move each processed file to an output/processed folder.
Exceptions can be handled accordingly by logging them or sending out an email. Good luck.
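A minimal sketch of the external-table approach described above; the directory path, file name, and columns are all hypothetical:

    -- point Oracle at the directory the CSVs arrive in
    CREATE OR REPLACE DIRECTORY csv_dir AS '/data/incoming';

    CREATE TABLE ext_file1 (
      cust_id    NUMBER,
      cust_name  VARCHAR2(100)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY csv_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        MISSING FIELD VALUES ARE NULL
      )
      LOCATION ('file1.csv')
    )
    REJECT LIMIT UNLIMITED;

    -- loading the destination table is then plain SQL
    INSERT INTO stage_a (cust_id, cust_name)
    SELECT cust_id, UPPER(cust_name)        -- translation rule applied inline
    FROM   ext_file1;

Because an external table is queried like any other table, the second script reduces to an INSERT ... SELECT (or MERGE) with whatever translation rules are needed.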
