Question

We currently receive several thousand flat files per week, and I have a system that runs reports on these and exports them to PDF for our people to process and reference.

I currently bulk-load these into a database, make sure all fields and formatting are valid, export them, and truncate the tables on the next run.

What I'm wondering is: what does everyone think would be the most space-efficient way to store possibly six months of this bulk-loaded plain-text data?

Either in the form of daily SQL backups, zipped archives, or whatever, so that I always have the ability to reload old data for troubleshooting.

Any ideas are welcome, I'm open to any suggestions.

Solution

So, you bulk-load flat files of raw data, you use SQL Server 2005 to process them and get a separate bunch of processed flat files, and then dump the data?

Well, if this is correct, SQL backups won't help since you seem to be saying the data doesn't stay in the DB. Your only option is efficient compression of the input and/or output files coupled with good organization of the batches in directories.

I would recommend an aggressive compression program that has scheduled batch functionality, but be careful not to get too esoteric with the program you use, to avoid being locked into one tool...

OTHER TIPS

Use a recent-generation compression utility (7z and rar compression are great) and compress everything into bundles after organizing it so it's easy to find.
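For illustration, here is a minimal sketch (Python, with made-up paths) of shelling out to the 7-Zip command line to bundle one organized batch directory; "a" adds to an archive, -t7z selects the 7z format, and -mx=9 asks for maximum compression:

    import subprocess

    def compress_batch(batch_dir, archive_path):
        # "a" = add to archive, -t7z = 7z format, -mx=9 = maximum compression.
        subprocess.run(["7z", "a", "-t7z", "-mx=9", archive_path, batch_dir], check=True)

    # Example call (both paths are placeholders):
    compress_batch(r"C:\loads\2009-06-15", r"D:\archives\2009-06-15.7z")

Hook something like that up to a scheduled task and you also get the scheduled batch behaviour mentioned above.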

There are SDKs for 7-Zip that work with .NET to make this easy.

-Adam

There are two types of data post-analysis:

  • original data (usually very big)
  • derived data (usually smaller)

In your case, the derived data might be the data that goes into your reports. For your original data, I'd just make a huge compressed archive file of it with a systematic name based on the date and the type of data. The value of this is that if some newbie on your team somehow totally obliterates the code that imports your original data into the database, you can recover from it. If the derived data is small, you might think about copying it to another database table or keeping it in a separate flat file, because some of your problems could be solved just by getting to the derived data.
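As a sketch of that systematic naming (Python here; the archive root, the data-type names, and the <type>_<date>.zip layout are all assumptions, not something from the question):

    import datetime
    import pathlib
    import shutil

    ARCHIVE_ROOT = pathlib.Path(r"D:\archives")  # assumed archive location

    def archive_originals(source_dir, data_type, batch_date):
        # Produces e.g. D:\archives\claims\claims_2009-06-15.zip
        target_dir = ARCHIVE_ROOT / data_type
        target_dir.mkdir(parents=True, exist_ok=True)
        stem = target_dir / f"{data_type}_{batch_date.isoformat()}"
        return shutil.make_archive(str(stem), "zip", root_dir=source_dir)

    # archive_originals(r"C:\loads\claims\2009-06-15", "claims", datetime.date(2009, 6, 15))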

Backing up your data in general is a tricky problem, because it depends on things like:

  • Amount of data throughput
  • Available space for off-site backups
  • Value of upgrading your backup system versus just resigning yourself to regenerating data if problems happen.

What's your setup like? Will hard drives grow fast enough to hold the compressed version of your data? Have you thought about off-site backups?

Construct a file hierarchy that organizes the files appropriately, zip the whole directory, and use the -u flag on zip to add new files. After you archive them, you can delete the files, but preserve the directory structure for the next batch to be added.

If the file names encode the version somehow (dates or whatever) or are otherwise unique, it doesn't need to be anything fancier than a single directory. If not, you need to set up your directories to let you recover versions.
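A rough Python equivalent of the zip -u idea above, assuming a staging hierarchy and one rolling archive (both paths are placeholders):

    import pathlib
    import zipfile

    STAGING = pathlib.Path(r"C:\staging")               # organized hierarchy of new files (assumed)
    ARCHIVE = pathlib.Path(r"D:\archives\rolling.zip")  # single growing archive (assumed)

    def roll_into_archive():
        # Mode "a" appends to an existing zip, or creates it on the first run.
        with zipfile.ZipFile(ARCHIVE, "a", compression=zipfile.ZIP_DEFLATED) as zf:
            already_stored = set(zf.namelist())
            for path in STAGING.rglob("*"):
                if path.is_file():
                    arcname = path.relative_to(STAGING).as_posix()
                    if arcname not in already_stored:
                        zf.write(path, arcname)
                        path.unlink()  # delete the file, but leave its directory in place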

Compress them and save them in a binary field in the database. Then you can build a "reload data-set" button to bring in your dataset (I'm assuming you keep track of each dataset that you import, so you can replace it, etc.).

This way, everything's stored in the database, and backed up with the database, indexed and linked correctly, and compressed at the same time.
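A sketch of that approach (Python with pyodbc here; the table, columns, and connection string are assumptions, not something from the question):

    import gzip
    import pyodbc

    CONN_STR = "DRIVER={SQL Server};SERVER=myserver;DATABASE=Archive;Trusted_Connection=yes"

    def store_file(path, dataset_id):
        with open(path, "rb") as f:
            compressed = gzip.compress(f.read())
        conn = pyodbc.connect(CONN_STR)
        try:
            # FileData is assumed to be a varbinary(max) column.
            conn.execute(
                "INSERT INTO dbo.ArchivedFiles (DatasetId, FileName, FileData) VALUES (?, ?, ?)",
                dataset_id, path, compressed)
            conn.commit()
        finally:
            conn.close()

    def reload_file(dataset_id, file_name):
        conn = pyodbc.connect(CONN_STR)
        try:
            row = conn.execute(
                "SELECT FileData FROM dbo.ArchivedFiles WHERE DatasetId = ? AND FileName = ?",
                dataset_id, file_name).fetchone()
            return gzip.decompress(row.FileData)  # assumes the row exists
        finally:
            conn.close()

The "reload data-set" button would then just decompress the stored bytes and feed them back through the same bulk-load path.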

You've indicated that you'd like to avoid SDKs and installing software on remote systems.

Your options are pretty limited.

Since you are using Windows computers, why not use a simple script?

This question offers several suggestions on how to use Windows VBScript to compress and decompress files:
Can Windows' built-in ZIP compression be scripted?

Nothing to 'install', no SDKs. Just copy the script over, call it via the scheduler, and you're all set.

-Adam

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow