Question

I have to archive a large amount of data off of CDs and DVDs, and I thought it was an interesting problem that people might have useful input on. Here's the setup:

  • The script will be running on multiple boxes on multiple platforms, so I thought Python would be the best language to use. If the logic creates a bottleneck, any other language works.
  • We need to archive ~1000 CDs and ~500 DVDs, so speed is a critical issue
  • The data is very valuable, so verification would be useful
  • The discs are pretty old, so a lot of them will be hard or impossible to read

Right now, I'm planning to use shutil.copytree to dump the files into a directory and then compare file trees and sizes. Maybe throw in a quick hash, although that will probably slow things down too much.

So my specific questions are:

  • What is the fastest way to copy files off a slow medium like CD/DVDs? (or does the method even matter)
  • Any suggestions of how to deal with potentially failing discs? How do you detect discs that have issues?

Solution

When you read file by file, you're seeking randomly around the disc, which is a lot slower than a bulk transfer of contiguous data. And, since the fastest CD drives are several dozen times slower than the slowest hard drives (and that's not even counting the speed hit for doing multiple reads on each bad sector for error correction), you want to get the data off the CD as soon as possible.

Also, of course, having an archive as a .iso file or similar means that, if you improve your software later, you can re-scan the filesystem without needing to dig out the CD again (which may have further degraded in storage).

Meanwhile, recovering damaged CDs and damaged filesystems is a lot more complicated than you'd expect.

So, here's what I'd do:

Block-copy the discs directly to .iso files (whether in Python, or with dd), and log all the ones that fail.
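As a minimal sketch of that first step in Python (the device path /dev/sr0 in the docstring is an assumption about a Linux box, and a real ripper would add retries and progress reporting):

```python
import logging

logging.basicConfig(filename="rip.log", level=logging.INFO)

BLOCK_SIZE = 1024 * 1024  # large reads to keep the transfer sequential

def rip_to_iso(device, iso_path):
    """Block-copy a disc device (e.g. the hypothetical /dev/sr0) to an
    .iso file; log and return False if any read fails."""
    try:
        with open(device, "rb") as src, open(iso_path, "wb") as dst:
            while True:
                chunk = src.read(BLOCK_SIZE)
                if not chunk:
                    break
                dst.write(chunk)
        logging.info("ripped %s -> %s", device, iso_path)
        return True
    except OSError as exc:
        logging.error("FAILED %s: %s", device, exc)
        return False
```

From the shell, dd if=/dev/sr0 of=disc.iso bs=1M conv=noerror does roughly the same thing; the Python version just makes the failure logging explicit.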

Hash the .iso files, not the filesystems. If you really need to hash the filesystems, keep in mind that the common optimization of compressing the data before hashing (that is, tar czf - | shasum instead of just tar cf - | shasum) usually slows things down, even for easily compressible data, but you might as well test it both ways on a couple of discs. If you need your verification to be legally useful, you may have to use a timestamped signature provided by an online service instead, in which case compressing probably will be worthwhile.
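Hashing a multi-gigabyte .iso is easy to do in streaming fashion so the image never has to fit in memory; a small sketch (SHA-256 is my arbitrary choice of algorithm):

```python
import hashlib

def hash_iso(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks and return the
    hex digest, without loading the whole image into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Usage is just hash_iso("disc0001.iso"); store the digests alongside the images.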

For each successful .iso file, mount it and use basic file copy operations (whether in Python, or with standard Unix tools), and again log all the ones that fail.
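A rough sketch of that step, assuming a Linux host where loop-mounting an .iso requires root (the mount/umount invocations are assumptions about your platform); the copy helper collects per-file failures instead of aborting the whole copy:

```python
import logging
import shutil
import subprocess

logging.basicConfig(level=logging.INFO)

def copy_tree_logging(src_dir, dest_dir):
    """Copy a file tree; return a list of (src, dst, reason) tuples for
    any files that failed, logging each one, instead of aborting."""
    try:
        shutil.copytree(src_dir, dest_dir)
        return []
    except shutil.Error as exc:
        errors = exc.args[0]
        for src, _dst, reason in errors:
            logging.error("copy failed: %s (%s)", src, reason)
        return errors

def dump_iso(iso_path, mount_point, dest_dir):
    """Mount an .iso read-only, copy its contents out, then unmount.
    mount_point must be an existing empty directory."""
    subprocess.run(["mount", "-o", "loop,ro", iso_path, mount_point],
                   check=True)
    try:
        return copy_tree_logging(mount_point, dest_dir)
    finally:
        subprocess.run(["umount", mount_point], check=True)
```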

Get a free or commercial CD recovery tool like IsoBuster (not an endorsement, just the first one that came up in a search, although I have used it successfully before) and use it to manually recover all of the damaged discs.

You can do a lot of this work in parallel—when each block copy finishes, kick off the filesystem dump in the background while you're block-copying the next disc.
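That overlap can be sketched with concurrent.futures; the rip and dump callables below are placeholders standing in for whatever block-copy and file-copy routines you end up using:

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(discs, rip, dump):
    """Rip discs one at a time (the drive is the bottleneck) while
    dumping finished images in background threads.

    discs is a sequence of (device, iso_path) pairs; rip(device, iso_path)
    returns True on success; dump(iso_path) extracts the files. Returns
    the dump results for every successfully ripped disc, in order.
    """
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = []
        for device, iso_path in discs:
            # The slow, drive-bound rip stays on the main thread...
            if rip(device, iso_path):
                # ...while the disk-bound dump overlaps the next rip.
                futures.append(pool.submit(dump, iso_path))
        for future in futures:
            results.append(future.result())
    return results
```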

Finally, if you've got 1500 discs to recover, you might want to invest in a DVD jukebox or auto-loader. I'm guessing new ones are still pretty expensive, but there must be people out there selling older ones for a lot cheaper. (From a quick search online, the first thing that came up was $2500 new and $240 used…)

OTHER TIPS

Writing your own backup system is not fun. Have you considered looking at ready-to-use backup solutions? There are plenty, many of them free...

If you are still set on writing your own, here are answers to your specific questions:

  • With CD/DVD, you typically first have to master the image (using a tool like mkisofs), then write the image to the medium. There are tools that wrap both operations for you (genisoimage, I believe), but this is typically the process.

  • To verify the backup quality, you'll have to read back all written files (by mounting a newly written CD) and compare their checksums against those of the original files. In order to do incremental backups, you'll have to keep archives of checksums for each file you save (with backup date, etc.).
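A minimal sketch of that bookkeeping, recording a checksum per file along with the backup date (SHA-256 and the manifest layout are my assumptions, not anything mandated):

```python
import datetime
import hashlib
import os

def checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, computed in streaming fashion."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Record a checksum for every file under root, plus the backup
    date, so later runs can verify against it or skip unchanged files."""
    manifest = {"date": datetime.date.today().isoformat(), "files": {}}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest["files"][os.path.relpath(path, root)] = checksum(path)
    return manifest

def verify(root, manifest):
    """Return relative paths that are missing or whose checksum changed."""
    bad = []
    for rel, digest in manifest["files"].items():
        path = os.path.join(root, rel)
        if not os.path.exists(path) or checksum(path) != digest:
            bad.append(rel)
    return bad
```

Run build_manifest over the originals, then verify against the mounted copy; a non-empty result lists the files that didn't survive the round trip.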

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow