Question

I have a huge directory of about 500k jpg files, and I'd like to archive all files that are older than a certain date. Currently, the script takes hours to run.

This has a lot to do with the very piss-poor performance of GoGrid's storage servers, but at the same time, I'm sure there's a far more efficient way, RAM- and CPU-wise, to accomplish what I'm doing.

Here's the code I have:

var dirInfo = new DirectoryInfo(PathToSource);
var fileInfo = dirInfo.GetFiles("*.*");
var filesToArchive = fileInfo.Where(f => 
    f.LastWriteTime.Date < StartThresholdInDays.Days().Ago().Date
      && f.LastWriteTime.Date >= StopThresholdInDays.Days().Ago().Date
);

foreach (var file in filesToArchive)
{
    file.CopyTo(Path.Combine(PathToTarget, file.Name));
}

The Days().Ago() stuff is just syntactic sugar.
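It's roughly equivalent to extension methods like these (a sketch of the idea, not the exact implementation):

static class TimeSugar
{
    // Hypothetical sugar matching the Days().Ago() calls above.
    public static TimeSpan Days(this int n) { return TimeSpan.FromDays(n); }
    public static DateTime Ago(this TimeSpan span) { return DateTime.Now - span; }
}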

OTHER TIPS

The only part that I think you could improve is dirInfo.GetFiles("*.*"). In .NET 3.5 and earlier, it returns an array with an entry for every file, which takes time to build and uses lots of RAM. In .NET 4.0, there is a new Directory.EnumerateFiles method that returns an IEnumerable<string> instead and yields results lazily as they are read from the disk. This could improve performance a bit, but don't expect miracles...
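A minimal sketch of the streaming approach, assuming .NET 4.0 and the question's PathToSource, PathToTarget and threshold variables:

var newestAllowed = DateTime.Today.AddDays(-StartThresholdInDays);
var oldestAllowed = DateTime.Today.AddDays(-StopThresholdInDays);

// EnumerateFiles yields one path at a time as the directory is read,
// so the 500k-entry array is never materialized.
foreach (var path in Directory.EnumerateFiles(PathToSource))
{
    var lastWrite = File.GetLastWriteTime(path).Date;
    if (lastWrite < newestAllowed && lastWrite >= oldestAllowed)
        File.Copy(path, Path.Combine(PathToTarget, Path.GetFileName(path)));
}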

You should consider using a third-party utility to perform the copying for you. Something like robocopy may speed up your processing significantly. See also https://serverfault.com/questions/54881/quickest-way-of-moving-a-large-number-of-files
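For example, robocopy's /MINAGE and /MAXAGE switches can express the same date window directly (hypothetical 30- and 60-day thresholds shown):

robocopy C:\source C:\archive *.jpg /MINAGE:30 /MAXAGE:60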

I'd keep in mind the 80/20 rule: if the bulk of the slowdown is file.CopyTo, and that slowdown far outweighs the cost of the LINQ query, I wouldn't worry. You can test this by removing the file.CopyTo line and replacing it with a Console.WriteLine operation. Time that versus the real copy; the difference is the overhead of GoGrid relative to the rest of the operation. My hunch is there won't be any realistic big gains on your end.
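A rough sketch of that measurement, reusing the filesToArchive query from the question:

var sw = System.Diagnostics.Stopwatch.StartNew();
foreach (var file in filesToArchive)
    Console.WriteLine(file.Name);   // dry run: enumeration and filtering only
Console.WriteLine("Enumerate + filter: {0}", sw.Elapsed);

sw = System.Diagnostics.Stopwatch.StartNew();
foreach (var file in filesToArchive)
    file.CopyTo(Path.Combine(PathToTarget, file.Name));
Console.WriteLine("Copy: {0}", sw.Elapsed);

The gap between the two timings is what the copy itself costs.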

EDIT: OK, so the 80% is the GetFiles operation, which isn't surprising if there are in fact a million files in the directory. Your best bet may be to use the Win32 API directly (FindFirstFile and family) via P/Invoke:

[DllImport("kernel32.dll", CharSet=CharSet.Auto)]
static extern IntPtr FindFirstFile(string lpFileName, 
    out WIN32_FIND_DATA lpFindFileData);
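That declaration alone won't walk a directory; you also need the WIN32_FIND_DATA struct plus FindNextFile and FindClose. A rough, lazily-evaluated sketch (the FastDirectory class and its shape are mine, not an existing API):

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;

static class FastDirectory
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh, nFileSizeLow;
        public uint dwReserved0, dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Streams one (name, lastWriteTimeUtc) pair per file. The write time comes
    // straight out of the find data, so the date filter costs no extra disk hits.
    public static IEnumerable<KeyValuePair<string, DateTime>> EnumerateFiles(string directory)
    {
        WIN32_FIND_DATA data;
        IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out data);
        if (handle == INVALID_HANDLE_VALUE)
            yield break;
        try
        {
            do
            {
                if ((data.dwFileAttributes & FileAttributes.Directory) != 0)
                    continue;
                long fileTime = ((long)data.ftLastWriteTime.dwHighDateTime << 32)
                                | (uint)data.ftLastWriteTime.dwLowDateTime;
                yield return new KeyValuePair<string, DateTime>(
                    data.cFileName, DateTime.FromFileTimeUtc(fileTime));
            }
            while (FindNextFile(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
    }
}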

I'd also suggest, if possible, altering the directory structure to decrease the number of files per directory. This will improve the situation immensely.

EDIT2: I'd also consider changing from GetFiles("*.*") to just GetFiles(). Since you're asking for everything, no sense in having it apply globbing rules at each step.

You could experiment with using a limited number of threads to perform the CopyTo() calls, as sketched below. Right now the whole operation is limited to one core.

This will only improve performance if it is currently CPU-bound, or if the storage (a RAID array, for instance) can service several requests at once.
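A sketch of that, assuming .NET 4.0's Parallel.ForEach is available (on .NET 3.5 you would hand-roll a small worker pool instead); the degree of parallelism is a knob to tune against the storage, not a recommendation:

// Requires: using System.Threading.Tasks;
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // tune for the storage
Parallel.ForEach(filesToArchive, options, file =>
{
    file.CopyTo(Path.Combine(PathToTarget, file.Name));
});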

Have a listen to this Hanselminutes podcast, where Scott talks to Aaron Bockover, the author of the Banshee media player. They ran into this exact issue and discuss it at 8:20 in the episode.

If you can use .NET 4.0, use Directory.EnumerateFiles as mentioned by Thomas Levesque. If not, you may need to write your own directory-walking code against the native Win32 APIs, as they did in Mono.Posix.

Licensed under: CC-BY-SA with attribution