Question

Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder, sum up their sizes, check whether there are subdirectories, and then recurse into each of them.
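Roughly, my code looks like this (a simplified sketch; the method name is just illustrative):

    private static long GetDirectorySize(string path)
    {
        long size = 0;

        // Sum the files directly inside the folder...
        foreach (string file in Directory.GetFiles(path))
            size += new FileInfo(file).Length;

        // ...then recurse into each subdirectory.
        foreach (string subDir in Directory.GetDirectories(path))
            size += GetDirectorySize(subDir);

        return size;
    }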

I also tried another way to do this: using FSO (obj.GetFolder(path).Size). There isn't much difference in time between these two approaches.

Now the problem is, I have tens of thousands of files in a particular folder and it takes at least 2 minutes to find the folder size. Also, if I run the program again, it happens very quickly (5 seconds). I think Windows is caching the file sizes.

Is there any way I can bring down the time taken when I run the program the first time?


Solution

I fiddled with it a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times on a quad core). I don't know if it is valid in all cases, but give it a try...

.NET 4.0 code (or use 3.5 with the Task Parallel Library)

    private static long DirSize(string sourceDir, bool recurse)
    {
        long size = 0;
        string[] fileEntries = Directory.GetFiles(sourceDir);

        foreach (string fileName in fileEntries)
        {
            Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
        }

        if (recurse)
        {
            string[] subdirEntries = Directory.GetDirectories(sourceDir);

            // Each partition keeps a thread-local subtotal, merged into the shared total at the end.
            Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
            {
                // Skip reparse points (junctions/symlinks) to avoid cycles, but keep
                // the subtotal already accumulated on this thread.
                if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
                {
                    subtotal += DirSize(subdirEntries[i], true);
                }
                return subtotal;
            },
                (x) => Interlocked.Add(ref size, x)
            );
        }
        return size;
    }
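For reference, the call itself is just this (the path shown is only an example); it returns the total size in bytes:

    long total = DirSize(@"C:\SomeBigFolder", true);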

OTHER TIPS

Hard disks are an interesting beast - sequential access (reading a big contiguous file, for example) is super zippy, figure 80 MB/sec. Random access, however, is very slow. This is what you're bumping into - recursing into the folders won't read much data (in terms of quantity), but will require many random reads. The reason you're seeing zippy performance the second time around is because the MFT is still in RAM (you're correct about the caching).

The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is that you read and parse the MFT in one linear pass, building the information you need as you go. The end result will be something much closer to 15 seconds on an HD that is very full.

Some good reading:

  • NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx

  • Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1

FWIW: this method is very complicated, as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that figuring out which folders/files are needed requires a lot of head movement on the disk. It would be very tough for Microsoft to build a general solution to the problem you describe.

The short answer is no. The way Windows could make the directory size computation faster would be to update the directory size and all parent directory sizes on each file write. However, that would make file writes slower. Since file writes are much more common than reading directory sizes, it is a reasonable tradeoff.

I am not sure what exact problem is being solved, but if it is file system monitoring it might be worth checking out: http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx

I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.

I don't think there's any really quick way of doing it, however. For comparison purposes you could try dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.

PInvoke has a sample of how to use the API.
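For illustration, here is a minimal sketch of that approach - a recursive sum using FindFirstFile/FindNextFile via P/Invoke. It needs using System, System.IO and System.Runtime.InteropServices; skipping reparse points is my own addition to avoid cycles:

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    public struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll")]
    static extern bool FindClose(IntPtr hFindFile);

    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    // Recursively sums file sizes with the native find functions,
    // skipping "." / ".." entries and reparse points.
    static long DirSizeNative(string dir)
    {
        long size = 0;
        WIN32_FIND_DATA fd;
        IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out fd);
        if (handle == INVALID_HANDLE_VALUE) return 0;
        try
        {
            do
            {
                if (fd.cFileName == "." || fd.cFileName == "..") continue;
                if ((fd.dwFileAttributes & FileAttributes.Directory) != 0)
                {
                    if ((fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0)
                        size += DirSizeNative(Path.Combine(dir, fd.cFileName));
                }
                else
                {
                    // Combine the high and low 32-bit halves into a 64-bit size.
                    size += ((long)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;
                }
            } while (FindNextFile(handle, out fd));
        }
        finally { FindClose(handle); }
        return size;
    }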

Performance will suffer using any method when scanning a folder with tens of thousands of files.

  • Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.

  • Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.

  • How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.

  • You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.

  • I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using BIOS).

  • Yes, the operating-system caches file access.

Suggestions

  • I wonder if BackupRead would help in your case?

  • What if you shell out to DIR and capture then parse its output? (You are not really parsing because each DIR line is fixed-width, so it is just a matter of calling substring.)

  • What if you shell out to DIR /B > NUL on a background thread and then run your program? While DIR is running, you will benefit from the cached file access. (A rough sketch of this idea follows the list.)
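A rough sketch of that last suggestion, assuming the helper name and the exact DIR switches (they are illustrative, not prescriptive) - the point is just to let DIR touch the metadata so it lands in the cache before the real scan:

    // Runs DIR over the folder and discards its output, purely to warm the cache.
    private static void WarmCacheWithDir(string path)
    {
        var psi = new System.Diagnostics.ProcessStartInfo("cmd.exe",
            "/c dir /a /s /b \"" + path + "\" > nul")
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var p = System.Diagnostics.Process.Start(psi))
        {
            p.WaitForExit();
        }
    }

    // Kick it off on a background thread, then start your own scan:
    // System.Threading.Tasks.Task.Factory.StartNew(() => WarmCacheWithDir(@"C:\SomeBigFolder"));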

Based on the answer by spookycoder, I found this variation (using DirectoryInfo) to be at least 2 times faster (and up to 10 times faster on complex folder structures!):

    public static long CalcDirSize(string sourceDir, bool recurse = true)
    {
        return _CalcDirSize(new DirectoryInfo(sourceDir), recurse);
    }

    private static long _CalcDirSize(DirectoryInfo di, bool recurse = true)
    {
        long size = 0;
        FileInfo[] fiEntries = di.GetFiles();
        foreach (var fiEntry in fiEntries)
        {
            Interlocked.Add(ref size, fiEntry.Length);
        }

        if (recurse)
        {
            DirectoryInfo[] diEntries = di.GetDirectories("*.*", SearchOption.TopDirectoryOnly);
            System.Threading.Tasks.Parallel.For<long>(0, diEntries.Length, () => 0, (i, loop, subtotal) =>
            {
                // Skip reparse points, but keep the subtotal already accumulated on this thread.
                if ((diEntries[i].Attributes & FileAttributes.ReparsePoint) == FileAttributes.ReparsePoint) return subtotal;
                subtotal += _CalcDirSize(diEntries[i], true);
                return subtotal;
            },
                (x) => Interlocked.Add(ref size, x)
            );

        }
        return size;
    }

With tens of thousands of files, you're not going to win with a head-on assault. You need to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes you to calculate the size, the files have changed and your data is already wrong.

So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.

It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember all data is stale!)

Then, the only thing to look out for would be multiple processes trying to access the resulting output file at the same time. So just make sure that you take that into account.
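To make the idea concrete, here is a minimal sketch of such an index, assuming a single watched root and per-file sizes kept in a dictionary (the class and member names are illustrative; a real service would also buffer event bursts, handle the watcher's Error event, and rescan periodically):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Linq;

    // Keeps a per-file size index for one root folder and updates it from
    // FileSystemWatcher events, so the total is available without rescanning.
    public class DirectorySizeIndex : IDisposable
    {
        private readonly ConcurrentDictionary<string, long> _sizes =
            new ConcurrentDictionary<string, long>(StringComparer.OrdinalIgnoreCase);
        private readonly FileSystemWatcher _watcher;

        public DirectorySizeIndex(string root)
        {
            // One full scan at startup; afterwards only events touch the index.
            foreach (FileInfo fi in new DirectoryInfo(root).EnumerateFiles("*", SearchOption.AllDirectories))
                _sizes[fi.FullName] = fi.Length;

            _watcher = new FileSystemWatcher(root) { IncludeSubdirectories = true };
            _watcher.Created += (s, e) => Update(e.FullPath);
            _watcher.Changed += (s, e) => Update(e.FullPath);
            _watcher.Deleted += (s, e) => { long old; _sizes.TryRemove(e.FullPath, out old); };
            _watcher.Renamed += (s, e) => { long old; _sizes.TryRemove(e.OldFullPath, out old); Update(e.FullPath); };
            _watcher.EnableRaisingEvents = true;
        }

        // Current total in bytes; slightly stale by design.
        public long TotalSize
        {
            get { return _sizes.Values.Sum(); }
        }

        private void Update(string path)
        {
            try
            {
                if (File.Exists(path))              // ignore directory-level events
                    _sizes[path] = new FileInfo(path).Length;
            }
            catch (IOException) { /* file locked or already gone - skip for this sketch */ }
        }

        public void Dispose()
        {
            _watcher.Dispose();
        }
    }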

I gave up on the .NET implementations (for performance reasons) and used the native function GetFileAttributesEx(...).

Try this:

    [StructLayout(LayoutKind.Sequential)]
    public struct WIN32_FILE_ATTRIBUTE_DATA
    {
        public uint fileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
        public uint fileSizeHigh;
        public uint fileSizeLow;
    }

    public enum GET_FILEEX_INFO_LEVELS
    {
        GetFileExInfoStandard,
        GetFileExMaxInfoLevel
    }

    public class NativeMethods
    {
        [DllImport("KERNEL32.dll", CharSet = CharSet.Auto)]
        public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS level, out WIN32_FILE_ATTRIBUTE_DATA data);
    }

Now simply do the following:

    WIN32_FILE_ATTRIBUTE_DATA data;
    if (NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
    {
        // Combine the high and low 32-bit halves into a 64-bit size (in bytes).
        long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
    }
Licensed under: CC-BY-SA with attribution