Question

I have a physical directory structure like this:

Root directory (X) -> many subdirectories inside the root (1, 2, 3, 4, ...) -> many files in each subdirectory.

Photos(Root)    
            ----        
               123456789(Child One)
                 ----
                     1234567891_w.jpg (Child two)
                     1234567891_w1.jpg(Child two)
                     1234567891_w2.jpg(Child two)
                     1234567892_w.jpg (Child two)
                     1234567892_w1.jpg(Child two)
                     1234567892_w2.jpg(Child two)
                     1234567893_w.jpg(Child two)
                     1234567893_w1.jpg(Child two)
                     1234567893_w2.jpg(Child two)
                     -----Cont      
              232344343(Child One)      
              323233434(Child One)      
              232323242(Child One)      
              232324242(Child One)      
              ----Cont..

In the database I have one table containing a huge number of names of the form "1234567891_w.jpg".

NOTE: Both the number of records in the database and the number of photos are in lakhs (hundreds of thousands).

I need an efficient and fast way to check whether each name from the database table is present in the physical directory structure.

  • Ex: Whether any file named "1234567891_w.jpg" is present in a physical folder inside Photos (Root).

Please let me know if I have missed any information that should be given here.

Update :

I know how to check whether a file name exists in a directory. But I am looking for an efficient way, as it would consume too many resources to check the existence of each file name (from lakhs of records) in more than 40 GB of data.

Solution 2

It might sound funny, or maybe I was unclear or did not provide enough information.

But the directory pattern gave me one nice way to handle it:

Since a given file name can exist in only one location, namely:

Root/SubDir/filename

I should be using:

File.Exists(Path.Combine(root, subDir, fileName));

i.e. Photos/123456789/1234567891_w.jpg

And I think this will be O(1) per file name, since no directory has to be scanned.
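A minimal sketch of that direct check follows. It assumes the subdirectory name is the first 9 characters of the file name, which is only inferred from the example Photos/123456789/1234567891_w.jpg; PhotoLookup, BuildPath, PhotoExists and PrefixLength are hypothetical names, and the real naming rule may differ.

```csharp
using System;
using System.IO;

static class PhotoLookup
{
    // Assumption: the subdirectory is the first 9 characters of the file name,
    // as in Photos/123456789/1234567891_w.jpg. Adjust this to the real rule.
    const int PrefixLength = 9;

    // Builds the single path where the file is expected to live.
    public static string BuildPath(string root, string fileName)
    {
        var subDir = fileName.Length >= PrefixLength
            ? fileName.Substring(0, PrefixLength)
            : fileName;
        return Path.Combine(root, subDir, fileName);
    }

    // One existence check per name; no directory is listed or scanned.
    public static bool PhotoExists(string root, string fileName)
    {
        return File.Exists(BuildPath(root, fileName));
    }
}
```

Each name from the database then costs one call such as PhotoLookup.PhotoExists(@"D:\Photos", name), regardless of how many subdirectories exist.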

OTHER TIPS

You can try grouping the names from the database by the directory they belong to. Sort them somehow (by file name, for instance) and then get the array of files within that directory with string[] filePaths = Directory.GetFiles(@"c:\MyDir\");. Now you only have to compare strings.
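The grouping idea could be sketched roughly as below. The rule that maps a file name to its directory (first 9 characters) is an assumption taken from the question's example, and GroupedCheck, DirectoryOf and FindMissing are hypothetical names. A HashSet replaces the sort-and-compare step, and the case-insensitive comparer assumes a Windows-style file system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class GroupedCheck
{
    // Assumption: the directory name is the first 9 characters of the file name.
    static string DirectoryOf(string fileName)
    {
        return fileName.Length >= 9 ? fileName.Substring(0, 9) : fileName;
    }

    // Returns the database names that have no matching file on disk.
    public static List<string> FindMissing(string root, IEnumerable<string> dbNames)
    {
        var missing = new List<string>();
        foreach (var group in dbNames.GroupBy(DirectoryOf))
        {
            var dir = Path.Combine(root, group.Key);
            // List each directory only once, however many names point into it.
            var onDisk = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
            if (Directory.Exists(dir))
            {
                foreach (var path in Directory.EnumerateFiles(dir))
                    onDisk.Add(Path.GetFileName(path));
            }
            missing.AddRange(group.Where(name => !onDisk.Contains(name)));
        }
        return missing;
    }
}
```

Only one directory listing is held in memory at a time, which keeps the footprint small compared with indexing the whole tree.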

It would seem the files are uniquely named. If that's the case, you can do something like this:

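// Names to look for, e.g. "1234567891_w.jpg"
var fileNames = GetAllFileNamesFromDb();
// Directory.GetFiles accepts only one search pattern, so list everything
// under the root once and do the matching in memory.
var physicalFiles = Directory.GetFiles(rootDir, "*", SearchOption.AllDirectories)
                             .Select(f => Path.GetFileName(f));
var setOfFiles = new HashSet<string>(physicalFiles);
// Names from the database that have no matching file on disk
var notPresent = from name in fileNames
                 where !setOfFiles.Contains(name)
                 select name;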
  • First get all the names of the files from the database.
  • Then list all the physical files at once, starting from the root and including all subdirectories.
  • Create a HashSet for fast lookup.
  • Then match the fileNames against the set; those not in the set are selected.

The HashSet is basically just a set, that is, a collection that can include an item only once (i.e. there are no duplicates). Equality in the HashSet is based on the hash code, and the lookup to determine whether an item is in the set is O(1) on average.

This approach requires you to store a potentially huge HashSet in memory, and depending on the size of that set it might affect the system to the extent that it no longer speeds up the application but slows it down instead.

As is the case with most optimizations, this is a trade-off, and the key is finding the balance between all the trade-offs in the context of the value the application produces for the end user.

Unfortunately there is no magic bullet you could use to improve your performance. As always, it will be a trade-off between speed and memory. Also, there are two sides that could be short on performance: the database side and the HDD I/O speed.

So to gain speed, I would first improve the performance of the database query to ensure that it can return the names fast enough. Make sure your query is fast, and maybe also use (in the MS SQL case) keywords like READ SEQUENTIAL; in that case you already retrieve the first results while the query is still running, and you don't have to wait until the query has finished and handed you the names as one big block.

On the HDD side, you can call Directory.GetFiles(), but this call blocks until it has iterated over all files and then gives you back one big array containing all file names. This is the memory-consuming path and takes a while for the first search, but if you afterwards work only on that array, all subsequent searches are faster. Another approach is to call Directory.EnumerateFiles(), which walks the drive lazily as you iterate, so the first search may finish sooner. Nothing is kept in memory for the next search, though, which improves the memory footprint but costs speed, because there is no array in memory that could be searched. On the other hand, the OS also does some caching at a lower level if it detects that you iterate over the same files over and over again.

So for the check on the HDD side, use Directory.GetFiles() if the returned array won't blow your memory, and do all your searches on it (maybe put it into a HashSet to further improve performance; whether it should hold the file name only or the full path depends on what you get from your database). Otherwise use Directory.EnumerateFiles() and hope for the best from the caching done by the OS.
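The two calls can also be combined: a rough sketch, assuming the database stores bare file names, is to stream paths from Directory.EnumerateFiles() straight into a HashSet, so the intermediate array from GetFiles() is never built. FileIndex and Build are hypothetical names, and the case-insensitive comparer assumes a Windows-style file system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FileIndex
{
    // Walks the tree lazily and keeps only the file names, not the full paths.
    public static HashSet<string> Build(string rootDir)
    {
        var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var path in Directory.EnumerateFiles(rootDir, "*", SearchOption.AllDirectories))
        {
            names.Add(Path.GetFileName(path));
        }
        return names;
    }
}
```

After the one-time walk, each database name is checked with index.Contains(name). The set still has to fit in memory, so this only helps when that is the case.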

Update

After re-reading your question and comments, as far as I understand, you have a name like 1234567891_w.jpg and you don't know which part of the name represents the directory. In this case you need to make an explicit search, because iterating through all directories simply takes too much time. Here is some sample code which should give you an idea of how to solve this as a first shot:

string rootDir = @"D:\RootDir";

// Iterate over all files reported from the database
foreach (var filename in databaseResults)
{
    var fullPath = Path.Combine(rootDir, filename);

    // Check if the file exists within the root directory
    if (File.Exists(fullPath))
    {
        // Report that the file exists.
        DoFileFound(fullPath);
        // Fast exit to continue with next file.
        continue;
    }

    var directoryFound = false;

    // Use the filename as a directory
    var directoryCandidate = Path.GetFileNameWithoutExtension(filename);
    do
    {
        // Recompute the path on every pass so the shortened candidate is actually tested.
        fullPath = Path.Combine(rootDir, directoryCandidate);
        // Check if a directory with the given name exists
        if (Directory.Exists(fullPath))
        {
            // Check if the filename within this directory exists
            var foundFile = Path.Combine(fullPath, filename);
            if (File.Exists(foundFile))
            {
                // Report that the file exists, passing the file path as in the root-level check.
                DoFileFound(foundFile);
                directoryFound = true;
            }

            // Fast exit, because we looked into the directory.
            break;
        }

        // Is it possible that a shorter directory name
        // exists where this file exists??
        // If yes, we have to continue the search ...
        // (Alternative code to the above one)
        ////// Check if a directory with the given name exists
        ////if (Directory.Exists(fullPath))
        ////{
        ////    // Check if the filename within this directory exists
        ////    if (File.Exists(Path.Combine(fullPath, filename)))
        ////    {
        ////        // Report that the file exists.
        ////        DoFileFound(fullPath);

        ////        // Fast exit, cause we found the file.
        ////        directoryFound = true;
        ////        break;
        ////    }
        ////}

        // Shorten the directory name for the next candidate
        directoryCandidate = directoryCandidate.Substring(0, directoryCandidate.Length - 1);
    } while (!directoryFound
              && !String.IsNullOrEmpty(directoryCandidate));

    // We did our best but we found nothing.
    if (!directoryFound)
        DoFileNotAvailable(filename);
}

The only further performance improvement I can think of would be to put the directories found into a HashSet and consult it before checking with Directory.Exists(). But maybe this wouldn't gain anything, because the OS already does some caching of directory lookups and could be nearly as fast as your local cache. For these things you simply have to measure your concrete problem.
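One way to try that idea, as a sketch only, is to read the names of the first-level subdirectories once and test candidates against the set instead of calling Directory.Exists() each time. DirectoryNames and Load are hypothetical names, the sketch assumes the directories do not change while the check runs, and whether it beats the OS cache has to be measured.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class DirectoryNames
{
    // Reads the names of the direct subdirectories of rootDir into a set.
    public static HashSet<string> Load(string rootDir)
    {
        var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var dir in Directory.EnumerateDirectories(rootDir))
        {
            // EnumerateDirectories returns full paths; keep the last segment only.
            names.Add(Path.GetFileName(dir));
        }
        return names;
    }
}
```

Inside the loop above, a test like knownDirs.Contains(directoryCandidate) would then stand in for Directory.Exists(fullPath).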

Licensed under: CC-BY-SA with attribution