Question

I am implementing a file search with an exception list of directories to skip. The problem is that the code below recursively iterates through every file on the hard drive. It works, but it is slow, so I need help optimizing its performance. Thanks in advance.

    CFileFind finder;

    // build a string with wildcards
    CString strWildcard(directory);
    strWildcard += _T("\\*.*");

    // start working for files
    BOOL bWorking = finder.FindFile(strWildcard);

    while (bWorking)
    {
        bWorking = finder.FindNextFile();

        if (finder.IsDots())
            continue;

        // if it's a directory, recursively search it
        if (finder.IsDirectory())
        {
            CString str = finder.GetFilePath();
            if (NULL == m_searchExceptions.Find(str))
            {
                _recursiveSearch(str);
            }
            else
            {
                continue;
            }
        }

        // basic comparison; can be replaced by a strategy pattern if a more
        // complicated comparison is required (e.g. a regex)
        if (0 == finder.GetFileName().CompareNoCase(m_searchPattern))
        {
            if (m_currentSearchResults.Find(finder.GetFilePath()) == NULL)
            {
                m_currentSearchResults.AddHead(finder.GetFilePath());
            }
        }
    }

Solution

I don't think you're going to be able to optimize performance here. No matter what you do on your end, you're going to be spending 80+% of your time inside FindFirstFile and FindNextFile (Windows API calls).

I asked a similar question already and have yet to get an answer.

OTHER TIPS

Looks like your m_currentSearchResults is a list, and each time you find a matching file name you check whether it is already in the list. When there are lots of found files (say, hundreds), this can become a bottleneck, since the overall cost is O(N^2). If that is the case, consider using a CMap instead (MFC's CMap is hash-based, so lookups are roughly constant time). A set would be even more appropriate than a map; MFC doesn't have one, but the standard library's std::set gives you O(log N) search.
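
A minimal sketch of that change, assuming the results are currently kept in a CStringList. The member name below mirrors the question's code, but the std::set version is an illustration, not the asker's actual class:

    #include <set>      // assumes the MFC headers (CString) are already included

    std::set<CString> m_currentSearchResults;   // was: CStringList

    // Inside the search loop the Find() + AddHead() pair collapses to a single
    // O(log N) insert; std::set silently ignores paths it already contains.
    if (0 == finder.GetFileName().CompareNoCase(m_searchPattern))
    {
        m_currentSearchResults.insert(finder.GetFilePath());
    }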

How slow? Did you profile it? If you're recursively searching files on your hard disk, it's extremely likely you're I/O bound and there's nothing you can do short of getting faster storage hardware (like a solid-state drive).

You're doing a general search for a file. There are a million products out there that do this well, and they all use indexing as an optimization. The weak link here is certainly your disk, not your code. Comparing 1,000,000 strings will take no time at all compared to the time it takes to enumerate 1,000,000 files on disk.

There are two fundamental performance issues here: hard drive access and directory traversal. You may be able to optimize both.

Hard Drive Optimization

A hard drive at rest tends to stay at rest; a spinning platter likes to keep spinning. That said, the bottlenecks in hard drive access are spin-up time, seek time and read time. Reducing the number of accesses and increasing the amount of data per read will improve your performance.

Memory access is faster than hard drive access. So haul large chunks of data into memory, then search memory.
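
As a concrete (and hedged) example of getting more data per read: on Windows 7 and later, FindFirstFileEx accepts the FIND_FIRST_EX_LARGE_FETCH flag, which asks the system to use a larger transfer buffer for each directory-enumeration call. A sketch in plain Win32, outside the question's CFileFind wrapper:

    #include <windows.h>

    // Sketch: enumerate one directory with a larger kernel transfer buffer.
    // FindExInfoBasic also skips the 8.3 short-name lookup, which saves time
    // on volumes where short names are enabled. Requires Windows 7 or later.
    void EnumerateDirectory(const wchar_t* wildcard)   // e.g. L"C:\\some\\dir\\*"
    {
        WIN32_FIND_DATAW data;
        HANDLE hFind = FindFirstFileExW(wildcard,
                                        FindExInfoBasic,           // no short names
                                        &data,
                                        FindExSearchNameMatch,
                                        NULL,
                                        FIND_FIRST_EX_LARGE_FETCH); // bigger reads
        if (hFind == INVALID_HANDLE_VALUE)
            return;

        do {
            // data.cFileName / data.dwFileAttributes are available here
        } while (FindNextFileW(hFind, &data));

        FindClose(hFind);
    }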

Optimizing Directory Search

Imagine, if you would, a tree of "pages". Each node in the tree is a directory of zero or more directories or files. Unfortunately, in most operating systems, this data structure is not optimized for efficient searching.

The ideal situation is to haul in all the relevant directories into memory then search them (in memory). Once the location of the file is known, random access to the file is relatively quick. The problem is reducing search time by only reading the relevant directories; i.e. reducing the number of irrelevant directory reads.

Most applications that perform file searching on a hard drive read the drive once and build their own optimized data structure(s). This may not pay off for huge drives with enormous quantities of files, or when only a few searches will be performed.
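
If the same directories will be searched repeatedly, one rough illustration of the "read once, search in memory" idea is to walk the tree a single time and keep the file names in an in-memory index. The names below (BuildIndex, Lookup) are hypothetical, not part of the question's class:

    #include <filesystem>
    #include <string>
    #include <unordered_map>
    #include <vector>
    #include <algorithm>
    #include <cwctype>

    // Hypothetical one-shot index: lower-cased file name -> full paths.
    using FileIndex = std::unordered_multimap<std::wstring, std::filesystem::path>;

    static std::wstring ToLower(std::wstring s)
    {
        std::transform(s.begin(), s.end(), s.begin(), ::towlower);
        return s;
    }

    // Walk the tree once (one pass of slow disk I/O) and remember every name.
    FileIndex BuildIndex(const std::filesystem::path& root)
    {
        FileIndex index;
        std::error_code ec;  // swallow permission errors instead of throwing
        for (auto it = std::filesystem::recursive_directory_iterator(
                 root, std::filesystem::directory_options::skip_permission_denied, ec);
             it != std::filesystem::recursive_directory_iterator(); it.increment(ec))
        {
            if (!ec && it->is_regular_file(ec))
                index.emplace(ToLower(it->path().filename().wstring()), it->path());
        }
        return index;
    }

    // Later searches never touch the disk.
    std::vector<std::filesystem::path> Lookup(const FileIndex& index,
                                              const std::wstring& name)
    {
        std::vector<std::filesystem::path> hits;
        auto range = index.equal_range(ToLower(name));
        for (auto it = range.first; it != range.second; ++it)
            hits.push_back(it->second);
        return hits;
    }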

If you can, tell the OS to keep as many directories in memory as possible.

Improving Performance: Reducing Other Applications

For some applications, perceived performance depends on the other applications running at the same time. Running a compiler and an Internet search concurrently will slow down most other applications. So try to avoid running unnecessary applications concurrently with yours. Also, consider raising the priority of your application.
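
On Windows, raising the priority is a one-line call; use it with care, since giving an I/O-heavy search more CPU priority can make the rest of the desktop feel sluggish. A sketch:

    #include <windows.h>

    // Nudge the scheduler in this process's favour. ABOVE_NORMAL is usually a
    // safer choice than HIGH_PRIORITY_CLASS for a disk-bound search.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
    {
        // GetLastError() explains why the request was refused
    }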

+1 for profiling it first to be sure. Also, this seems like a problem that could be solved using the Task Parallel Library: launch a task as you see each directory and use all the cores on your CPU.
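
The Task Parallel Library is a .NET facility; in the question's C++ world the same idea can be sketched with std::async and one task per top-level subdirectory. FindMatches and ParallelSearch below are hypothetical names, C++17 <filesystem> is used for brevity, and on a single spinning disk the gain may be modest because the work is I/O bound:

    #include <filesystem>
    #include <future>
    #include <string>
    #include <vector>
    #include <wchar.h>     // _wcsicmp (MSVC CRT)

    namespace fs = std::filesystem;

    // Recursively collect files below dir whose name matches (case-insensitive).
    std::vector<fs::path> FindMatches(const fs::path& dir, const std::wstring& name)
    {
        std::vector<fs::path> hits;
        std::error_code ec;
        for (fs::recursive_directory_iterator it(
                 dir, fs::directory_options::skip_permission_denied, ec), end;
             it != end; it.increment(ec))
        {
            if (!ec && it->is_regular_file(ec) &&
                _wcsicmp(it->path().filename().c_str(), name.c_str()) == 0)
            {
                hits.push_back(it->path());
            }
        }
        return hits;
    }

    // One task per top-level subdirectory; files directly under root are
    // handled inline so no subtree is visited twice.
    std::vector<fs::path> ParallelSearch(const fs::path& root, const std::wstring& name)
    {
        std::vector<std::future<std::vector<fs::path>>> tasks;
        std::vector<fs::path> hits;
        std::error_code ec;

        for (fs::directory_iterator it(root, ec), end; it != end; it.increment(ec))
        {
            if (ec)
                break;
            if (it->is_directory(ec))
                tasks.push_back(std::async(std::launch::async,
                                           FindMatches, it->path(), name));
            else if (it->is_regular_file(ec) &&
                     _wcsicmp(it->path().filename().c_str(), name.c_str()) == 0)
                hits.push_back(it->path());
        }

        for (auto& task : tasks)            // gather the per-directory results
            for (auto& p : task.get())
                hits.push_back(std::move(p));
        return hits;
    }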

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow