Question

I have the following function as part of a larger program that crawls the contents of a provided path and indexes any .htm or .html pages it finds in the parent folder or any subfolders. My crawler function (below) is recursive and seems to work just fine until it enters a subfolder that contains no items.

This seems to be a common problem that is often fixed by structuring the while loop as follows:

while ( false !== ($file = readdir($folder)) )

but this isn't working. The last line that gets output is 'The current crawler path is ...', and then the output just stops. I'm guessing the problem is the empty folder and the readdir function, but I don't know how to fix it. Can someone offer a suggestion?

Thanks

function crawlFolders($path)
{
    $prevPath = $path;  // variable to keep track of the previous file path
    chdir($path);
    $folder = opendir($path);

    echo "The current crawler path is ".$path."<br>";

    while ( false !== ($file = readdir($folder)) ) // read current directory item, then advance pointer
    {   
        if ( is_file($file) )
        {   echo "File found!  The crawler is inspecting to see if it can be indexed<br>";
            if ( canIndex($path."/".$file) )
                indexPage($path."/".$file);
        }

        else if ( is_dir($file) ) 
        {
            //it's a folder, we must crawl
            if ( ($file != ".") && ($file != "..") )    //it's a folder, we must crawl
            {
                echo "$file is a folder<br><br>";
                crawlFolders($path."/".$file);
                chdir($prevPath); // change the working dir back to that of the calling fn

            }
        }   
    }
    closedir($folder);

}

After looking at this some more, I can't see why readdir is causing the problem. I think the problem may be that my crawlFolders function is not unwinding itself, and is instead just ending when it reaches the deepest, empty folder. Am I missing something with the way the recursion should work? I was under the impression that the recursive function calls would exit once the while loop returned false, thus dropping me to the previous crawlFolders function that made the recursive call (i.e. unwinding itself).

Do I need to return a value each time crawlFolders exits, so that the calling function knows where to resume itself?

It definitely seems like the recursion is the problem. I placed a file in the empty folder and my indexer worked, but the functions still didn't unwind as I wanted. I know the unwinding isn't happening because there are still two files in the starting path that weren't evaluated.


Solution

The problem isn't the recursion but very likely the current working directory.

You change the current directory using chdir() and then pass $file, a relative filename, to is_file() and is_dir(). Both functions resolve a relative name against the current working directory, not against $path. If the working directory is still the subdirectory after execution returns from the recursion, is_file($file) and is_dir($file) won't find the remaining entries of the parent folder.
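
To illustrate how the working directory affects these checks, here is a small sketch. The temporary folder and the file name page.html are made up for the demonstration and are not part of the question's code:

```php
<?php
// Build a throwaway tree: $root/sub/page.html (illustrative names only).
$root = sys_get_temp_dir() . '/cwd_demo_' . uniqid();
mkdir($root . '/sub', 0777, true);
file_put_contents($root . '/sub/page.html', '<html></html>');

$original = getcwd();

chdir($root . '/sub');
clearstatcache();                    // defensive: drop any cached stat result
$insideSub = is_file('page.html');   // resolved against $root/sub

chdir($root);
clearstatcache();
$insideRoot = is_file('page.html');  // resolved against $root, where no such file exists

// A full path does not depend on the working directory at all.
$withFullPath = is_file($root . '/sub/page.html');

var_dump($insideSub, $insideRoot, $withFullPath);

// Clean up and restore the working directory.
chdir($original);
unlink($root . '/sub/page.html');
rmdir($root . '/sub');
rmdir($root);
```

The same relative name gives a different answer depending on where the script currently "is", while the full path gives the same answer everywhere.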

You have to save the current directory before you go into the recursion and restore it afterwards or, better, avoid chdir() altogether and work with full paths: is_file($path . '/' . $file)
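
A minimal sketch of the full-path approach might look like the following. The function name findHtmlPages() and the choice to return a list of paths are assumptions made for illustration; canIndex() and indexPage() are not shown in the question, so they are left out here:

```php
<?php
// Recursively collect .htm/.html files under $path without ever changing
// the working directory. Hypothetical helper, not the asker's original code.
function findHtmlPages(string $path): array
{
    $pages = [];
    $folder = opendir($path);
    if ($folder === false) {
        return $pages; // directory could not be opened: skip it
    }

    while (false !== ($file = readdir($folder))) {
        if ($file === '.' || $file === '..') {
            continue; // never recurse into the folder itself or its parent
        }
        $fullPath = $path . '/' . $file;

        if (is_dir($fullPath)) {
            // An empty subfolder simply yields an empty array, and the
            // loop in the caller continues with the next entry.
            $pages = array_merge($pages, findHtmlPages($fullPath));
        } elseif (is_file($fullPath) && preg_match('/\.html?$/i', $file)) {
            $pages[] = $fullPath;
        }
    }
    closedir($folder);

    return $pages;
}
```

In the original crawler, the line that appends to $pages is roughly where canIndex($fullPath) and indexPage($fullPath) would be called. Because every check uses $fullPath, no chdir() or $prevPath bookkeeping is needed, and the function does not need to return anything for the recursion to unwind; the return value here is only a convenience.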

Licensed under: CC-BY-SA with attribution