Problem

Depth-first search is a horrible way to search a file system -- in practice, a file sitting just below a directory close to the root can take a long time to find, because DFS gets distracted by some other deep, irrelevant hierarchy of directories first.
Its resource requirements, however, are very good -- the number of file handles it needs to keep open is only proportional to the depth of the hierarchy, not its size.
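
For concreteness, here is a minimal sketch (Python, using os.scandir; the names and structure are mine, not part of the original question) of the kind of recursive DFS being described. At any moment the number of open directory handles is at most the current recursion depth:

    import os

    def dfs_search(root, name):
        """Plain recursive DFS: open handles are proportional to depth, not tree size."""
        try:
            entries = os.scandir(root)
        except OSError:
            return None                  # unreadable directory -- skip it
        with entries:                    # this handle stays open only while we are below it
            for entry in entries:
                if entry.name == name:
                    return entry.path
                if entry.is_dir(follow_symlinks=False):
                    found = dfs_search(entry.path, name)   # descends before finishing siblings
                    if found is not None:
                        return found
        return None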

Breadth-first search is the obvious solution -- it finds files near the root quickly, and a full traversal costs no extra total time.
(Last time I measured, it took around the same time as DFS on my system, roughly 8 seconds.)

Yet BFS has its own problem: it requires keeping a very large number of directory handles open, potentially in the millions. (On my system it's around 100,000 handles, which is already ridiculously high, and it could easily be more.)
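
How the handle count blows up depends on the implementation, but assuming (as the question describes) that every discovered directory is opened as soon as it is found and kept open until it reaches the front of the queue, a sketch looks like this -- the number of open handles tracks the size of the BFS frontier:

    import os
    from collections import deque

    def bfs_search(root, name):
        """BFS whose frontier is a queue of already-open directory iterators."""
        queue = deque([os.scandir(root)])
        while queue:
            it = queue.popleft()
            with it:
                for entry in it:
                    if entry.name == name:
                        for leftover in queue:      # close whatever is still queued
                            leftover.close()
                        return entry.path
                    if entry.is_dir(follow_symlinks=False):
                        try:
                            # Opened now, enumerated much later: this is where
                            # handles pile up on wide trees.
                            queue.append(os.scandir(entry.path))
                        except OSError:
                            pass
        return None

Queuing plain paths instead of open iterators would avoid the handle problem, at the cost that a queued directory can be renamed, unmounted, or deleted before it is actually enumerated.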

This causes several issues:

  • Keeping open so many handles consumes not only memory (which is relatively cheap anyway), but many other kinds of resources, such as handles to files inside virtual file systems (networks, mounted directories, etc.), and possibly other limited, kernel-level resources.

  • It also causes practical problems for the user: for example, a virtual directory that is being held open can no longer be closed, which means the user might be unable to close a program, eject a device, or shut down some external connection. All sorts of issues can arise from this.

It might seem like iterative deepening, then, is the solution.

The problem? It's very slow in practice.
The trouble is that large directories (such as WinSxS on Windows) are re-enumerated for every depth level, even though they don't need to be. The last time I tried this, iterative deepening was slower than DFS on my system by a factor of ~15, so an 8-second search took around 120 seconds, which is not acceptable.
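
For reference, this is the shape of the iterative-deepening traversal being measured (a sketch with hypothetical names): a depth-limited DFS restarted from the root with an ever larger limit, so every shallow directory -- including huge ones like WinSxS -- is enumerated once per pass:

    import os

    def depth_limited_search(root, name, limit):
        """DFS that refuses to descend more than `limit` levels below `root`."""
        try:
            entries = os.scandir(root)
        except OSError:
            return None
        with entries:
            for entry in entries:
                if entry.name == name:
                    return entry.path
                if entry.is_dir(follow_symlinks=False) and limit > 0:
                    found = depth_limited_search(entry.path, name, limit - 1)
                    if found is not None:
                        return found
        return None

    def iterative_deepening_search(root, name, max_depth=32):
        # Each pass starts over from the root, so large shallow directories
        # are re-read on every iteration -- the slowdown described above.
        for limit in range(max_depth + 1):
            found = depth_limited_search(root, name, limit)
            if found is not None:
                return found
        return None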

Of course, attempting to keep track of directories you shouldn't open again (perhaps because you noticed you don't need to revisit them) defeats the purpose of using iterative deepening in the first place, since it reintroduces all the resource problems we had with BFS.

So, the question is simple enough:

If you're searching a file system you're not familiar with, how should you proceed to achieve an acceptable balance between speed and usability? Is there a better choice than BFS?


Solution

If you really have no guidance whatsoever about where the file might be, then I don't think there's a lot you can do. You should try to minimize seeks and seek time with some tricks, but filesystems get fragmented and there's no way for you to know about that, so it's hard to do much there. Searching the files in a directory before searching a subdirectory should be faster on a lot of filesystems, especially if you're looking for a small file that may have been inlined. Not exhausting kernel resources with a full BFS is also a good thing.
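
The "files before subdirectories" ordering is easy to get by buffering one directory listing at a time; a sketch (my own naming, not the answerer's code):

    import os

    def dfs_files_first(root, name):
        """DFS that checks every entry of a directory before descending into any subdirectory."""
        try:
            with os.scandir(root) as it:
                entries = list(it)       # one short-lived handle per directory
        except OSError:
            return None
        subdirs = []
        for entry in entries:
            if entry.name == name:
                return entry.path
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
        for path in subdirs:
            found = dfs_files_first(path, name)
            if found is not None:
                return found
        return None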

Even if you only have a hint of where the file may be, that can help a lot. For example, if it's a file that the user put somewhere and then forgot the location of, start with the home directory, the temp directory, and the root of the drive, and do a DFS up to a reasonable recursion limit (e.g. 6-8 would find any manually placed or automatically downloaded file on my Windows or OS X machines), on the theory that users typically don't end up with deep trees by accident, whereas automatically generated hierarchies can get deep. If that search fails, go back and search the deep directories you skipped earlier (see the sketch below). If the file is that lost, the search is going to be slow anyway, so fall back on DFS to be safe and avoid causing too many problems as the user continues to use the machine.
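
One way to structure that two-phase strategy (a sketch under my own assumptions about names and defaults; the depth limit of 6-8 is the answer's rule of thumb):

    import os

    def staged_search(name, candidate_roots, shallow_limit=8):
        """Phase 1: shallow DFS from likely roots; phase 2: DFS over what was skipped."""
        skipped = []                                 # directories too deep for phase 1

        def limited_dfs(root, depth_left):
            try:
                with os.scandir(root) as it:
                    entries = list(it)
            except OSError:
                return None
            subdirs = []
            for entry in entries:
                if entry.name == name:
                    return entry.path
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
            for path in subdirs:
                if depth_left == 0:
                    skipped.append(path)             # remember it for the fallback pass
                    continue
                found = limited_dfs(path, depth_left - 1)
                if found is not None:
                    return found
            return None

        # Phase 1: likely locations, bounded depth.
        for root in candidate_roots:
            found = limited_dfs(root, shallow_limit)
            if found is not None:
                return found

        # Phase 2: the file really is lost -- work through the skipped deep
        # directories in DFS-like (last-in, first-out) order.
        while skipped:
            found = limited_dfs(skipped.pop(), shallow_limit)
            if found is not None:
                return found
        return None

A call might look like staged_search("notes.txt", [os.path.expanduser("~"), "/tmp", "/"]) on a Unix-like system; since the candidate roots can overlap, some directories may be visited twice, which a real implementation would deduplicate.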

And most importantly, if the system has any kind of search index, check that first, even if it means writing a lot more code to support it.
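
Which index exists (if any) is platform-specific, so the following is only an illustration, not a recommendation of particular tools: it shells out to Spotlight's mdfind on macOS, or to locate where a locate database is maintained, and returns nothing if neither is available so the caller can fall back to walking the tree:

    import shutil
    import subprocess

    def query_search_index(name):
        """Return index hits for `name`, or an empty list if no index tool is found."""
        if shutil.which("mdfind"):
            cmd = ["mdfind", "-name", name]          # macOS Spotlight
        elif shutil.which("locate"):
            cmd = ["locate", name]                   # locate/mlocate/plocate database
        else:
            return []                                # no index: caller falls back to a walk
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout.splitlines()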

License: CC BY-SA with attribution. Not affiliated with Stack Overflow.