Question

I'm developing a file-management Windows application. The program should keep an array of the paths of all files and folders on the disk. For example:

0 "C:"  
1 "C:\abc"  
2 "C:\abc\def"  
3 "C:\ghi"  
4 "C:\ghi\readme.txt"  

The array "as is" will be very large, so it should be compressed and stored on the disk. However, I'd like to have random access to it:

  1. to retrieve any path in the array by index (e.g., RetrievePath(2) = "C:\abc\def");
  2. to find the index of any path in the array (e.g., IndexOf("C:\ghi") = 3);
  3. to add a new path to the array without changing the indices of any existing paths, e.g., AddPath("C:\ghi\xyz\file.dat");
  4. to rename a file or folder in the database;
  5. to delete an existing path (again, without changing any other indices).
    For example, after deleting path 1 "C:\abc", path 4 should still be "C:\ghi\readme.txt".

Can someone suggest a good algorithm, data structure, or other ideas for doing this?

Edit:
At the moment I've come up with the following solution:

0 "C:"
1 "[0]\abc"
2 "[1]\def"
3 "[0]\ghi"
4 "[3]\readme.txt"

That is, common prefixes are compressed.
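One way to picture this layout in memory is a flat array of (parent index, name component) pairs. The following Python sketch mirrors the table above; the tuple layout is my own illustration, not a fixed format:

    # Each entry stores the index of its parent entry plus one path
    # component; parent index -1 marks a root entry.
    entries = [
        (-1, "C:"),          # 0 "C:"
        (0, "abc"),          # 1 "[0]\abc"
        (1, "def"),          # 2 "[1]\def"
        (0, "ghi"),          # 3 "[0]\ghi"
        (3, "readme.txt"),   # 4 "[3]\readme.txt"
    ]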

  1. RetrievePath(2) = "[1]\def" = RetrievePath(1) + "\def" = "[0]\abc\def" = RetrievePath(0) + "\abc\def" = "C:\abc\def"
  2. IndexOf() also works iteratively, something like this:

    IndexOf("C:") = 0
    IndexOf("C:\abc") = IndexOf("[0]\abc") = 1
    IndexOf("C:\abc\def") = IndexOf("[1]\def") = 2
    
  3. To add a new path, say AddPath("C:\ghi\xyz\file.dat"), one first adds its missing prefixes:

    5 [3]\xyz
    6 [5]\file.dat
    
  4. Renaming/moving a file or folder involves just one replacement (e.g., replacing [0]\ghi with [1]\klm renames the directory "ghi" to "klm" and moves it into the directory "C:\abc").

  5. DeletePath() sets the entry (and all entries beneath it) to empty strings; in the future, those slots can be reused for new paths. (A combined sketch of all five operations follows the example below.)
    After DeletePath("C:\abc"), the array will be:

    0 "C:"
    1 ""
    2 ""
    3 "[0]\ghi"
    4 "[3]\readme.txt"
    

The whole array still needs to be loaded into RAM to perform fast operations. With, for example, 1,000,000 files and folders in total and an average filename length of 10, the array will occupy over 10 MB.
Also, IndexOf() is forced to scan the array sequentially.

Edit (2): I just realised that my question can be reformulated:
How can I assign each file and each folder on the disk a unique integer index so that I can quickly find a file/folder by index, find the index of a known file/folder, and perform basic file operations without changing many indices?

Edit (3): Here is a question about a similar but Linux-related problem, where filename and content hashing are suggested for identifying a file. Are there any Windows-specific improvements?


Solution

Your solution seems decent. You could also try to compress further using ad-hoc tricks, such as using only a few bits for common characters like "\", drive letters, and perhaps common file extensions. You could also have a look at tries ( http://en.wikipedia.org/wiki/Trie ).
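For illustration, a bare-bones path trie in Python might look like the following; the node layout is my own assumption, not something prescribed here:

    class TrieNode:
        def __init__(self):
            self.children = {}   # path component -> TrieNode
            self.index = None    # stable integer index, if this node is a stored path

    def trie_insert(root, path, index):
        """Shared prefixes such as C:\abc occupy a single chain of nodes."""
        node = root
        for name in path.split("\\"):
            node = node.children.setdefault(name, TrieNode())
        node.index = index

    root = TrieNode()
    trie_insert(root, "C:\\abc", 1)
    trie_insert(root, "C:\\abc\\def", 2)   # reuses the C: and abc nodes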

Regarding your second edit: that matches the features of a hash table, but a hash table solves the indexing problem, not the compressed storage.
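For example, keeping a side dictionary from full path to index (my illustration, not part of the original answer) turns IndexOf into an average O(1) lookup, at the cost of extra memory and of updating the dictionary on every add, rename, and delete:

    # Reverse index: full path -> array index; must be kept in sync
    # with the compressed array on every add, rename, and delete.
    path_to_index = {
        "C:": 0,
        "C:\\abc": 1,
        "C:\\abc\\def": 2,
        "C:\\ghi": 3,
        "C:\\ghi\\readme.txt": 4,
    }

    def index_of(path):
        """Average O(1) lookup instead of a sequential scan."""
        return path_to_index.get(path, -1)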

Licensed under: CC-BY-SA with attribution