Question

I'm looking to apply a KMP (or similar) search to a large file (> 4GB).

I'm expecting this to give me problems though.I can't copy it all to memory because there isn't enough space there.

My question is, what is the best way to go about doing this search? Should I simply create a FILE* and do the search directly in the file, should I copy blocks (say 4k) to memory and search those, or something else completely?

Was it helpful?

Solution

If you are using a platform that supports it, you can use mmap(). Pagination of the file is also a possibility, but remember to keep the buffer as large as possible to reduce the IO overhead, and to be careful between boundaries of two pages (suppose a string is matching, but is splitted by the page boundary)

Alternatively, I suggest you to build an index of some sort, and use the index to restrict the search. KMP search is not particularly efficient. This of course depends on the nature of your file, how it gets created, etc.

OTHER TIPS

For the file access I would recommend to use memory mapped file to avoid the data copy. It is trivial on unix machines. You may have to split the file mapping into smaller blocks if it can't be allocated in one block. I can provide some code if you are interested.

For the search I would recommend using the Boyer More search algorithm.

Searching directly in the file would be very slow, using buffering will give much better performance. But note that your buffer has to be bigger than what you search (SearchLength), of course, and you have to refresh the buffer when being SearchLength bytes before its end.

Best approach is to read it in blocks and search that. You should make the block size a parameter, so you can experiment with what gives the best performance.

However, it is usually more efficient to try and index the file in some way so that you don't have to linearly search through the whole file. For example, KMP is a string searching algorithm - are you just looking for occuences of a word? Then you can just create a hash table (on disk) of the words and their location in the file and have very efficient search.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top