Question

For example, let's say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste of time to copy the entire file into an array and then run binary search... I've effectively made it a linear time algorithm, because I'll have to spend O(n) time copying the darn file before I can run my search.

Is there a faster way to do this? Is there maybe something like lseek which works with lines instead of bytes?

If there isn't, am I better off just doing a linear search instead (assuming I'm only running the search once for the entire duration of my program)?


Solution

You cannot seek by line: the file stores no record of where lines begin, so there is nothing for a seek to target. It's pretty obvious once you think about it.

But you can do a sort-of binary search on a text file.

What you do is (a rough C sketch follows the list):

  • Stat the file to get the length or seek to the end and get the position.
  • Memory map the file.
    (This is best, I think, but you can use lseek and read if you must.)
  • Seek to the middle of the file, minus your average line length. Just guess.
  • Scan forward for a newline, unless you are at position 0.
  • Read your line and compare.
  • Repeat, halving the step each time: 1/4 or 3/4, then 1/8, 1/16, and so on.
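
Here is a minimal sketch of that procedure in C, assuming a POSIX system, a sorted file of newline-delimited lines with no embedded NUL bytes, and exact whole-line matches; find_line() is just an illustrative name, not a library call:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Return 1 if `key` appears as a complete line in the mapped region. */
    static int find_line(const char *map, size_t len, const char *key)
    {
        size_t klen = strlen(key);
        size_t lo = 0, hi = len;              /* lo is always a line start */

        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;

            /* Back up from mid to the start of the line containing it. */
            size_t start = mid;
            while (start > lo && map[start - 1] != '\n')
                start--;

            /* Find the end of that line (or end of file). */
            const char *nl = memchr(map + start, '\n', hi - start);
            size_t end = nl ? (size_t)(nl - map) : hi;

            int cmp = strncmp(key, map + start, end - start);
            if (cmp == 0 && klen == end - start)
                return 1;                     /* exact match */
            if (cmp < 0 || (cmp == 0 && klen < end - start))
                hi = start;                   /* look in earlier lines */
            else
                lo = end + 1;                 /* look in later lines */
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s sorted-file key\n", argv[0]);
            return 2;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        puts(find_line(map, st.st_size, argv[2]) ? "found" : "not found");

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }

Each probe touches only one line plus the short backward scan to its start, so the number of lines read stays proportional to the logarithm of the file size rather than to the file size itself.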

OTHER TIPS

A disk-based binary search needs to be, at least initially, "block-aware", i.e. aware of the fact that whether you read a single byte or a whole block, the I/O cost is the same. It also needs to be aware of the relatively higher cost of a seek operation compared to a sequential read.

Several ways it can exploit this awareness of disk-I/O characteristics (a sketch of the first and last points follows the list):

  • Towards the end of the search, favor linear searching (scanning) over further seeks.
  • At the beginning, check both the first and the last element in the block; this may help extrapolate a better guess for the next split.
  • Cache a tree (or even a short flat list) of some of the items found in various places in the file (a bit like the intermediate nodes in a formal B-tree structure).
  • Declare and use an appropriate buffer size
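
As a rough illustration of the first and last points, here is what the end-game might look like; a sketch under the assumption that the surrounding search has already narrowed the range so that lo sits on a line boundary and hi - lo is at most one block, with BLOCK and scan_block() being names of my own invention:

    #include <string.h>
    #include <unistd.h>

    #define BLOCK 8192                       /* one "appropriate buffer size" */

    /* End-game of a disk-based binary search: once the candidate range
     * [lo, hi) fits in one block, stop seeking and scan it sequentially.
     * Assumes lo sits on a line boundary and the data contains no NULs. */
    static int scan_block(int fd, off_t lo, off_t hi, const char *key)
    {
        char buf[BLOCK + 1];
        size_t want = (size_t)(hi - lo);
        if (want > BLOCK)
            want = BLOCK;

        ssize_t n = pread(fd, buf, want, lo);
        if (n <= 0)
            return 0;
        buf[n] = '\0';

        size_t klen = strlen(key);
        char *line = buf;
        while (line < buf + n) {
            char *nl = strchr(line, '\n');
            size_t len = nl ? (size_t)(nl - line) : strlen(line);
            if (len == klen && memcmp(line, key, klen) == 0)
                return 1;                    /* key found on this line */
            if (!nl)
                break;                       /* last (possibly partial) line */
            line = nl + 1;
        }
        return 0;
    }

The outer search would keep seeking and probing as in the earlier sketch, and call something like this once the remaining range fits in a single buffer.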

If the file is small, say under a few hundred kilobytes, it's almost certainly faster to read (or memory-map) the entire file into memory. The overhead of doing several I/O operations to seek and transfer is much worse than just reading the whole file, which is what most programs do and what most operating systems assume will be done.

Unless all the lines are the same length, or have a very predictable length, there's no easy way to seek to line #n. But to perform a binary search, I'd work with byte offsets: read, say, 100 bytes (if the words are all less than 100 characters long) before and after the candidate offset, 200 bytes in total, then scan for the newline before and after the middle of that window to extract the word.
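
That probe might look roughly like this; a sketch assuming pread() and the 100-character bound mentioned above, with WINDOW and read_word_at() being made-up names:

    #include <string.h>
    #include <unistd.h>

    #define WINDOW 100   /* assumed upper bound on word/line length */

    /* Read up to WINDOW bytes on each side of `off` and copy the line
     * that contains `off` into `out` (size outsz). Returns 0 on success. */
    static int read_word_at(int fd, off_t off, char *out, size_t outsz)
    {
        char buf[2 * WINDOW];
        off_t start = off > WINDOW ? off - WINDOW : 0;
        ssize_t n = pread(fd, buf, sizeof buf, start);
        if (n <= 0)
            return -1;

        size_t mid = (size_t)(off - start);
        if (mid >= (size_t)n)
            mid = (size_t)n - 1;

        /* Scan backwards for the newline before `off`... */
        size_t b = mid;
        while (b > 0 && buf[b - 1] != '\n')
            b--;
        /* ...and forwards for the newline after it. */
        size_t e = mid;
        while (e < (size_t)n && buf[e] != '\n')
            e++;

        size_t len = e - b;
        if (len >= outsz)
            len = outsz - 1;
        memcpy(out, buf + b, len);
        out[len] = '\0';
        return 0;
    }

The caller then compares the extracted word against the key and halves the offset range as usual.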

Yes, you can lseek, and it helps if the size of each word/number per line is fixed. If that is not the case, which is more likely, you have to lseek by fractions of the file size and then scan to the nearest word boundary to still achieve something close to the typical O(log n) time complexity of a binary search.

There isn't a line-based "lseek", because the raw file operations have no concept of a "line". That concept exists at a different layer of abstraction than the raw file commands.

As to whether it's faster or not, the answer will depend upon a number of factors, including the size of the file, the disk drive speed, and the amount of RAM available. If it isn't a large file, my guess is it would be faster to load the entire file into memory.

If it is a large file, I would use the binary search algorithm to narrow it down to a smaller range (say, a couple of megabytes), then load up that entire block.

As mentioned above, since the file is a text file, predicting the byte at which a given line begins within the file can't be done reliably. The ersatz binary search idea is a pretty good one. But it really won't save you a ton unless the file is huge, given how fast sequential I/O is nowadays and how slow random I/O is.

As you mention, if you are going to read it in, you might as well search it linearly as you go. So do that: use a modified Boyer-Moore search as you read it in and you'll do pretty well.

There are so many performance tradeoffs here that it's impossible to know what makes sense until you have measurements on typical data.

If you're going to maintain this code, it needs to be simple. If searches are rare or the file is small, go with linear search. If the cost actually matters, you'll have to do some experiments.

The second thing I would try after linear search would be to mmap the file and scan through it for newlines. This does take linear time, but strchr can be very fast. It helps if you can guarantee the file ends in a newline. Once you have the lines demarcated, you can keep the number of comparisons small by doing a binary search.
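
A sketch of that two-phase idea, assuming the file is already memory-mapped as in the earlier sketch and using memchr() rather than strchr() because a mapping is not NUL-terminated; build_index() and index_search() are illustrative names:

    #include <stdlib.h>
    #include <string.h>

    /* Phase 1: one linear pass to record where each line starts. */
    static size_t build_index(const char *map, size_t len, size_t **out)
    {
        size_t cap = 1024, count = 0;
        size_t *starts = malloc(cap * sizeof *starts);
        if (!starts)
            return 0;

        size_t pos = 0;
        while (pos < len) {
            if (count == cap) {
                size_t *tmp = realloc(starts, 2 * cap * sizeof *starts);
                if (!tmp) { free(starts); return 0; }
                starts = tmp;
                cap *= 2;
            }
            starts[count++] = pos;                 /* this line starts here */
            const char *nl = memchr(map + pos, '\n', len - pos);
            if (!nl)
                break;                             /* last line, no newline */
            pos = (size_t)(nl - map) + 1;
        }
        *out = starts;
        return count;
    }

    /* Phase 2: binary search over the line index. */
    static int index_search(const char *map, size_t len,
                            const size_t *starts, size_t nlines,
                            const char *key)
    {
        size_t klen = strlen(key);
        size_t lo = 0, hi = nlines;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            size_t start = starts[mid];
            /* The line's newline sits just before the next line's start. */
            size_t end = (mid + 1 < nlines) ? starts[mid + 1] - 1 : len;
            if (end > start && map[end - 1] == '\n')
                end--;                             /* trim a trailing newline */
            size_t llen = end - start;

            int cmp = strncmp(key, map + start, llen);
            if (cmp == 0 && klen == llen)
                return 1;
            if (cmp < 0 || (cmp == 0 && klen < llen))
                hi = mid;
            else
                lo = mid + 1;
        }
        return 0;
    }

Building the index costs one linear pass, but after that each lookup does only O(log n) comparisons against lines it can address directly.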

Another option you should consider is Boyer-Moore string search. This is a sub-linear-time search (in characters examined) and, depending on the size of the search pattern, it may be faster in practice than the logarithmic binary search, since it reads the file sequentially rather than seeking. Boyer-Moore is especially good with long search strings.
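
For reference, the commonly implemented Boyer-Moore-Horspool simplification (one bad-character table instead of full Boyer-Moore's two tables) fits in a few lines; this is a sketch, and bmh_search() is my own name:

    #include <limits.h>
    #include <stddef.h>
    #include <string.h>

    /* Boyer-Moore-Horspool: return a pointer to the first occurrence of
     * `needle` in `hay`, or NULL if it is not present. */
    static const char *bmh_search(const char *hay, size_t hlen,
                                  const char *needle, size_t nlen)
    {
        if (nlen == 0 || hlen < nlen)
            return NULL;

        size_t skip[UCHAR_MAX + 1];
        for (size_t i = 0; i <= UCHAR_MAX; i++)
            skip[i] = nlen;                          /* default shift */
        for (size_t i = 0; i + 1 < nlen; i++)
            skip[(unsigned char)needle[i]] = nlen - 1 - i;

        size_t pos = 0;
        while (pos + nlen <= hlen) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return hay + pos;
            /* Shift by the bad-character rule on the window's last byte. */
            pos += skip[(unsigned char)hay[pos + nlen - 1]];
        }
        return NULL;
    }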

Finally, if you determine binary search is really good, but that identifying the lines is a performance bottleneck, you could precompute the start location of each line and store these precomputed locations in binary format in an auxiliary file.
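
A sketch of building such an auxiliary index, assuming offsets are written as native-endian off_t values to a separate index file; build_line_index() is a made-up name, and a real version would want a header and a fixed byte order:

    #include <stdio.h>
    #include <sys/types.h>

    /* Write the byte offset of the start of every line in `datapath`
     * to `idxpath` as a flat array of off_t values (native byte order). */
    static int build_line_index(const char *datapath, const char *idxpath)
    {
        FILE *in = fopen(datapath, "rb");
        FILE *out = fopen(idxpath, "wb");
        if (!in || !out) {
            if (in) fclose(in);
            if (out) fclose(out);
            return -1;
        }

        off_t offset = 0;
        int at_line_start = 1;
        int c;
        /* Byte-at-a-time for clarity; a real version would buffer reads. */
        while ((c = fgetc(in)) != EOF) {
            if (at_line_start)
                fwrite(&offset, sizeof offset, 1, out);  /* record line start */
            at_line_start = (c == '\n');
            offset++;
        }

        fclose(in);
        return fclose(out) == 0 ? 0 : -1;
    }

The search program can then read or mmap the index, binary-search its fixed-size entries, and seek into the text file only for the final comparisons.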

I feel comfortable making only one prediction: it is almost certainly worth avoiding reading the file in one line at a time with something like readline() or getline(), because those routines allocate a buffer to hold each line's contents. The cost of calling malloc() on every line is likely to swamp any cost of search or comparison.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow