Options for read-only binary flat-file storage using Python
-
20-09-2019 - |
Question
I have been tasked with setting up a flat-file SKU database for use on embedded devices with limited storage and processor speed.
Basically the data I need to store consists of the following:
SKU Description Location Price Qty
The file will consist of several million records.
The most important considerations are storage space and retrieval time. Records will only need to be retrieved by SKU and it will be read-only, so the file can be sorted by SKU.
I would like to access this data with Python. So my questions comes down to this.
Are there existing Python libraries that can provide this functionality for me, or do I need to roll my own?
If the answer comes down to roll my own, does anyone have a suggestions, or good references for doing so?
Solution
How about SQLite with Python bindings? It has a little more than you need, but it's standard software and well-tested.
OTHER TIPS
The old way would be to use a simple key/value data table like gdbm module. Python comes with support for that, but it's not built into the default Python installation on my machine.
In general, use SQLite. As others wrote, it comes standard with Python, and it's used in a lot of embedded systems already.
If the records are fixed length then you can use the bisect module. The file size / the record size gives the number of records in the file. The bisect search will do an O(log(n)) lookup in the file, and you'll need to write an adapter to test for equality. While I haven't tested it, here's a sketch:
import bisect
RECORD_SIZE = 50
class MatchFirst10Chars(object):
def __init__(self, word):
self.word = word
def __lt__(self, other):
return self.word < other[:10]
class FileLookup(object):
def __init__(self, f):
self.f = f
f.seek(0, 2)
self.size = f.tell() // RECORD_SIZE
def __len__(self):
return self.size
def __getitem__(self, i):
self.f.seek(i*RECORD_SIZE)
return self.f.read(RECORD_SIZE)
SKU = "123-56-89 "
f = open("data_file")
fl = FileLookup(f)
i = bisect.bisect(fl, MatchFirst10Chars(SKU))
You could additionally gzip the file and seek on a gzip'ped file, but that's a tradeoff for space vs. time that you'll have to test.
May I suggest cdb? (Python bindings: python-cdb.)
It's a format used for read-only data, like you have; it's basically 256 giant hash tables, each able to have a different number of buckets. The cool thing about cdb is that the file doesn't need to be loaded into memory; it's structured in a way that you can do lookups by just mmap
ing in the bits you need.
The cdb spec is a good read, not least because the lines are formatted to create a uniform right margin. :-D
How about HDF? If you don't need SQL and require fast access to your data, there's nothing faster... in Python... for numerical or structured data.
Take a look at the DatabaseInterfaces section on the Python wiki. It's comprehensive. There are a couple of "pure" Python options listed (like SnakeSQL), which are a tad nicer to deploy. And, of course, there's always Berkeley DB and the like, which are super lean & raw.
Honestly, SQLite will probably work fine for you. If you really need to eek out more performance, then you'd be looking at a record-based format like BDB.
A simple solution is CPickle. You can also find similar questions on SO.
A variation of Andrew Dalke's answer (so you can still use binary search to locate the SKU quickly) which may reduce the space requirements would be to have fixed sized records at the start of the file (one per SKU) and then all the Descriptions and Locations (as null terminated strings say)
You get to save space by not having to pad out the locations and descriptions to fixed length. Also you can save space if there are lots of duplicate locations
Here is an example: say you have
SKU 16 bytes
Description Variable length
Location Variable length
Price 4 bytes (up to $42949672.95)
Quantity 4 bytes (up to 4294967295)
offset SKU desc_off loc_off Price Quantity
0x00000000 SKU0000000000001 0x01f78a40 0x01f78a47 0x000003e8 0x000f4240
0x00000020 SKU0000000000002 0x01f78a53 0x01f78a59 ...
...
... # 999998 more records
...
0x01f78a40 Widget\x00
0x01f78a47 Head office\x00
0x01f78a53 Table\x00
0x01f78a59 Warehouse\x00