Question

I need to transfer a very large dataset (between 1 and 10 million records, possibly many more) from a domain-specific language (DSL) to Python. The DSL's sole output mechanism is a C-style fprintf statement.

Currently, I'm using the DSL's fprintf to write records to a flat file, which looks like this:

x['a',1,2]=1.23456789012345e-01
x['a',1,3]=1.23456789012345e-01
x['a',1,4]=1.23456789012345e-01
y1=1.23456789012345e-01
y2=1.23456789012345e-01
z['a',1,2]=1.23456789012345e-01
z['a',1,3]=1.23456789012345e-01
z['a',1,4]=1.23456789012345e-01

As you can see, the structure of each record is very simple (but representing each double-precision float as a 20-character string is grossly inefficient!):

<variable-length string> + "=" + <double-precision float>

I'm currently using Python to read each line and split it on the "=".
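
For reference, here is a minimal sketch of that reading loop (the filename data.dat is a hypothetical stand-in):

# Read the flat file line by line and split each record on "=".
# "data.dat" is a hypothetical filename used for illustration.
records = {}
with open("data.dat") as f:
    for line in f:
        key, _, value = line.rstrip("\n").partition("=")
        records[key] = float(value)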

Is there anything I can do to make the representation more compact, so as to make it faster for Python to read? Is some sort of binary-encoding possible with fprintf?


Solution

Err.... How many times per minute are you reading this data from Python?

Because on my system I can read such a file, with 20 million records (~400 MB), in well under a second.

Unless you are running this on very limited hardware, I'd say you are worrying over nothing.

>>> timeit("all(b.read(20) for x in xrange(0, 20000000, 20))", "b=open('data.dat')", number=1)
0.2856929302215576
>>> c = open("data.dat").read()
>>> len(c)
380000172
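
The transcript above is Python 2 (note the use of xrange). A rough Python 3 equivalent, timing one pass over the file in 20-byte chunks until EOF, might look like this (a sketch; timings will of course vary by machine):

from timeit import timeit

# Time one pass reading data.dat in 20-byte chunks until EOF.
# Binary mode avoids newline translation overhead.
print(timeit("while f.read(20): pass",
             "f = open('data.dat', 'rb')",
             number=1))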

OTHER TIPS

A compact binary format for serializing float values is defined in the Basic Encoding Rules (BER), where they are called "reals". Implementations of BER for Python are available, and one is not too hard to write yourself; there are libraries for C as well. You could use this format (that's what it was designed for) or one of its variants (CER, DER). One such Python implementation is pyasn1.
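
As a minimal sketch (assuming pyasn1 is installed, e.g. pip install pyasn1), a round trip through the BER REAL encoding might look like this:

# BER-encode one float as an ASN.1 REAL with pyasn1, then decode it back.
from pyasn1.type import univ
from pyasn1.codec.ber import encoder, decoder

encoded = encoder.encode(univ.Real(1.23456789012345e-01))  # compact bytes
value, _remainder = decoder.decode(encoded, asn1Spec=univ.Real())
print(len(encoded), float(value))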

Licensed under: CC-BY-SA with attribution