What is the best file format to store an uncompressed 2D matrix?

https://datascience.stackexchange.com/questions/9877

data-formats

16-10-2019
Question

For what it's worth my particular case is a symmetrical matrix, but this question should be answered more generally.

Solution

The most compatible format is surely CSV/TSV. It is plain text, and most software packages can gzip it on the fly. There is no widely standardized format for storing matrix or array data: MATLAB has its .mat files, NumPy has .npy and .npz, Stata and SAS have their own, and so on. It is best to just use a plain-text file.
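As a rough illustration of the gzipped-CSV approach, here is a minimal sketch that uses only the Python standard library. The helper names `save_csv_gz` and `load_csv_gz` and the file name `matrix.csv.gz` are made up for this example, and the matrix is assumed to be a list of rows of floats.

```python
import csv
import gzip
import os
import tempfile


def save_csv_gz(matrix, path):
    """Write a list-of-rows matrix as gzip-compressed CSV text (hypothetical helper)."""
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle).writerows(matrix)


def load_csv_gz(path):
    """Read the matrix back, converting every field to float (hypothetical helper)."""
    with gzip.open(path, "rt", newline="") as handle:
        return [[float(field) for field in row] for row in csv.reader(handle)]


matrix = [[0.1, 1.2, 2.3], [3.4, 4.5, 5.6]]
path = os.path.join(tempfile.mkdtemp(), "matrix.csv.gz")  # illustrative file name
save_csv_gz(matrix, path)
restored = load_csv_gz(path)
```

The resulting file can still be inspected with any tool that reads gzip, and dropping the `gzip.open` calls in favour of plain `open` gives an ordinary CSV file.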

If the matrix is symmetric, and it is very large or there will be a lot of them, you can save nearly 50% of the space by storing only the lower (or upper) triangular part, diagonal included: $n(n+1)/2$ values instead of $n^2$. If you choose to do so, there is, again, no widely accepted format. Just store the shape first and then the flattened, 1D data.
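One possible way to flatten a symmetric matrix this way is sketched below. The layout (side length first, then the lower triangle row by row) and the function names `pack_lower` and `unpack_lower` are assumptions made for this example, not an established format.

```python
def pack_lower(matrix):
    """Return [n, a00, a10, a11, a20, ...]: the side, then the lower triangle row by row."""
    n = len(matrix)
    flat = [n]
    for i in range(n):
        flat.extend(matrix[i][: i + 1])  # row i up to and including the diagonal
    return flat


def unpack_lower(flat):
    """Rebuild the full symmetric matrix from the layout produced by pack_lower."""
    n = int(flat[0])
    values = flat[1:]
    if len(values) != n * (n + 1) // 2:
        raise ValueError("expected n(n+1)/2 values after the header")
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            matrix[i][j] = matrix[j][i] = values[k]  # mirror across the diagonal
            k += 1
    return matrix


symmetric = [[1.0, 2.0, 3.0], [2.0, 4.0, 5.0], [3.0, 5.0, 6.0]]
packed = pack_lower(symmetric)
rebuilt = unpack_lower(packed)
```

For a 3 x 3 matrix this stores 6 values plus the header instead of 9; the saving approaches 50% as the matrix grows.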

OTHER TIPS

I would go with .csv, as it is universally accepted and can be read easily from many programming languages. Moreover, you can simply open it in office software. If you only use your matrix in Python, I also recommend the pickle module, which serializes your matrix to a binary file (often given a .p extension) that can be read back in Python with a single load call.
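A minimal sketch of the pickle route, assuming the matrix is a plain list of lists (NumPy arrays pickle the same way). The file name `matrix.p` is only illustrative.

```python
import os
import pickle
import tempfile

matrix = [[0.1, 1.2, 2.3], [3.4, 4.5, 5.6]]
path = os.path.join(tempfile.mkdtemp(), "matrix.p")  # illustrative file name

# Pickle is a binary format, so the file is opened in binary mode.
with open(path, "wb") as handle:
    pickle.dump(matrix, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)
```

Pickle files are Python-specific and should only be loaded from sources you trust, since unpickling can execute arbitrary code.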

Because line endings (carriage return/line feed, CR/LF) can cause issues depending on the system, I suggest storing the matrix "inline", raster-style, in some text format (CSV, TSV), preceded by a short header that specifies your convention: a version number (you might change your mind later and increment it), and at least the matrix size. This is similar to what is done in the portable graymap and bitmap image formats (PGM/PBM).

I have used this approach to store filter bank coefficients as text.

A minimal example could be: 2,3,0.1,1.2,2.3,3.4,4.5,5.6 for the $2\times 3$ matrix $$\begin{array}{ccc} 0.1 & 1.2 & 2.3\\ 3.4 & 4.5 & 5.6 \end{array}$$ but you can use, for instance, #2,#3,0.1,1.2,2.3,3.4,4.5,5.6 so that aliens (think of the golden Pioneer plaque) understand that the first two integers are "different" and provide hints on how the following numbers should be read. With a square matrix (which every symmetric matrix is), this is even more interesting: you only need one header number #n (the side), and astute readers will see that exactly $n^2$ numbers remain.
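The single-line layout from this answer could be written and parsed roughly as follows. This is a sketch under stated assumptions: the function names `encode_inline` and `decode_inline` are hypothetical, the header always carries both dimensions, and no version field is included.

```python
def encode_inline(matrix):
    """Serialize as '#rows,#cols,v1,v2,...' on a single line (row-major order)."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    fields = [f"#{rows}", f"#{cols}"]
    for row in matrix:
        if len(row) != cols:
            raise ValueError("all rows must have the same length")
        fields.extend(repr(float(value)) for value in row)
    return ",".join(fields)


def decode_inline(text):
    """Parse the layout above; the '#' markers on the header are optional."""
    fields = text.strip().split(",")
    rows = int(fields[0].lstrip("#"))
    cols = int(fields[1].lstrip("#"))
    values = [float(field) for field in fields[2:]]
    if len(values) != rows * cols:
        raise ValueError("header does not match the number of values")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


matrix = [[0.1, 1.2, 2.3], [3.4, 4.5, 5.6]]
encoded = encode_inline(matrix)
decoded = decode_inline("2,3,0.1,1.2,2.3,3.4,4.5,5.6")
```

Because the whole matrix sits on one line, the file contains no line breaks inside the data, which sidesteps the CR/LF differences between systems.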

You can also have a look at other Matrix Storage Schemes, and if your matrix is sparse, Compressed Row Storage (CRS).
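To give an idea of what Compressed Row Storage looks like, here is a small sketch that converts a dense list-of-lists matrix into the three usual CRS arrays and back. The names `to_crs`, `from_crs`, `values`, `col_index` and `row_ptr` are chosen for this example; libraries use their own naming.

```python
def to_crs(matrix):
    """Return (values, col_index, row_ptr) for the non-zero entries of a dense matrix."""
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_index.append(j)
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_index, row_ptr


def from_crs(values, col_index, row_ptr, n_cols):
    """Expand the CRS arrays back into a dense list-of-lists matrix."""
    matrix = []
    for r in range(len(row_ptr) - 1):
        row = [0.0] * n_cols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_index[k]] = values[k]
        matrix.append(row)
    return matrix


sparse = [[5.0, 0.0, 0.0], [0.0, 0.0, 8.0], [0.0, 3.0, 0.0]]
values, col_index, row_ptr = to_crs(sparse)
dense = from_crs(values, col_index, row_ptr, n_cols=3)
```

Only the non-zero values and their positions are kept, so the saving grows with the share of zeros in the matrix.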

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange