Question

I have multiple large (>10GB) SAS datasets that I want to convert for use in pandas, preferably in HDF5. There are many different data types (dates, numeric, text), and some numeric fields also use different error codes for missing values (i.e., values can be ., .E, .C, etc.). I'm hoping to keep the column names and label metadata as well. Has anyone found an efficient way to do this?

I tried using MySQL as a bridge between the two, but I got some "Out of range" errors during the transfer, and it was incredibly slow. I also tried exporting from SAS in Stata's .dta format, but SAS (9.3) exports an old Stata format that is not compatible with read_stata() in pandas. I also tried the sas7bdat package, but its description says it has not been widely tested, so I'd like to load the datasets another way and compare the results to make sure everything is working properly.

Extra details: the datasets I'm looking to convert are those from CRSP, Compustat, IBES and TFN from WRDS.


Solution

I haven't had much luck with this in the past. We (where I work) just use tab-separated files to move data between SAS and Python -- and we do it a lot.

That said, if you are on Windows, you can try setting up an ODBC connection and writing the data that way.
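If you go the tab-separated route, pandas can map the SAS missing-value codes from the question (., .E, .C) to NaN at load time. A minimal sketch, assuming the data has already been exported from SAS as a TSV (the column names and values here are made up for illustration):

```python
import io

import pandas as pd

# Stand-in for a TSV exported from SAS (e.g. via PROC EXPORT).
tsv = io.StringIO(
    "permno\tdate\tprice\n"
    "10001\t2001-01-31\t12.5\n"
    "10001\t2001-02-28\t.E\n"   # SAS special missing value
    "10002\t2001-01-31\t.\n"    # ordinary SAS missing value
)

# Treat the SAS missing-value codes as NaN so the column parses as float.
df = pd.read_csv(
    tsv,
    sep="\t",
    parse_dates=["date"],
    na_values=[".", ".E", ".C"],
)

print(df["price"].isna().sum())  # 2 of the 3 prices are missing
# With PyTables installed, the frame can then be written to HDF5:
# df.to_hdf("crsp.h5", key="prices", format="table")
```

Note that this maps all the special missing codes to a single NaN; if you need to preserve which code caused the missing value, load the column as text first and split it into a numeric column plus a code column.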

OTHER TIPS

You might be interested in the dirty hack used in a fork of sas7bdat. It provides a read_sas method that reads SAS files directly into a pandas DataFrame.

Original sas7bdat: http://git.pyhacker.com/sas7bdat

Fork with read_sas: https://github.com/openfisca/sas7bdat

Improvements are welcome!
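Since the question also asks about loading the data a second way to verify the results, a small comparison helper can check one load against another (for example, a sas7bdat load against a TSV load). This is only a sketch: frames_match is a hypothetical helper, and the two DataFrames below are synthetic stand-ins for two loads of the same dataset.

```python
import numpy as np
import pandas as pd


def frames_match(a, b, float_tol=1e-9):
    """Rough equality check for two loads of the same dataset."""
    if list(a.columns) != list(b.columns) or len(a) != len(b):
        return False
    for col in a.columns:
        if np.issubdtype(a[col].dtype, np.number):
            # Missing values must line up, and the numbers that are
            # present must agree within a small tolerance.
            if not (a[col].isna() == b[col].isna()).all():
                return False
            both = a[col].notna() & b[col].notna()
            if not np.allclose(a.loc[both, col], b.loc[both, col],
                               atol=float_tol):
                return False
        else:
            # Compare non-numeric columns as strings, treating NaN as "".
            if not a[col].fillna("").astype(str).equals(
                    b[col].fillna("").astype(str)):
                return False
    return True


# Synthetic stand-ins for, e.g., a sas7bdat load and a TSV load.
left = pd.DataFrame({"permno": [10001, 10002], "price": [12.5, np.nan]})
right = pd.DataFrame({"permno": [10001, 10002], "price": [12.5, np.nan]})
print(frames_match(left, right))  # True
```

On real 10GB files you would run a check like this chunk by chunk rather than holding both copies in memory at once.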

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow