Pandas reading csv into hdfstore thrashes, creates huge file

https://stackoverflow.com/questions/22541645

18-06-2023

Question

As a test, I'm trying to load a small 25 MB CSV file into a pandas.HDFStore:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()

It causes my computer to thrash, and when it finally completes, file.h5 is 6.7 GB. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe. If I read the CSV in without chunking and then add it to the store, I have no problems.

Update 1: I'm running Anaconda, using Python 2.7.6, HDF5 version 1.8.9, NumPy 1.8.0, PyTables 3.1.0, pandas 0.13.1, Ubuntu 12.04. The data is proprietary, so I can't post the chunk information online. I do have some mixed types. It still crashes if I try to read everything in as object.

Update 2: Dropped all the columns with mixed type and I'm still getting the same issue. I have some very large text columns if that makes any difference.

Update 3: The problem seems to be loading the dataframe into the HDFStore. I drastically reduced the size of my file, but kept one of my very wide columns (1259 characters). Whereas the CSV file is 878.6 KB, the HDFStore is 53 MB. Is PyTables unable to handle very wide columns? Is there a threshold above which I should truncate?

Solution

The wide object columns are definitely the problem. My solution has been to truncate the object columns while reading them in. If I truncate to a width of 20 characters, the h5 file is only about twice as large as the CSV file. However, if I truncate to 100 characters, the h5 file is about 6 times larger.
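
A likely reason for the blow-up: when HDFStore writes a table, PyTables stores each string column at a fixed width, so every row pays for the widest value. The sketch below estimates that padding cost for a single column. The helper name `fixed_width_overhead` and the sample data are made up for illustration, and it assumes one byte per character.

```python
import pandas as pd

def fixed_width_overhead(ser):
    """Return (actual_chars, padded_chars) for a string column.

    actual_chars is the total length of the text itself;
    padded_chars is what the column costs if every row is
    stored at the width of the longest value.
    """
    lengths = ser.astype(str).str.len()
    actual = int(lengths.sum())
    padded = int(lengths.max()) * len(ser)
    return actual, padded

# Hypothetical column: 999 short strings plus one 1259-character value
col = pd.Series(["ab"] * 999 + ["x" * 1259])
actual, padded = fixed_width_overhead(col)
print(actual, padded)  # 3257 1259000
```

In this made-up case one wide value inflates the column several hundred times over, which is consistent with truncation shrinking the file so sharply.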

I include my code below as an answer, but if anyone has any idea how to reduce this size disparity without having to truncate so much text, I'd be grateful.

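import pandas as pd

def truncateCol(ser, width=100):
    # Only object (string) columns are truncated; other dtypes pass through.
    if ser.dtype == object:  # np.object was removed in NumPy 1.24
        ser = ser.str[:width] if ser.str.len().max() > width else ser
    return ser

# filepath, f and table are defined elsewhere.
store = pd.HDFStore(filepath, 'w')
# On pandas >= 1.3, replace error_bad_lines=False with on_bad_lines='skip'.
for chunk in pd.read_csv(f, chunksize=5000, sep='\t',
                         na_values="null", error_bad_lines=False):
    # Truncate each wide text column before it reaches the store.
    chunk = chunk.apply(truncateCol)
    store.append(table, chunk)
store.close()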
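
One option that may reduce the size without truncating is to enable compression on the store, since padded fixed-width strings tend to compress well. The following is only a sketch, not a measured result: the function name `csv_to_compressed_store`, its parameters, and the idea that the caller already knows the maximum width of each wide column are all assumptions. It relies on the `complevel` and `complib` arguments of `pd.HDFStore` and the `min_itemsize` argument of `store.append`, and needs PyTables installed.

```python
import pandas as pd

def csv_to_compressed_store(csv_path, h5_path, key, wide_cols, chunksize=5000):
    """Append a CSV to an HDF5 table in chunks with compression enabled.

    wide_cols maps column name -> maximum string length. This is an
    assumption: the caller knows or has pre-computed these widths.
    """
    # complevel/complib enable compression for data written to this store
    with pd.HDFStore(h5_path, mode="w", complevel=9, complib="blosc") as store:
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            # min_itemsize reserves the string width up front, so a later
            # chunk with a longer value still fits the existing table
            store.append(key, chunk, min_itemsize=wide_cols)
```

A call might look like `csv_to_compressed_store('file.csv', 'file.h5', 'df', {'text': 1259})`, where `text` stands in for the wide column. Note that columns named in `min_itemsize` are created as data columns.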

Licensed under: CC-BY-SA with attribution
