I am using Pandas to store, load, and manipulate financial data. A typical data file is a 6000x4000 DataFrame (6000 stocks x 4000 trading dates) which, if say half the stocks have an N/A value on any given date, comes to about 200MB in CSV format. I have been using a workstation with 16GB of memory, which has been sufficient for loading entire CSVs of this size into memory, performing various calculations, and then storing the results. In a typical day I end up using about 10GB of RAM at peak. I have the feeling that I could be doing things more efficiently, though. I would like to get that number down to around 2GB so I can run my daily update of several models on a regular laptop with 4GB of RAM. Is this reasonable? Am I using too much memory as is, regardless of my hardware?
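To put numbers on this, here is how I have been checking a frame's in-memory footprint, plus one idea I have been considering: downcasting float64 columns to float32, which should roughly halve memory for all-float frames (a sketch; I have not verified that single precision is sufficient for my data):

import pandas as pd

prices = pd.read_csv("prices.csv")
# In-memory footprint in MB
print(prices.memory_usage(deep=True).sum() / 1e6)

# Downcast float64 -> float32: about half the memory, at the cost of
# roughly 7 significant digits of precision
float_cols = prices.select_dtypes(include="float64").columns
prices[float_cols] = prices[float_cols].astype("float32")
print(prices.memory_usage(deep=True).sum() / 1e6)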
I understand the answer to the above depends upon the details of what I am doing. Here is an example of the type of function I might run:
import numpy as np
import pandas as pd

def momentum_strategy():
    # prices.csv is a matrix containing stock prices for 6000 stocks
    # and 4000 trading dates
    prices = pd.read_csv("prices.csv")
    # Daily stock returns
    returns = prices / prices.shift(1) - 1
    # Annualized return volatility (21-day rolling std, scaled by sqrt(252))
    volatility = returns.rolling(21, min_periods=21).std() * 252 ** 0.5
    # 6-month stock returns
    trail6monthreturns = prices / prices.shift(21 * 6) - 1
    # Cross-sectional rank of 6-month returns (1 = highest)
    retrank = trail6monthreturns.rank(axis=1, ascending=False)
    # Portfolio of the top 100 stocks as measured by 6-month return
    positions = pd.DataFrame(np.where(retrank <= 100, 1, np.nan),
                             index=retrank.index, columns=retrank.columns)
    # Daily returns for the top 100 stocks
    uptrendreturns = positions * returns
    # Daily return for the 100-stock portfolio
    portfolioreturns = uptrendreturns.mean(axis=1)
    return positions, portfolioreturns
One thought I had was to use the HDF5 storage format instead of CSVs, since from recent testing and a perusal of the pandas documentation and Stack Overflow it appears to be much faster for input/output and less memory-intensive during such operations. Any thoughts on this? For example, I store the daily open, high, low, close, volume, shares outstanding, PE ratio, earnings growth, and another 30 measures like this in a separate CSV each (as in the example above, typically 6000 stocks x 4000 trading dates per measure). If a switch to HDF5 is recommended, should I just store these same DataFrames in 30+ separate H5 files?
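Or would a single store with one key per measure make more sense? A minimal sketch of what I mean (the file name and keys are just placeholders, and I assume the frames are already loaded):

# One HDF5 file holding all measures, one key per DataFrame
with pd.HDFStore('daily.h5') as store:
    store['prices'] = prices
    store['volume'] = volume
    store['pe_ratio'] = pe_ratio
    # ... one key per remaining measure

# Later, load only the measure a given model needs
prices = pd.read_hdf('daily.h5', 'prices')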
In the function above, if I wanted access to some of the intermediate results after the function completes, but without keeping them in memory, would it make sense to store them in a "temp" folder containing an HDF5 file? For example:
def momentum_strategy_hdf5():
    # prices.csv is a matrix containing stock prices for 6000 stocks
    # and 4000 trading dates
    prices = pd.read_csv("prices.csv")
    s = pd.HDFStore("temp.h5")
    # Daily stock returns
    s['returns'] = prices / prices.shift(1) - 1
    # Annualized return volatility
    s['volatility'] = s['returns'].rolling(21, min_periods=21).std() * 252 ** 0.5
    # 6-month stock returns
    s['trail6monthreturns'] = prices / prices.shift(21 * 6) - 1
    # Rank of 6-month stock returns
    s['retrank'] = s['trail6monthreturns'].rank(axis=1, ascending=False)
    # Portfolio of the top 100 stocks as measured by 6-month return
    s['positions'] = pd.DataFrame(np.where(s['retrank'] <= 100, 1, np.nan),
                                  index=prices.index, columns=prices.columns)
    # Daily returns for the top 100 stocks
    s['uptrendreturns'] = s['positions'] * s['returns']
    # Daily return for the 100-stock portfolio
    s['portfolioreturns'] = s['uptrendreturns'].mean(axis=1)
    positions, portfolioreturns = s['positions'], s['portfolioreturns']
    s.close()
    return positions, portfolioreturns
Edit: I just tested the above two functions, and the first one took 15 seconds, while the second took 42 seconds. So the second one as written is much slower, but hopefully there's a better way?
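If it helps, one variant I am considering is to do the whole computation in memory, exactly as in the first function, and only write the intermediates out at the end, so nothing is read back from disk mid-calculation (a sketch, not yet tested on the full dataset):

def momentum_strategy_hdf5_v2():
    prices = pd.read_csv("prices.csv")
    # Compute everything in memory, as in momentum_strategy()
    returns = prices / prices.shift(1) - 1
    volatility = returns.rolling(21, min_periods=21).std() * 252 ** 0.5
    trail6monthreturns = prices / prices.shift(21 * 6) - 1
    retrank = trail6monthreturns.rank(axis=1, ascending=False)
    positions = pd.DataFrame(np.where(retrank <= 100, 1, np.nan),
                             index=retrank.index, columns=retrank.columns)
    uptrendreturns = positions * returns
    portfolioreturns = uptrendreturns.mean(axis=1)
    # Persist the intermediates once, after the computation, instead of
    # round-tripping through the store at every step
    with pd.HDFStore("temp.h5") as s:
        s['returns'] = returns
        s['volatility'] = volatility
        s['trail6monthreturns'] = trail6monthreturns
        s['retrank'] = retrank
        s['positions'] = positions
        s['uptrendreturns'] = uptrendreturns
        s['portfolioreturns'] = portfolioreturns
    return positions, portfolioreturns

Would that be the recommended pattern, or is there something smarter?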