Problem

I am using pandas to store, load, and manipulate financial data. A typical data file is a 6000x4000 DataFrame (6000 stocks x 4000 trading dates), which, if say half the stocks have an N/A value on a given date, comes to about 200 MB in CSV format. I have been using a workstation with 16 GB of memory, which has been sufficient for loading entire CSVs of this size into memory, performing various calculations, and then storing the results. On a typical day I end up using about 10 GB of RAM at peak. I have the feeling I could be doing things more efficiently, though. I would like to get that number down to around 2 GB so I can run my daily update of several models on a regular laptop with 4 GB of RAM. Is this reasonable? Am I using too much memory as it is, regardless of my hardware?

I understand the answer to the above depends upon the details of what I am doing. Here is an example of the type of function I might run:

import numpy as np
import pandas as pd

def momentum_strategy():
    # prices.csv is a matrix containing stock prices for 6000 stocks
    # and 4000 trading dates
    prices = pd.read_csv("prices.csv")
    # Daily stock returns
    returns = prices / prices.shift(1) - 1
    # Annualized return volatility
    volatility = pd.rolling_std(returns, 21, 21) * 252**0.5
    # 6-month stock returns
    trail6monthreturns = prices/prices.shift(21*6) - 1
    # Rank of 6 month stock returns
    retrank = trail6monthreturns.rank(axis=1, ascending=False)
    # Portfolio of the top 100 stocks as measured by 6 month return
    positions = retrank.apply(lambda x: np.where(x<= 100, 1, np.nan))
    # Daily returns for top 100 stocks
    uptrendreturns = positions * returns
    # Daily return for 100 stock portfolio
    portfolioreturns = uptrendreturns.mean(1)
    return positions, portfolioreturns

One thought I had was to use the HDF5 storage format instead of CSVs, since from recent testing and a perusal of the pandas documentation and Stack Overflow I see that it is much faster for input/output and less memory-intensive during such operations. Any thoughts on this? For example, I store the daily open, high, low, close, volume, shares outstanding, PE ratio, earnings growth, and another 30 different measures like this in a separate CSV for each (as in the example above, typically 6000 stocks x 4000 trading dates each). If a switch to HDF5 is recommended, should I just store these same DataFrames in 30+ separate H5 files?
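To make the question concrete, here is a rough sketch of what I have in mind; the measure names and file names are placeholders for my actual files, not working code I have settled on:

import pandas as pd

# Hypothetical sketch: the measure names and file names below are placeholders.
measures = ["close", "volume", "pe_ratio"]  # ...roughly 30 in total

# Option A: one HDF5 file per measure (mirrors my current one-CSV-per-measure layout)
for m in measures:
    df = pd.read_csv(m + ".csv", index_col=0, parse_dates=True)
    df.to_hdf(m + ".h5", key=m, mode="w")

# Option B: a single HDFStore with one key per measure
with pd.HDFStore("measures.h5", mode="w") as store:
    for m in measures:
        store[m] = pd.read_csv(m + ".csv", index_col=0, parse_dates=True)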

In the function above, if I wanted access to some of the intermediate results after the function completes, but without using up memory, would it make sense to store the results in a "temp" folder containing an HDF5 file? For example:

def momentum_strategy_hdf5():
    # prices.csv is a matrix containing stock prices for 6000 stocks
    # and 4000 trading dates
    prices = pd.read_csv("prices.csv")
    s = pd.HDFStore("temp.h5")
    # Daily stock returns
    s['returns'] = prices / prices.shift(1) - 1
    # Annualized return volatility
    s['volatility'] = pd.rolling_std(s['returns'], 21, 21) * 252**0.5
    # 6-month stock returns
    s['trail6monthreturns'] = prices / prices.shift(21*6) - 1
    # Rank of 6 month stock returns
    s['retrank'] = s['trail6monthreturns'].rank(axis=1, ascending=False)
    # Portfolio of the top 100 stocks as measured by 6 month return
    s['positions'] = s['retrank'].apply(lambda x: np.where(x<= 100, 1, np.nan))
    # Daily returns for top 100 stocks
    s['uptrendreturns'] = s['positions'] * s['returns']
    # Daily return for 100 stock portfolio
    s['portfolioreturns'] = s['uptrendreturns'].mean(1)
    return s['positions'], s['portfolioreturns']

Edit: I just tested the two functions above: the first took 15 seconds and the second took 42 seconds. So the second, as written, is much slower, but hopefully there's a better way?

Solution

Here is a typical work-flow for this type of data:

  • 1) Read in the CSV data, convert to a DataFrame, coerce the data types, and write it out using HDFStore (depending on your needs this could be 'fixed' or 'table' format). Do this in a separate process, then exit the process. When the dataset is large, I read it in logical pieces (e.g. a range of dates at a time) and output a 'table' format HDF5 file, which I can then append to. (A rough sketch of steps 1 and 2 appears below.)

  • 2) Query (again on dates or some other criteria), perform the calculations, then write out NEW HDF5 files. This can be done in parallel (multiple processes). MAKE SURE THAT YOU ARE WRITING SEPARATE FILES IN EACH PROCESS.

  • 3) Combine the resulting data files into a single HDF5 file. This is a SINGLE-process step.

  • 4) Repeat 2 & 3 as needed.

The key is to work in discrete steps, writing out intermediate data and exiting the process in between. This keeps the in-memory data size manageable and makes the in-memory calculations fast. It also allows multiple processes to run CPU-intensive operations against a read-only HDF5 file.
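If it helps, here is a minimal sketch of steps 1 and 2, assuming a prices.csv with a date column; the chunk size, file names, and the calculation itself are placeholders rather than a definitive implementation:

import pandas as pd

# Step 1: one-off conversion, run in its own process, then exit the process.
# Append chunks to a 'table' format store so it can be queried by date later.
def csv_to_hdf():
    store = pd.HDFStore("prices.h5", mode="w")
    for chunk in pd.read_csv("prices.csv", parse_dates=["date"], chunksize=500):
        store.append("prices", chunk, data_columns=["date"])
    store.close()

# Step 2: query a logical subset (a date range), compute, and write the result
# to a NEW file. Each parallel worker writes its own output file.
def compute_subset(start, end, out_path):
    store = pd.HDFStore("prices.h5", mode="r")
    where = "date >= '%s' & date <= '%s'" % (start, end)
    prices = store.select("prices", where=where)
    store.close()
    returns = prices.set_index("date").pct_change()
    returns.to_hdf(out_path, key="returns", mode="w")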

It is important to do this in separate system processes to allow the system to reclaim memory.
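For example, a minimal sketch using the standard multiprocessing module and the hypothetical csv_to_hdf / compute_subset functions from the sketch above; each stage runs in a child process whose memory is returned to the system when it exits:

from multiprocessing import Process

def run_in_own_process(func, *args):
    # Run one stage in a child process; its memory is released back to the
    # OS when the child exits, so the parent process stays small.
    p = Process(target=func, args=args)
    p.start()
    p.join()

if __name__ == "__main__":
    run_in_own_process(csv_to_hdf)
    # Parallel workers would each call compute_subset with their own output file.
    run_in_own_process(compute_subset, "2010-01-01", "2010-12-31", "returns_2010.h5")
    run_in_own_process(compute_subset, "2011-01-01", "2011-12-31", "returns_2011.h5")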

HTH

Other tips

Although I don't have much experience using HDF5 files, I'll suggest three Python libraries that might get you going in a better direction:

h5py is a Python library specifically built for reading and writing HDF5 binary files. I'm not claiming it is better than pandas' HDFStore (I've found pandas to be pretty awesome at handling sizable amounts of data, 2.2M x 24), but it might do the trick.
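For what it's worth, a minimal h5py sketch with made-up array sizes and dataset names; slicing a dataset reads only that part from disk:

import h5py
import numpy as np

data = np.random.rand(6000, 4000)  # stand-in for a prices matrix

# Write the array as a compressed HDF5 dataset
with h5py.File("prices_h5py.h5", "w") as f:
    f.create_dataset("prices", data=data, compression="gzip")

# Read back only a slice; the rest of the file stays on disk
with h5py.File("prices_h5py.h5", "r") as f:
    first_year = f["prices"][:, :252]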

PyTables has been mentioned a few times in memory management conversations. I have no experience with this library, but I have seen it in discussions dealing with memory/HDF5 problems.
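A minimal PyTables sketch along the same lines (file and node names are made up); note that pandas' HDFStore is itself built on top of PyTables:

import numpy as np
import tables

data = np.random.rand(6000, 4000)  # stand-in for a prices matrix

# Write the array to an HDF5 file, then read back only a slice of it
with tables.open_file("prices_pytables.h5", mode="w") as f:
    f.create_array(f.root, "prices", data)

with tables.open_file("prices_pytables.h5", mode="r") as f:
    first_year = f.root.prices[:, :252]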

mmap is a standard-library module for memory-mapping files (mapping a file's contents into memory so you can work with it as if it were loaded, while the OS pages data in from disk on demand). If you guessed that I have no experience using this library either, you would be a winner.
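And a minimal sketch of the stdlib mmap module, just reading the first lines of a (hypothetical) large CSV without loading the whole file; for numeric binary data, numpy.memmap applies the same idea to arrays:

import mmap

with open("prices.csv", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm.readline()     # reads only the first line
    first_row = mm.readline()  # the OS pages the file in on demand
    mm.close()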

Again, I can't speak much from experience here, but I think these three routes might get you on your way to better utilizing memory in Python when dealing with large data sets.
