I am trying to create a large nested dict in Python, but the program runs out of memory (fails with MemoryError). (I am aware that 64-bit Python can use more memory than 32-bit, but I am looking for an option that will work on 32-bit as well as 64-bit.) Code:
# These lists are the keys for the nested dict:
cities = [
    'Amsterdam', 'Athens', 'Bangkok', 'Barcelona', 'Berlin', 'Brussels', 'Budapest', 'Cologne', 'Geneva', 'Kiev',
    'Lisbon', 'London', 'Lyon', 'Madrid', 'Manchester', 'Manila', 'Minsk', 'Moscow', 'New York', 'Oslo', 'Paris',
    'Prague', 'Rome', 'Sofia', 'Stockholm', 'Taipei', 'Tokyo', 'Vienna'
]
years = range(1950, 2013)
color_codes = ['Green', 'Yellow', 'Red']
source_type_ids = range(1, 6)
precision_categories = range(1, 4)
ages = range(150)
print("Number of elements (=lists) in array: {:,d}".format(
len(cities) * len(years) * len(color_codes) * len(source_type_ids) * len(precision_categories) * len(ages))
)
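# With the sizes above, this prints: Number of elements (=lists) in array: 11,907,000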
# Create nested dict of lists with test values
a = {}
for city in cities:
    a[city] = {}
    for year in years:
        a[city][year] = {}
        for color_code in color_codes:
            a[city][year][color_code] = {}
            for source_type_id in source_type_ids:
                a[city][year][color_code][source_type_id] = {}
                for precision_category in precision_categories:
                    a[city][year][color_code][source_type_id][precision_category] = {}
                    for age in ages:
                        a[city][year][color_code][source_type_id][precision_category][age] = [
                            float(x) for x in range(30)
                        ]  # Just an example list of floats
print(a['Paris'][2005]['Red'][4][3][65])  # Not reached due to MemoryError
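For scale: 28 cities × 63 years × 3 color codes × 5 source types × 3 precision categories × 150 ages = 11,907,000 lists of about 30 floats each, i.e. roughly 357 million float objects. At about 24 bytes per float object on 64-bit CPython, that is already over 8 GB before counting the lists and dicts themselves, which explains the MemoryError.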
What might be a better way to store these data while working on them? I have come across many seemingly relevant technologies on Stack Overflow and elsewhere, but I cannot tell which of them would be easy to use, fast, or otherwise a good fit.
If you are familiar with one or more of the technologies below (or other suitable ones), your comments would be much appreciated - even if only to say that some option is a poor fit for what I am trying to do: work with a large, multidimensional array of lists that will not fit in memory all at once.
- shelve (see the sketch after this list)
- pickle
- HDF5 (h5py) (also sketched after this list)
- PyTables
- Pandas HDFStore (based on PyTables)
- numpy.memmap
- SQLite
- NoSQL, e.g. MongoDB (PyMongo)
- Oracle Berkeley DB (PyBSDDB)
- Hadoop (Pydoop) (map/reduce)
- (The expression "Disk-based dictionaries" is used on the following webpage, but the libraries listed there seem mostly out of date, apart from those already mentioned above: https://wiki.python.org/moin/PersistenceTools)
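To make the options concrete, here is a minimal sketch of the shelve route, which is the smallest change from a plain dict. It assumes the nesting is flattened into one composite string key per list (shelve keys must be strings); the key format and the 'data.db' filename are just my choices for illustration:

import shelve

# Open a persistent, disk-backed dict; 'data.db' is an arbitrary filename
db = shelve.open('data.db')

key = '{}|{}|{}|{}|{}|{}'.format('Paris', 2005, 'Red', 4, 3, 65)

# Append a value: shelve only notices reassignment, so read-modify-write
values = db.get(key, [])
values.append(4.2)
db[key] = values

print(db[key])
db.close()

And a similar sketch for HDF5 via h5py, with one resizable 1-D dataset per list and a group path that mirrors the dict nesting (whether millions of tiny datasets perform well is exactly the kind of thing I am unsure about):

import h5py

with h5py.File('data.h5', 'a') as f:
    path = 'Paris/2005/Red/4/3/65'
    if path not in f:
        # Resizable dataset of 64-bit floats, starting empty
        dset = f.create_dataset(path, shape=(0,), maxshape=(None,), dtype='f8')
    else:
        dset = f[path]
    # Append one value by growing the dataset
    dset.resize((len(dset) + 1,))
    dset[-1] = 4.2
    print(dset[:])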
Some more info:
- The lists will not have the same length.
- The code will largely work on one list at a time, calculating things such as the standard deviation and the mean (and I would really like to keep the lists instead of just accumulating sums).
- Hence, query-like ways of accessing the data (like PyTables offers) are not a must.
- Speed is important - the code will frequently read and/or change many lists. Writes will happen maybe 20 times more often than reads.
- It would be nice if I could change the array shape (in particular add more cities in the code shown) without rebuilding everything, but this too is not a must (the flat-key layout sketched after this list would make that easier).
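Regardless of the storage backend, one layout change I am considering (my own sketch, not something from the libraries above) is flattening the six nesting levels into a single dict keyed by tuples. It removes the intermediate dicts, makes adding cities trivial, and maps directly onto key-value stores like shelve or Berkeley DB:

from collections import defaultdict

# One flat dict instead of six nested levels; keys are
# (city, year, color_code, source_type_id, precision_category, age) tuples
a = defaultdict(list)

a[('Paris', 2005, 'Red', 4, 3, 65)].append(4.2)
print(a[('Paris', 2005, 'Red', 4, 3, 65)])  # [4.2]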
Edit to answer questions in comments:
The memory problem appeared early in the process, and the rest of the code is very far from ready, but I have included some more example/pseudo code below that hopefully gives a better idea of the usage (sorry for the wall of text this question is becoming).
The lists in the nested array will contain floats, not ints - I have updated the example code to reflect that.
The values come from a large CSV file which is processed row by row.
# Write example: Read from csv file, calculate values, append them to lists that are elements of the nested array a:
for csv_row in csv_rows:
    # The CSV rows either contain the same keys for the dict or "count up", meaning that city, year, color_code,
    # source_type_id for consecutive rows look something like:
    # 'London', 2001, 'Yellow', 3, ...
    # 'London', 2001, 'Yellow', 3, ...
    # 'London', 2001, 'Yellow', 4, ...
    # 'London', 2001, 'Yellow', 4, ...
    # 'London', 2001, 'Yellow', 4, ...
    # 'London', 2001, 'Red', 1, ...
    # ...
    # 'London', 2002, 'Green', 1, ...
    # precision_category and age, however, are in this respect arbitrary for each row
    city, year, color_code, source_type_id, precision_category = get_keys(csv_row)
    dict_of_floats = do_calculations(csv_row)  # Keys in this dict are ages, and values are to be appended to lists
    # in the array. Example dict: {33: 4.2, 10: 0.9, 63: 1.3, 64: 9.1, 44: 3.0, 7: 2.5}
    for age, value in dict_of_floats.items():
        # setdefault creates the missing list on the first value for this age
        a[city][year][color_code][source_type_id][precision_category].setdefault(age, []).append(value)
    # In the end, a[city][year][color_code][source_type_id][precision_category][age] is a list which may
    # contain anywhere from zero to thousands of values, on average maybe around 30 values
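Because the rows "count up" through (city, year, color_code, source_type_id), one idea would be to buffer only the current group in memory and flush it to disk whenever the group key changes. A rough sketch with shelve as the backend (csv_rows, get_keys and do_calculations are the same placeholders as above):

import shelve

def flush_group(db, group, buffer):
    # Write the buffered lists for one finished (city, year, color_code, source_type_id) group to disk
    for (prec, age), values in buffer.items():
        db['|'.join(map(str, group + (prec, age)))] = values

db = shelve.open('data.db')
current_group = None
buffer = {}  # (precision_category, age) -> list of floats for the current group only

for csv_row in csv_rows:
    city, year, color_code, source_type_id, precision_category = get_keys(csv_row)
    group = (city, year, color_code, source_type_id)
    if group != current_group:
        if current_group is not None:
            flush_group(db, current_group, buffer)
        current_group, buffer = group, {}
    for age, value in do_calculations(csv_row).items():
        buffer.setdefault((precision_category, age), []).append(value)

if current_group is not None:
    flush_group(db, current_group, buffer)  # flush the final group
db.close()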
# Read example: Plot two graphs - one for the standard deviation of the lists and one for their means - over all
# ages for a user-given city, year, color_code, source_type_id, precision_category (any of these may be 'All' for
# a weighted average over the relevant dimension (not implemented yet))
# (Other plots and analyses to be added later, mostly separate for each list)
stdev_series = []
mean_series = []
for age in sorted(a[city][year][color_code][source_type_id][precision_category]):  # sorted so ages plot in order
    stdev_series.append(get_stdev_of_list(a[city][year][color_code][source_type_id][precision_category][age]))
    mean_series.append(get_mean_of_list(a[city][year][color_code][source_type_id][precision_category][age]))
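For completeness, get_stdev_of_list and get_mean_of_list above are placeholders; a minimal version using the stdlib statistics module (Python 3.4+) could be:

import statistics

def get_mean_of_list(values):
    # Arithmetic mean of a list of floats
    return statistics.mean(values)

def get_stdev_of_list(values):
    # Sample standard deviation; raises StatisticsError on fewer than two values
    return statistics.stdev(values)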