Question

I am trying to create a large nested dict in Python, but the program runs out of memory (fails with MemoryError). (I am aware that 64-bit Python can use more memory than 32-bit, but I am looking for an option that will work on 32-bit as well as 64-bit.) Code:

# These lists are the keys for the nested dict:
cities = [
    'Amsterdam', 'Athens', 'Bangkok', 'Barcelona', 'Berlin', 'Brussels', 'Budapest', 'Cologne', 'Geneva', 'Kiev',
    'Lisbon', 'London', 'Lyon', 'Madrid', 'Manchester', 'Manila', 'Minsk', 'Moscow', 'New York', 'Oslo', 'Paris',
    'Prague', 'Rome', 'Sofia', 'Stockholm', 'Taipei', 'Tokyo', 'Vienna'
]
years = range(1950, 2013)
color_codes = ['Green', 'Yellow', 'Red']
source_type_ids = range(1, 6)
precision_categories = range(1, 4)
ages = range(150)

print("Number of elements (=lists) in array: {:,d}".format(
    len(cities) * len(years) * len(color_codes) * len(source_type_ids) * len(precision_categories) * len(ages))
)

# Create nested dict of lists with test values
a = {}
for city in cities:
    a[city] = {}
    for year in years:
        a[city][year] = {}
        for color_code in color_codes:
            a[city][year][color_code] = {}
            for source_type_id in source_type_ids:
                a[city][year][color_code][source_type_id] = {}
                for precision_category in precision_categories:
                    a[city][year][color_code][source_type_id][precision_category] = {}
                    for age in ages:
                        a[city][year][color_code][source_type_id][precision_category][age] = [
                            float(x) for x in range(30)
                        ]  # Just an example list of floats

print(a['Paris'][2005]['Red'][4][3][65])  # Not reached due to MemoryError

What might be a better way to store these data while working on them? I have come across many seemingly relevant technologies on stackoverflow and elsewhere, but I still have no idea which would be easy to use, fast, or otherwise good.

If anyone is familiar with one or more of the technologies below, or with other suitable ones, your comments would be much appreciated - even if just to say that you don't think such-and-such an option is a good fit for what I am trying to do: work with a large, multidimensional array that will not fit in memory all at once.

  • shelve (a rough sketch of how I picture using it follows this list)
  • pickle
  • HDF5 (h5py)
  • PyTables
  • Pandas HDFStore (based on PyTables)
  • numpy.memmap
  • SQLite
  • NoSQL, e.g. MongoDB (PyMongo)
  • Oracle Berkeley DB (PyBSDDB)
  • Hadoop (Pydoop) (map/reduce)
  • (The expression "Disk-based dictionaries" is used on the following webpage, but these libraries seem mostly out of date(?) except what is already mentioned above: https://wiki.python.org/moin/PersistenceTools)
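
To make the "disk-based dictionaries" idea concrete, here is a rough sketch of how I picture the simplest of these options, shelve from the standard library, with the nested keys flattened into a single string key. The file name and the key layout are just my guesses at something workable; whether this would hold up for a write-heavy workload is exactly what I don't know:

import shelve

# shelve behaves like a dict but is backed by a dbm file on disk.
# Flattening the six keys into one string means each list is read
# from and written back to disk independently.
def flat_key(city, year, color_code, source_type_id, precision_category, age):
    return '|'.join(map(str, (city, year, color_code, source_type_id, precision_category, age)))

with shelve.open('nested_data.shelf') as shelf:
    key = flat_key('London', 2001, 'Yellow', 3, 2, 33)
    values = shelf.get(key, [])
    values.append(4.2)
    shelf[key] = values  # the whole list has to be written back explicitly
    print(shelf[key])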

Some more info:

  • The lists will not have the same length.
  • The code will largely work on one list at a time, calculating such things as standard deviation, mean, and other quantities (I would really like to keep the lists instead of just accumulating sums).
  • Hence, query-like ways of accessing the data (like PyTables offers) are not a must.
  • Speed is important - the code will frequently read and/or change many lists. Writes will happen maybe 20 times more often than reads.
  • It would be nice if I could change the array shape (in particular add more cities in the code shown) without rebuilding everything, but this too is not a must.

Edit to answer questions in comments:

The memory problem appeared early in the process, and the rest of the code is very far from ready, but I have included some more example/pseudo code that hopefully will give a better idea of the usage (sorry for the wall of text question this is becoming).

The lists in the nested array will contain floats, not ints - I have now updated the example code to reflect that.

The values come from a large csv-file which is processed row by row.

# Write example: Read from csv file, calculate values, append them to lists that are elements of the nested array a:
for csv_row in csv_rows:
    # The CSV rows either contain the same keys for the dict or "count up", meaning that city, year, color_code,
    # source_type_id for consecutive rows look something like:
    # 'London', 2001, 'Yellow', 3, ...
    # 'London', 2001, 'Yellow', 3, ...
    # 'London', 2001, 'Yellow', 4, ...
    # 'London', 2001, 'Yellow', 4, ...
    # 'London', 2001, 'Yellow', 4, ...
    # 'London', 2001, 'Red', 1, ...
    # ...
    # 'London', 2002, 'Green', 1, ...
    # precision_category and age, however, are in this respect arbitrary for each row
    city, year, color_code, source_type_id, precision_category = get_keys(csv_row)
    dict_of_floats = do_calculations(csv_row)  # Key in this dict is age, and values are to be appended to lists in the
        # array.  Example dict: {33: 4.2, 10: 0.9, 63: 1.3, 64: 9.1, 44: 3.0, 7: 2.5}
    for age, value in dict_of_floats.items():
        a[city][year][color_code][source_type_id][precision_category][age].append(value)  # todo: handle 1st list item
            # In the end, a[city][year][color_code][source_type_id][precision_category][age] is a list which may
            # contain anywhere from zero to thousands of values, on average maybe around 30 values
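
For the "handle 1st list item" todo above, one way I could do it (at least while I am still using plain in-memory dicts) is collections.defaultdict, so that missing levels and lists are created on demand. This of course does nothing for the memory problem; it only removes the key-existence checks:

from collections import defaultdict

def nested_lists():
    # Six levels of keys (city, year, color_code, source_type_id, precision_category, age);
    # the innermost values are lists of floats, created on first access.
    return defaultdict(lambda: defaultdict(lambda: defaultdict(
        lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(list))))))

a = nested_lists()
a['London'][2001]['Yellow'][3][2][33].append(4.2)  # no KeyError, the list is created on demand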


# Read example: Plot two graphs - one for the standard deviation of the list and one for the mean - over all ages for
# user-given city, year, color_code, precision_category (any of these may be 'All' for a weighted average over the
# relevant dimension (not implemented yet))
# (Other plots and analyses to be added later, mostly separate for each list)
stdev_series = []
mean_series = []
for age in a[city][year][color_code][source_type_id][precision_category]:
    stdev_series.append(get_stdev_of_list(a[city][year][color_code][source_type_id][precision_category][age]))
    mean_series.append(get_mean_of_list(a[city][year][color_code][source_type_id][precision_category][age]))

Solution

As there is no way to fit these structures in memory at once, you need some kind of database to manage the data on disk.

If you are dealing with the data solely inside Python, you could use ZODB, which stores data structures on disk using Python's native pickle format, so the extra code needed to work with a disk-backed database is kept to a minimum. No data conversion or mapping is needed. Also, ZODB is blazingly fast.

Namely,

  • Use PersistentDict or BTrees to store your nested data

  • Do transaction.commit() after you have processed a few rows, so that the active data set in memory does not grow too large

  • When querying the data, organize your queries so that a single transaction does not iterate over all the data and thus force everything to be loaded into memory at once

ZODB has been used successfully with data sets of several gigabytes, even in a 32-bit process, so it should have no inherent limitations regarding data set size.
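
Below is a minimal sketch of how this could look for the structure in the question, assuming ZODB is installed (pip install ZODB, which pulls in BTrees, persistent and transaction). The file name, the get_list helper and the commit interval are only illustrative, and csv_rows, get_keys and do_calculations are the placeholders from your own pseudo code:

import ZODB, ZODB.FileStorage
import transaction
from BTrees.OOBTree import OOBTree
from persistent.list import PersistentList

storage = ZODB.FileStorage.FileStorage('nested_data.fs')  # illustrative file name
db = ZODB.DB(storage)
conn = db.open()
root = conn.root()

# Use a BTree per nesting level instead of a plain dict, so buckets can be
# loaded from and written back to disk independently of each other.
if 'a' not in root:
    root['a'] = OOBTree()
    transaction.commit()
a = root['a']

def get_list(a, city, year, color_code, source_type_id, precision_category, age):
    # Walk (and create where missing) the nested BTrees and return the
    # PersistentList holding the float values for this key path.
    node = a
    for key in (city, year, color_code, source_type_id, precision_category):
        if key not in node:
            node[key] = OOBTree()
        node = node[key]
    if age not in node:
        node[age] = PersistentList()
    return node[age]

# Write loop: append values row by row and commit every few thousand rows so
# the set of modified objects held in memory stays bounded.
for i, csv_row in enumerate(csv_rows):
    city, year, color_code, source_type_id, precision_category = get_keys(csv_row)
    for age, value in do_calculations(csv_row).items():
        get_list(a, city, year, color_code, source_type_id, precision_category, age).append(value)
    if i % 10000 == 0:
        transaction.commit()
        conn.cacheMinimize()  # release unmodified objects from the in-memory cache

transaction.commit()
db.close()

Reading the lists back for plotting works the same way: open the database, take root['a'] and walk the keys exactly as in your read example; ZODB loads only the buckets you actually touch.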

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow