Question

I am working on an algorithm in Python that uses arrays of int64s heavily. The arrays are typically sparse and are read from and written to constantly. I am currently using relatively large native arrays and the performance is good but the memory usage is high (as expected).

I would like to be able to have the array implementation not waste space for values that are not used and allow an index offset other than zero. As an example, if my numbers start at 1,000,000 I would like to be able to index my array starting at 1,000,000 and not be required to waste memory with a million unused values.

Array reads and writes needs to be fast. Expanding into new territory can be a small delay but reads and writes should be O(1) if possible.

Does anybody know of a library that can do it?

Thanks!

Updated to mention int64 as the data type.

Was it helpful?

Solution

It sounds like the blist type (documentation, download) might be just what you're looking for (disclaimer: I'm the author). It has exactly the same interface as Python's list, so there's no learning curve, but it has different performance characteristics. In particular, it can efficiently handle sparse lists in many cases. Below is an example that creates a list with 2**29 elements. It's pretty much instantaneous. Sparse lists created in this manner use O(log n) space.

>>> from blist import *
>>> x = blist([0])             # x is a blist with one element
>>> x *= 2**29                 # x is a blist with > 500 million elements
>>> x.append(5)                # append to x
>>> y = x[4:-234234]           # Take a 500 million element slice from x
>>> del x[3:1024]              # Delete a few thousand elements from x

Operations that iterate over the whole list still take O(n) time (remove, reverse, count, etc.). The documentation spells out the time-complexity for each operation, so you should be able to assess if it will meet your needs.

OTHER TIPS

You could remap a numpy sparse matrix into a sparse array - or consider using a hash table(a python dict). As far as the offset is concerned, just wrap whatever storage class you are using and make you own insert/lookup/remove methods.

I don't know any Python so this is probably an un-answer:

In some languages you can simulate a sparse array by defining a function from your index space into your data space. For example (pseudo-code):

f[1000000] = 32;
f[2000000] = 51;

In some languages (the best ones) the form of an array reference (eg f[3]) looks just like the form of a function call (eg f[3]). This is, of course, because an array defines a function from an index space into a data space. It's very easy to implement higher-dimensioned sparse arrays this way too.

Why not just use a dict?

sparse = dict()
sparse[100000] = 1234
sparse[123456] = 2345

Another option - at least if you're willing to implement one yourself - is a Page table. This is commonly used in virtual memory systems to map virtual addresses to physical addresses, and it works best if your address space is sparsely populated, and used addresses are clustered. If used addresses are randomly distributed, this will be less effective.

The basic approach of a page table is the same as a Trie - recursive subdivision. A page table has some fixed number of levels, and each node is an array of a fixed size. If the entry for a given subnode is null, all the leaves covered by that node are null. The main advantage of the page table is that lookups are fast, requiring only a few bit-shifts and dereferences.

Let's see a straightforward Python implementation of a 2-level pagetable:

class Pagetable(object):
  def __init__(self, num_bits=8):
    """Creates a new Pagetable with num_bits bits per level.

    Args:
      num_bits: The number of bits per pagetable level.
        A 2 level pagetable will be able to store indexes between 0 and 2^(num_bits*2).
    """
    self.num_bits = num_bits
    self.mask = (1 << num_bits) - 1
    self.root = [None] * (2 ** num_bits)

  def __getitem__(self, idx):
    page = self.root[idx >> self.num_bits]
    return page and page[idx & self.mask]

  def __setitem__(self, idx, val):
    page = self.root[idx >> self.num_bits]
    if not page:
      page = self.root[idx >> self.num_bits] = [None] * (2 ** self.num_bits)
    page[idx & self.mask] = val
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top