numpy.memmap for an array of strings?

https://stackoverflow.com/questions/5896747

29-10-2019

Question

Is it possible to use numpy.memmap to map a large disk-based array of strings into memory?

I know it can be done for floats and suchlike, but this question is specifically about strings.

I am interested in solutions for both fixed-length and variable-length strings.

The solution is free to dictate any reasonable file format.

Solution

If all the strings have the same length, as suggested by the term "array", this is easily possible:

a = numpy.memmap("data", dtype="S10")

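would be an example for byte strings of length 10. The file is then treated as a sequence of fixed-width 10-byte records, so shorter strings must be padded with NUL bytes when the file is written.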
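For illustration, here is a minimal round trip for the fixed-length case. It is only a sketch: the file name and the sample strings are made up, and it assumes the file is written by NumPy itself so that the padding matches the "S10" dtype.

```python
import os
import tempfile

import numpy as np

# Hypothetical file location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "strings.dat")

# Write three records; NumPy pads each one with NUL bytes up to 10 bytes.
words = np.array([b"alpha", b"beta", b"0123456789"], dtype="S10")
words.tofile(path)

# Map the file read-only. The number of records is inferred from the
# file size (30 bytes / 10 bytes per record).
a = np.memmap(path, dtype="S10", mode="r")

first = a[0]                  # NUL padding is stripped on element access
last = a[2].decode("ascii")   # decode explicitly to get a text string
```

Note that strings longer than the record width are truncated on write, and trailing NUL bytes that are part of the data cannot be told apart from padding.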

Edit: Since the strings apparently don't all have the same length, you need to index the file to allow O(1) item access. This requires reading the whole file once and storing the start offset of every string in memory. Unfortunately, I don't think there is a pure NumPy way of building that index without first creating an in-memory array the same size as the file. That array can be dropped once the offsets have been extracted, though.
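One way this indexing step might look, assuming the strings are stored one per line, each terminated by a newline, and that the file is not empty. The file name, the sample contents, and the helper get_line are invented for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical newline-delimited file, created only for this example.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "wb") as f:
    f.write(b"Yep\nA somewhat longer string\nlast\n")

# Map the raw bytes; pages are only read when the data is touched.
data = np.memmap(path, dtype=np.uint8, mode="r")

# The comparison builds a temporary boolean array as large as the file.
# It can be garbage-collected once the newline positions are extracted.
ends = np.flatnonzero(data == ord("\n"))
starts = np.concatenate(([0], ends[:-1] + 1))

def get_line(i):
    """Return line i as bytes, without its trailing newline."""
    return data[starts[i]:ends[i]].tobytes()
```

After the one-time scan, only starts and ends stay in memory (one integer each per string), and every lookup is a single slice of the mapped file.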

OTHER TIPS

The most flexible option would be to switch to a database or some other more complex on-disk file structure.

However, there's probably some good reason that you'd rather keep things as a plain text file...

Because you have control of how the files are created, one option is to simply write out a second file that only contains the starting positions (in bytes) of each string in the other file.

This would require a bit more work, but you could essentially do something like this:

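class IndexedText(object):
    """Text file of lines plus a sidecar file holding each line's start offset."""

    def __init__(self, filename, mode='r'):
        if mode not in ['r', 'w', 'a']:
            raise ValueError('Only read, write, and append are supported')
        self._mainfile = open(filename, mode)
        if mode == 'a':
            # A file opened with 'a' cannot be read, so use 'a+' and rewind
            # to load the existing offsets; writes still go to the end.
            self._idxfile = open(filename+'idx', 'a+')
            self._idxfile.seek(0)
        else:
            self._idxfile = open(filename+'idx', mode)

        if mode != 'w':
            # One offset per line, in the order the strings were written.
            self.indices = [int(line.strip()) for line in self._idxfile]
        else:
            self.indices = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._mainfile.close()
        self._idxfile.close()

    def __getitem__(self, idx):
        # Jump straight to the stored offset instead of scanning the file.
        position = self.indices[idx]
        self._mainfile.seek(position)
        # You might want to remove the automatic stripping...
        return self._mainfile.readline().rstrip('\n')

    def write(self, line):
        if not line.endswith('\n'):
            line += '\n'
        # Record where this line starts before writing it.
        position = self._mainfile.tell()
        self.indices.append(position)
        self._idxfile.write(str(position)+'\n')
        self._mainfile.write(line)

    def writelines(self, lines):
        for line in lines:
            self.write(line)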


def main():
    with IndexedText('test.txt', 'w') as outfile:
        outfile.write('Yep')
        outfile.write('This is a somewhat longer string!')
        outfile.write('But we should be able to index this file easily')
        outfile.write('Without needing to read the entire thing in first')

    with IndexedText('test.txt', 'r') as infile:
        print(infile[2])
        print(infile[0])
        print(infile[3])

if __name__ == '__main__':
    main()
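A possible variation on the same idea, sketched under my own assumptions rather than taken from the answer above: store the offsets as fixed-width 8-byte integers so that the index file can itself be memory-mapped instead of being parsed into a list on open. The names OFFSET, write_strings, and read_string, the ".idx" suffix, and the file layout are all invented for this example. Because each entry is delimited by offsets rather than newlines, strings may contain newlines.

```python
import mmap
import os
import struct
import tempfile

OFFSET = struct.Struct("<q")  # one little-endian 8-byte offset per entry

def write_strings(path, strings):
    """Write UTF-8 strings back to back, plus a binary offset index."""
    with open(path, "wb") as data, open(path + ".idx", "wb") as idx:
        pos = 0
        for s in strings:
            idx.write(OFFSET.pack(pos))
            raw = s.encode("utf-8")
            data.write(raw)
            pos += len(raw)
        idx.write(OFFSET.pack(pos))  # sentinel: end of the last string

def read_string(data_map, idx_map, i):
    """Fetch string i with two index lookups and one slice."""
    start, = OFFSET.unpack_from(idx_map, i * OFFSET.size)
    end, = OFFSET.unpack_from(idx_map, (i + 1) * OFFSET.size)
    return data_map[start:end].decode("utf-8")

# Hypothetical file location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "strings.bin")
write_strings(path, ["Yep", "multi\nline", "caf\u00e9"])

with open(path, "rb") as f, open(path + ".idx", "rb") as g:
    data_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    idx_map = mmap.mmap(g.fileno(), 0, access=mmap.ACCESS_READ)
    results = [read_string(data_map, idx_map, i) for i in (2, 0, 1)]
    idx_map.close()
    data_map.close()
```

The offsets here are byte positions, not character positions, which is why the data file is opened in binary mode and decoded only after slicing.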

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow