A faster solution for random text read/write in Python

https://stackoverflow.com/questions/19952619

30-07-2022

Question

I need a fast solution for random reads and writes of text snippets in Python. What I want to do is this:

Write the snippet and record a pointer
Use the pointer to retrieve the snippet
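
As a point of reference, the two steps above can be sketched with plain Python file methods and a pointer that is nothing more than a byte offset and a byte length. This is only a minimal sketch, assuming Python 3 and byte-string snippets; the names Pointer, write_snippet, and read_snippet are hypothetical and do not come from the question.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical two-part pointer: where the snippet starts and how long it is.
Pointer = namedtuple("Pointer", "offset length")

def write_snippet(path, data):
    """Append data (bytes) to the file and return a Pointer to it."""
    with open(path, "ab") as fh:
        fh.seek(0, os.SEEK_END)   # make sure tell() reports the end of the file
        offset = fh.tell()
        fh.write(data)
    return Pointer(offset, len(data))

def read_snippet(path, ptr):
    """Seek to the recorded offset and read exactly ptr.length bytes."""
    with open(path, "rb") as fh:
        fh.seek(ptr.offset)
        return fh.read(ptr.length)

# Usage: store two snippets in a scratch file, then fetch them in any order.
path = os.path.join(tempfile.mkdtemp(), "SnippetBase.txt")
p1 = write_snippet(path, b"first snippet")
p2 = write_snippet(path, b"second, longer snippet")
second = read_snippet(path, p2)
first = read_snippet(path, p1)
```

Only the Pointer tuples would need to go into the database; the file itself stays a flat append-only log.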

The snippets are of arbitrary length, and I have chosen not to store them in a database; only the pointers go there. Simply replacing Python's file methods with C functions (solution 1) is already pretty fast, and each pointer consists only of the "where" and the "how long" of the snippet. After that, I experimented with what I believe is the technique Berkeley DB really relies on. I don't know what to call it; "paging" of some kind, perhaps?

This code definitely works and runs 1.5 to 2 times faster than solution 1, but that is not a dramatic gain, and it needs a 4-part pointer. Perhaps the method is not worth it, but is there any room to improve it significantly?

The following is the code:

from collections import namedtuple
from ctypes import cdll,c_char_p,\
     c_void_p,c_size_t,c_long,\
     c_int,create_string_buffer
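libc = cdll.msvcrt  # Windows C runtime, so this code is Windows-only
fopen = libc.fopen
fread = libc.fread
fwrite = libc.fwrite
fseek = libc.fseek
ftell = libc.ftell
fflush = libc.fflush
fclose = libc.fclose

# Declare the signatures so FILE* handles are not truncated to a C int on 64-bit builds
fopen.restype = c_void_p
fopen.argtypes = [c_char_p, c_char_p]
fread.argtypes = fwrite.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
fread.restype = fwrite.restype = c_size_t
fseek.argtypes = [c_void_p, c_long, c_int]
ftell.argtypes = fflush.argtypes = fclose.argtypes = [c_void_p]
ftell.restype = c_long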

#######################################################
# The following is how to write a snippet into the SnippetBase file

# Named Pointer so that the instance created below does not shadow the class
Pointer = namedtuple('Pointer','blk1, start, nblk, length')
snippet = '''
blk1: the first blk where the snippet is
start: the start of this snippet
nblk: number of blocks this snippet takes
length: length of this snippet
'''
bsize = 4096 # bsize: block size

# 'ab' appends; 'wb' would truncate the file and lose every earlier snippet
fh = fopen('.\\SnippetBase.txt','ab')
fseek(fh,0,2) # 2 = SEEK_END
pos1 = divmod(ftell(fh),bsize)
fwrite(snippet,c_size_t(len(snippet)),1,fh)
fflush(fh)
pos2 = divmod(ftell(fh),bsize)
ptr = Pointer(pos1[0],pos1[1],pos2[0]-pos1[0]+1,len(snippet))
fclose(fh)


#######################################################
# The following is how to read the snippet from the SnippetBase file

fh = fopen('.\\SnippetBase.txt','rb')
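fseek(fh,c_long(ptr.blk1*bsize),0) # 0 = SEEK_SET: absolute offset from the start of the file
buff = create_string_buffer(ptr.nblk*bsize)
# item size 1, count = number of bytes, so a short final block is still a valid partial read
fread(buff,1,c_size_t(ptr.nblk*bsize),fh)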
print buffer(buff,ptr.start,ptr.length)
fclose(fh)
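
The 4-part pointer in the code above carries the same information as a plain (offset, length) pair, which the following sketch tries to make explicit. It assumes Python 3; BlockPointer, to_block_pointer, and to_offset are hypothetical names. Note one deliberate difference: this version counts only the blocks that actually contain snippet bytes, so it can report one block fewer than the pos2-based count above when a snippet ends exactly on a block boundary.

```python
from collections import namedtuple

BlockPointer = namedtuple("BlockPointer", "blk1 start nblk length")
BSIZE = 4096  # block size, as in the question

def to_block_pointer(offset, length, bsize=BSIZE):
    """Derive the block-based pointer from a byte offset and a byte length."""
    blk1, start = divmod(offset, bsize)
    # Index of the block that holds the last byte of the snippet.
    last_blk = (offset + max(length, 1) - 1) // bsize
    return BlockPointer(blk1, start, last_blk - blk1 + 1, length)

def to_offset(ptr, bsize=BSIZE):
    """Recover the absolute byte offset from a block-based pointer."""
    return ptr.blk1 * bsize + ptr.start

ptr = to_block_pointer(offset=10000, length=5000)
print(ptr)             # BlockPointer(blk1=2, start=1808, nblk=2, length=5000)
print(to_offset(ptr))  # 10000
```

Since blk1, start, and nblk can all be recomputed from the offset and the length, storing two numbers and deriving the rest at read time would keep the pointer small without giving up block-aligned reads.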

Solution

This looks like a hard and non-portable way to optimize away one thing: the memory allocation performed by the Python wrappers file.read and os.read. All of the other parts are easily done with functions that already exist in the Python standard library. bytearray even gives you a simple way to allocate a read/write buffer. The io module has a readinto method, which file objects provide; I strongly suspect it avoids the allocation. On the most popular operating systems we can go one step further, however, by using the OS disk buffer directly instead of allocating memory local to our process. This is done with mmap (though it becomes tricky when the file is too large to fit in your address space). For a non-allocating way to read data out of an mmapped file, simply use buffer(mm, offset, size) (in Python 3, memoryview(mm)[offset:offset + size]).
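
A minimal sketch of both suggestions, assuming Python 3 (where memoryview takes the place of buffer). The scratch file, the offset, and the length are made-up values for illustration, and whether readinto really saves an allocation in your workload is something to measure rather than take from this example.

```python
import mmap
import os
import tempfile

# Build a scratch file to read from (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "SnippetBase.txt")
with open(path, "wb") as fh:
    fh.write(b"0123456789" * 1000)

offset, length = 4100, 20  # hypothetical pointer into the file

# 1) readinto: fill a preallocated, reusable bytearray instead of
#    letting read() allocate a new bytes object on every call.
buf = bytearray(64)
with open(path, "rb", buffering=0) as fh:  # unbuffered raw file object
    fh.seek(offset)
    # A raw readinto may return fewer bytes than requested, so keep n.
    n = fh.readinto(memoryview(buf)[:length])
snippet_a = bytes(buf[:n])

# 2) mmap: slice the OS page cache through a memoryview without copying.
with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as whole:
            with whole[offset:offset + length] as view:
                # tobytes() copies; call it only when a bytes object is needed.
                snippet_b = view.tobytes()
```

The nested with blocks release the memoryviews before the map is closed; closing an mmap while views of it are still exported raises BufferError.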

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow