Question

So recently I took on, as a personal project, building my very own DB in Python, mainly because I hate messing around with most DBs and I needed something easy to set up, portable, and simple for studying large data sets.

I now find myself stuck on a problem: an efficient way to delete a line from the DB file (which is really just a text file). The way I found to do it is to write all of the content that comes after the line back over it, and then truncate the file (I'm open to suggestions on better ways to do it). The problem arises when I need to write that content back, because doing it all at once could load millions of lines into RAM at once. The code follows:

ln = 11  # line to be deleted (1-indexed)
with open("test.txt", "r+") as f:
    readlinef = f.readline
    for i in xrange(ln):
        line = readlinef()

    # Offset and length of the line to delete
    length, start = (len(line), f.tell() - len(line))
    f.seek(0, 2)                          # jump to the end of the file
    chunk = f.tell() - (start + length)   # bytes that come after the line
    f.seek(start + length, 0)

    # How to make this buffered?
    data = f.read(chunk)
    f.seek(start, 0)
    f.write(data)
    f.truncate()

Right now that's reading all of the data at once; how would I make that last code block work in a buffered fashion? The start position would shift every time a new chunk of data is written before it, and I was wondering what would be the most efficient and fastest (execution-time-wise) way to do this.

Thanks in advance.

Edit

I've decided to follow the advice submitted here, but just for curiosity's sake I found a way to read and write in chunks. It follows:

with open("test.txt", "r+") as f:
    readlinef = f.readline
    for i in xrange(ln):
        line = readlinef()

    start, length = (f.tell()-len(line), len(line))

    readf = f.read
    BUFFER_SIZE = 1024 * 1024

    x = 0
    chunk = readf(BUFFER_SIZE)
    while chunk:
        f.seek(start, 0)
        f.write(chunk)
        start += BUFFER_SIZE
        f.seek(start+length+(x*BUFFER_SIZE), 0)
        chunk = readf(BUFFER_SIZE)

    f.truncate()

Solution 2

You can do this the same way that (effectively) memmove works: seek back and forth between the source range and the destination range:

# `start` is the offset where the deleted line began, `deleted_line_size` its
# length, `size` the number of bytes after it, and `chunksize` the buffer size.
count = (size + chunksize - 1) // chunksize
for chunk in range(count):
    f.seek(start + chunk * chunksize + deleted_line_size, 0)
    buf = f.read(chunksize)
    f.seek(start + chunk * chunksize, 0)
    f.write(buf)
f.truncate(start + size)

Using a temporary file and shutil makes it a lot simpler, and, despite what you'd expect, it may actually be faster. (There's twice as much writing, but a whole lot less seeking, and mostly block-aligned writing.) For example:

import shutil, tempfile

# f is assumed to be positioned just past the deleted line here.
with tempfile.TemporaryFile('w+b') as ftemp:
    shutil.copyfileobj(f, ftemp)       # copy the tail out to the temp file...
    ftemp.seek(0, 0)
    f.seek(start, 0)
    shutil.copyfileobj(ftemp, f)       # ...and copy it back over the deleted line
f.truncate()

However, if your files are small enough to fit in your virtual memory space (which they probably are in 64-bit land, but may not be in 32-bit land), it may be simpler to just mmap the file and let the OS/libc take care of the work:

import mmap

m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)   # 0 maps the whole file
# `end` is the current file size; shift everything after the line back over it
m[start:end - deleted_line_size] = m[start + deleted_line_size:end]
m.close()
f.seek(end - deleted_line_size)
f.truncate()

OTHER TIPS

Answering your question "How would I do that?" concerning indices and vacuum.

Disclaimer: This is a very simple example that in no way compares to existing DBMSs, and I strongly advise against it.

Basic idea:

For each table in your DB, keep several files: some for your object ids (row ids, record ids) and some (page files) for the actual data. Let's suppose that each record is of variable length.

Each record has a table-unique OID. These are stored in the oid-files. Let's name the table "test" and the oid files "test.oidX". Each record in the oid file is of fixed length and each oid file is of fixed length.

Now if "test.oid1" reads:

0001:0001:0001:0015 #oid:pagefile:position:length
0002:0001:0016:0100
0004:0002:0001:0001

It means that record 1 is in page file 1, at position 1, and has length 15. Record 2 is in page file 1 at position 16 with length 100, and so on.
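
For illustration, here is a minimal sketch of how such a fixed-width oid record could be parsed and used to fetch its payload from a page file. The page-file naming ("test.page1", etc.) and the 1-based positions are assumptions read off the example above, not part of the design.

# Hypothetical reader for the oid layout above: four colon-separated,
# zero-padded fields per line: oid:pagefile:position:length.
def parse_oid_record(line):
    oid, pagefile, position, length = (int(part) for part in line.strip().split(":"))
    return oid, pagefile, position, length

def fetch_record(table, oid_line):
    oid, pagefile, position, length = parse_oid_record(oid_line)
    if oid == 0:                        # 0000 marks an empty (deleted) cell
        return None
    # Assumed naming scheme for page files: "<table>.page<N>"
    with open("%s.page%d" % (table, pagefile), "rb") as page:
        page.seek(position - 1)         # positions in the example are 1-based
        return page.read(length)

# e.g. fetch_record("test", "0002:0001:0016:0100") returns the 100-byte
# payload of record 2 from page file "test.page1".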

Now when you want to delete a record, just touch the oid file. E.g. for deleting record 2, edit it to:

0001:0001:0001:0015
0000:0001:0016:0100 #0000 indicating empty cell
0004:0002:0001:0001

And don't even bother touching your page files.
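
Since the oid records are of fixed length, flagging a record as deleted amounts to overwriting its oid field in place. A rough sketch, assuming the 19-character records from the example plus a trailing newline:

RECORD_WIDTH = 20   # "oooo:pppp:pppp:llll" plus a newline (assumed)

def mark_deleted(oid_path, oid):
    # Flag record `oid` as an empty cell by zeroing its oid field;
    # the page file itself is left untouched.
    with open(oid_path, "r+b") as f:
        pos = 0
        record = f.read(RECORD_WIDTH)
        while record:
            if int(record[:4].decode()) == oid:
                f.seek(pos)
                f.write(b"0000")
                return True
            pos += RECORD_WIDTH
            record = f.read(RECORD_WIDTH)
    return False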

This will create holes in your page files. Now you need to implement some "maintenance" routine which moves blocks in your page files around, etc., and which could run either on user request or automatically when your DBMS has nothing else to do. Depending on which locking strategy you use, you might need to lock the affected records or the whole table.
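
A very rough sketch of what such a maintenance pass over a single page file could look like: it copies only the live records into a fresh page file and returns their new positions, so the oid file can be rewritten afterwards. The tuple layout and the 1-based positions are assumptions carried over from the example.

def vacuum_page_file(oid_entries, old_page_path, new_page_path):
    # `oid_entries` is a list of (oid, position, length) tuples for this page
    # file, with oid == 0 marking an empty cell.  Returns {oid: new_position}.
    new_positions = {}
    with open(old_page_path, "rb") as old, open(new_page_path, "wb") as new:
        for oid, position, length in oid_entries:
            if oid == 0:                   # skip holes left by deleted records
                continue
            old.seek(position - 1)         # 1-based positions, as in the example
            data = old.read(length)
            new_positions[oid] = new.tell() + 1
            new.write(data)
    return new_positions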

Also, when you insert a new record and you find a hole big enough, you can insert it there.
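
Finding a reusable hole on insert is then just a scan of the oid file for an empty cell that is at least as large as the new record, for example (again only a sketch on top of the assumed layout):

def find_hole(oid_entries, needed):
    # First-fit search over (oid, pagefile, position, length) tuples,
    # where oid == 0 marks an empty cell.  Returns a usable hole or None.
    for oid, pagefile, position, length in oid_entries:
        if oid == 0 and length >= needed:
            return pagefile, position, length
    return None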

If your oid files should also function as an index (slow inserts, fast queries), you will need to rebuild them (surely on insertion, maybe on deletion).

Operations on the oid files should be fast, as they are of fixed length and contain fixed-length records.

This is just the very basic idea, not touching topics like search trees, hashing, etc, etc.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow