Question

This snippet pulls all of the docs out of my database and dumps them into a gzip-compressed file. docs_to_dump is a Django QuerySet containing all of the text documents to be dumped.

os.chdir(dump_dir)
filename = 'latest-' + court_id + '.xml.gz.part'
with myGzipFile(filename, mode='wb') as z_file:
    z_file.write('<?xml version="1.0" encoding="utf-8"?>\n<opinions dumpdate="' + str(date.today()) + '">\n')

    for doc in docs_to_dump:
        row = etree.Element("opinion",
            dateFiled           = str(doc.dateFiled),
            precedentialStatus  = doc.documentType,
            local_path          = str(doc.local_path),
            time_retrieved      = str(doc.time_retrieved),
            download_URL        = doc.download_URL,
            caseNumber          = doc.citation.caseNumber,
            caseNameShort       = doc.citation.caseNameShort,
            court               = doc.court.get_courtUUID_display(),
            sha1                = doc.documentSHA1,
            source              = doc.get_source_display(),
            id                  = str(doc.documentUUID),
        )
        if doc.documentHTML != '':
            row.text = doc.documentHTML
        else:
            row.text = doc.documentPlainText.translate(null_map)
        z_file.write('  ' + etree.tostring(row).encode('utf-8') + '\n')

    # Close things off
    z_file.write('</opinions>')

Unfortunately, it also consumes so much memory that the OS nukes it. I thought that by writing to a "File-like object", the compressed file would get made on the fly, and that memory would remain relatively low. Instead, it's taking up hundreds of MB, then crashing.

I'm not an expert on compression, but my impression is that the whole compressed file is getting stored in memory.
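One rough way to check that would be to write a pile of data through a plain GzipFile and watch whether the .gz file grows on disk while the writes are happening (the filename below is just a throwaway for the test):

import os
import gzip

# If GzipFile streams, the on-disk size should keep growing as data is
# written, rather than only appearing when the file is closed.
test_file = gzip.GzipFile('streaming-test.gz', 'wb')
for i in range(100):
    test_file.write(os.urandom(1024 * 1024))  # incompressible data, so growth is obvious
    if i % 20 == 0:
        print i, os.path.getsize('streaming-test.gz')
test_file.close()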

Is there a better way I ought to be doing this?

EDIT -- The whole file is here: https://bitbucket.org/mlissner/search-and-awareness-platform-courtlistener/src/2ca68efd8017/data-dumps/data-dumper.py


Solution

I think andrewski might be right. If it is crashing, try adjusting your query to use the iterator() method.

Something like this:

docs_to_dump = Document.objects.all().order_by('court').iterator()

That should keep Django from loading your entire QuerySet into memory.
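For reference, here is a minimal sketch of how the dump loop could look with the iterator in place. It reuses the names from your snippet (myGzipFile, dump_dir, court_id and the Document fields, with etree assumed to be lxml.etree), so treat it as an outline rather than drop-in code:

import os
from datetime import date
from lxml import etree

# iterator() streams rows straight from the database cursor instead of
# caching every Document instance on the QuerySet.
docs_to_dump = Document.objects.all().order_by('court').iterator()

os.chdir(dump_dir)
filename = 'latest-' + court_id + '.xml.gz.part'
with myGzipFile(filename, mode='wb') as z_file:
    z_file.write('<?xml version="1.0" encoding="utf-8"?>\n'
                 '<opinions dumpdate="' + str(date.today()) + '">\n')

    for doc in docs_to_dump:
        # Only the current row is in memory before it is written out.
        row = etree.Element("opinion", id=str(doc.documentUUID))
        # ... set the remaining attributes and row.text as in your snippet ...
        z_file.write('  ' + etree.tostring(row) + '\n')

    z_file.write('</opinions>')

The gzip side of your code should already be streaming to disk; the memory most likely went into the QuerySet's result cache, which iterator() bypasses.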

Licensed under: CC-BY-SA with attribution