Try opening a file as an archive, otherwise read as a regular file

https://stackoverflow.com/questions/15538338

28-03-2022

Question

I am trying to process a list of files, where each may be a regular text file OR a bz2 archive.

How can I use try-except blocks most efficiently to attempt to open each file in the appropriate format? I would rather not check the file's extension, as this cannot always be relied upon (and is not very EAFP).

Currently I am doing:

import bz2
import codecs

def data_generator(*corpora):
    def parse_lines(fobj):
        for line in fobj:
            # Do lots of processing.
            # ...
            # Many lines here omitted.
            yield ('lots', 'of', 'data')

    for corpus in corpora:
        try:
            with bz2.BZ2File(corpus, mode='r') as f:
                for data in parse_lines(f):
                    yield data
        except IOError:
            with codecs.open(corpus, encoding='utf-8') as f:
                for data in parse_lines(f):
                    yield data

The repeated for data in parse_lines(f): ... loop looks redundant, but I can't think of a way to get rid of it. Is there a way to remove this duplication, or is there another way to "smart open" a file?

Edit: Optional followup

What would be an appropriate way to scale up the number of formats checked? As an example, the program 7zip allows you to right-click on any file and attempt to open it as an archive (any that 7zip supports). With the current try-except block strategy, it seems like you would start getting nested in blocks pretty quickly even after just a few formats, like:

try:
    f = ...
except IOError:
    try:
        f = ...
    except IOError:
        try:
            ...

Solution

If it's really just the duplicate loops that concern you, you could assign f inside the try-except block and then put a single copy of the loop after it:

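f = bz2.BZ2File(corpus, mode='r')
try:
    # BZ2File validates lazily, so force a read here; otherwise a non-bz2
    # file only fails later, inside the loop (peek needs Python 3.3+).
    f.peek(1)
except IOError:
    f.close()
    f = codecs.open(corpus, encoding='utf-8')
try:
    for data in parse_lines(f):
        yield data
finally:
    f.close()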

That said, I'd look into opening the file only once, checking for the bzip2 header (the characters BZ as the first two bytes, followed by h), and using that to decide whether to keep reading it as plain text or to pass the data to a bz2.BZ2Decompressor instance.
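The header check could be sketched roughly as follows, in a table-driven form that also covers the follow-up about more formats. This is only an illustration under some assumptions: the names smart_open and MAGIC_OPENERS are made up for this example, it targets Python 3.3+ (where bz2.open, gzip.open and lzma.open accept text mode), and for simplicity it reopens the file after sniffing the header instead of reusing a single handle.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of each compressed format, paired with an opener.
# The table and the name smart_open are hypothetical, for illustration only.
MAGIC_OPENERS = [
    (b"BZh", bz2.open),            # bzip2
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"\xfd7zXZ\x00", lzma.open),  # xz
]
LONGEST_MAGIC = max(len(magic) for magic, _ in MAGIC_OPENERS)


def smart_open(path, encoding="utf-8"):
    """Return a text-mode file object, decompressing if the header matches."""
    # Read just enough bytes to compare against every known signature.
    with open(path, "rb") as raw:
        header = raw.read(LONGEST_MAGIC)
    for magic, opener in MAGIC_OPENERS:
        if header.startswith(magic):
            return opener(path, mode="rt", encoding=encoding)
    # No known signature: treat the file as plain text.
    return open(path, mode="r", encoding=encoding)
```

With something like this, the loop body reduces to with smart_open(corpus) as f: followed by a single for data in parse_lines(f): loop, and supporting another format means adding one row to the table rather than another nested try block. One caveat: a plain text file that happens to start with one of these byte sequences would be misdetected, so a fallback on decompression errors may still be worth having.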

Licensed under: CC-BY-SA with attribution
