Question

I was having trouble with a Python script opening a file that contained an umlaut character. Naturally I thought I could correct this with a Unicode UTF-8 fix, but not so...

I ended up using the mbcs encoding (the default is cp1252).

Then I wrote this function, which I would like to make MUCH cleaner:

def len(fname):
    i = -1
    try:
        with open(fname, encoding='mbcs') as f:
            for i, l in enumerate(f):
                pass
    except UnicodeDecodeError:
        try:
            i = -1
            with open(fname, encoding='utf8') as f:
                for i, l in enumerate(f):
                    pass
        except UnicodeDecodeError:
            i = -1
            with open(fname) as f:
                for i, l in enumerate(f):
                    pass
    return i + 2 # 2 because it starts at -1 not 0

Solution

You're almost certainly going about this all wrong, as explained in the comments… but if you really do need to do something like this, here's how to simplify it:

The general solution to avoid repeating yourself is to use a loop. You've got the same code three times, with the only difference being the encoding, so loop over three encodings instead. (In your case, the third attempt didn't pass an encoding at all, so you do have to know the default value of the parameter, which is None, but the docs or help will tell you that.) The only wrinkle is that you apparently don't want to handle exceptions in the third case; the easiest way to do that is to reraise the last exception if they all fail.

While we're at it: There's no need to "declare" i up-front the way you do; the for loop is just going to start at 0 and overwrite whatever you put there (the initial value only matters for an empty file, where the loop body never runs). That also means the +2 at the end is wrong: after the loop, i is the index of the last line, so the count is i + 1. But there's an easier way to get the length of an iterable in the first place: feed a generator expression into something that consumes it. A custom ilen function written in C would be ideal, but people have tested various different Python implementations, and sum(1 for _ in iterable) is almost as fast as the perfect solution, and dead simple, so it's the most common idiom. If this isn't obvious to you, factor it out as a function, call it ilen, and give it a nice docstring and/or comment. Or just pip install more-itertools and then you can just call more_itertools.ilen(f).
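Factored out, the counting idiom described above might look like this. It is only a minimal sketch: the helper is hand-written for illustration and is not the more-itertools implementation.

```python
def ilen(iterable):
    """Return the number of items in iterable, consuming it in the process."""
    # sum() pulls every item out of the generator expression and adds 1 for each
    return sum(1 for _ in iterable)
```

It works on anything you can loop over, including an open file object, so ilen(f) gives the number of lines in f.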

Anyway, putting it all together:

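def len(fname):
    last_error = None
    for encoding in 'mbcs', 'utf8', None:
        try:
            with open(fname, encoding=encoding) as f:
                return sum(1 for line in f)
        except UnicodeDecodeError as e:
            # Python 3 unbinds e when the except block ends, so keep a reference
            last_error = e
    # every encoding failed: reraise the last error
    raise last_error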

OTHER TIPS

It’s not entirely clear to me what you want: if you just want to count the lines, ignore the errors! This is pretty safe, as practically all encodings use the same ASCII-compatible line endings (UTF-16 being the notable exception).

open(fname, errors='ignore')

And you never get an exception. Done.
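Combined with the counting idiom, that could look something like the sketch below. The function name count_lines and the utf-8 default are assumptions made for illustration, and the UTF-16 caveat above still applies.

```python
def count_lines(fname, encoding="utf-8"):
    """Count the lines in a text file, silently dropping undecodable bytes."""
    # errors="ignore" skips bytes that cannot be decoded instead of raising
    # UnicodeDecodeError; the line endings themselves are unaffected
    with open(fname, encoding=encoding, errors="ignore") as f:
        return sum(1 for _ in f)
```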

Licensed under: CC-BY-SA with attribution