Remove code duplication from RLE pattern, without resorting to Haskell?

https://stackoverflow.com/questions/11358598

19-06-2021
|

Pregunta

The RLE (run length encoding) pattern seems to come up a lot in my work.

The essence of it is that you are outputting a reduction of the elements encountered since the last 'break' each time that you see a 'break' or you reach the end of the input.

(In actual RLE, the 'break' is just this character not matching the last character, but in the real world it's usually a little more complex, but still a function of the current and last elements.)

I want to remove the duplicate last_val != None: rle.append((last_val, count)) condition and action which occur both in the loop and at the end.

The issues are:

replacing them with function calls results in more code, not less.
keeping it in imperative style (in Haskell, for example, the problem just evapourates).

The imperative Python code is:

#!/usr/bin/env python

data = "abbbccac"

if __name__ == '__main__':
  rle = []
  last_val = None
  count = 0;

  for val in data:
    if val != last_val and last_val != None:
      rle.append((last_val, count))
      count = 1
    else:
      count += 1
    last_val = val
  if last_val != None:
    rle.append((last_val, count))

  print rle

P.S. Trivially solvable in functional languages:

#!/usr/bin/env runhaskell
import Data.List (group)

dat = "abbbccac"

rle :: Eq a => [a] -> [(a, Int)]
rle arr = map (\g -> (head g, length g)) $ group arr

main :: IO ()
main = print $ rle dat

Solución

Here is a more imperative form. You can eliminate your duplicate code by adding or chaining to a throwaway sentinel that will never match any of your list elements, forcing an end-of-sequence pass through your "this-not-equal-last" code:

from itertools import chain

def rle(seq):
    ret = []
    sentinel = object()
    enum = enumerate(chain(seq,[sentinel]))
    start,last = next(enum)
    for i,c in enum:
        if c != last:
            ret.append((last,i-start))
            start,last = i,c
    return ret

This even gracefully handles the case where the input seq is empty, and the input can be any sequence, iterator, or generator, not just a string.

Otros consejos

Trivially solvable using the Python "batteries included":

>>> data = "abbbccac"
>>> from itertools import groupby
>>> ilen = lambda gen : sum(1 for x in gen)
>>> print [(ch, ilen(ich)) for ch,ich in groupby(data)]
[('a', 1), ('b', 3), ('c', 2), ('a', 1), ('c', 1)]

groupby returns an iterator of 2-tuples. The first is the value that represents the next group, and the second is an iterator you can use to iterate over the items in the group. You just want the length of the group, but you can't take the length directly, so I added the simple ilen lambda function to compute the length for an iterator.

Simple if I understood it right ...

from itertools import groupby

data = "abbbccac"

print [(k, len(list(g))) for k,g in groupby(data)]

Wow - got very surprising results comparing Paul's very similar function to mine. It turns out the loading the list is 10-100 times faster. Even more surprising is the that the list implementation has bigger advantage as the blocks get larger.

I guess this is why I ♥ Python - it makes the terse expression work better - even if it sometimes seems like magic.

Check the data for yourself:

from itertools import groupby
from timeit import Timer

data = "abbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccac" 

def rle_walk(gen):
    ilen = lambda gen : sum(1 for x in gen)
    return [(ch, ilen(ich)) for ch,ich in groupby(data)]

def rle_list(data):
    return [(k, len(list(g))) for k,g in groupby(data)]

# randomy data
t = Timer('rle_walk("abbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccac")', "from __main__ import rle_walk; gc.enable()")
print t.timeit(1000)

t = Timer('rle_list("abbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccacabbbccac")', "from __main__ import rle_list; gc.enable()")
print t.timeit(1000)

# chunky blocks
t = Timer('rle_walk("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccccccccccccccccccccccccccccccccc")', "from __main__ import rle_walk; gc.enable()")
print t.timeit(1000)

t = Timer('rle_list("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccccccccccccccccccccccccccccccccc")', "from __main__ import rle_list; gc.enable()")
print t.timeit(1000)

Gives me results of (smaller is better):

1.42423391342    # lambda and walk - small blocks
0.145968914032   # make list - small blocks
1.41816806793    # lambda and walk - large blocks
0.0165541172028  # make list - large blocks

You can at least get rid of the else clause if you use enumerate to keep track of how many values you've seen and store last_val and last_index (or (last_val, last_index) as a tuple). E.g.

last_index = 0
last_val = None

for index, val in enumerate(data):
    if val != last_val and last_val is not None:
        rle.append((last_val, index - last_index + 1))
        last_val = val
        last_index = index
    last_val = val
if last_val is not None:
    rle.append((last_val, index - last_index + 1))

you can also start the enumerate at any point and it'll count up (so you could initialize it with enumerate(data, last_index). It looks like you want the count to start at 1, which is why I added the + 1 part.

Enumerate just counts how many elements have been produced from the iterator, doesn't matter the type.

I prefer the groupby solution but the most convenient way to write an imperative solution is often as a generator:

data = "abbbccac"

def rle(xs):
    def g():
        last = object()
        n = 0
        for x in xs:
            if x != last:
                yield last,n
                last = x
                n = 0
            n += 1
    return list(g())[1:]

print rle(data)

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow