Pregunta

Months ago when I was learning about Huffman coding, I used this code to try some things out. To build the frequency list I used Counter with my function genFreq_original() since I wanted a generic input (images as numpy arrays or text). However, when creating the Huffman tree of an image, there was a problem. I could bypass it by creating the frequency list with a different function, genFreq_fix(). As far as I can tell, the frequency lists of an image with either of these two functions is the same. a is the one created with genFreq_original and it fails, while b is the other one and works with everything.

The error I got while creating the tree with makeHuffTree_original() is the following:

Traceback (most recent call last):
  File "C:\Users\ip507\Desktop\huffman.py", line 67, in <module>
    makeHuffTree_original(a)
  File "C:\Users\ip507\Desktop\huffman.py", line 29, in makeHuffTree_original
    childL, childR = heappop(tree), heappop(tree)
ValueError: setting an array element with a sequence.

I printed my variables in the function for each iteration to spot any inconsistency, but nothing. From my understanding of how a Huffman tree is created, I modified the function (makeHuffTree_test()) to do without heaps; just straight sorting the list, removing the first two elements and putting them back into the list as a combined parent node. But that time I got a different error.

Traceback (most recent call last):
  File "C:\Users\ip507\Desktop\huffman.py", line 68, in <module>
    makeHuffTree_test(a)
  File "C:\Users\ip507\Desktop\huffman.py", line 37, in makeHuffTree_test
    tree = sorted(tree)
ValueError: setting an array element with a sequence.

Yet, both of these two functions give the same result when b is the input frequency list. In my attempt to trace the error, I created the function sideBySize() which is the same as makeHuffTree_original(), but the trees of both frequencies are created at the same time. By comparing the result of both in each step, I found no difference until the sudden error. Other things I have tried include:

  • a == b is always True
  • type(a) == type(b) is always True, as well as all respective individual elements
  • I created a random numpy array with the same size as the image and values between 0 and 255 in case numpy is to blame, but surprisingly, both a and b work!
  • Suggested: repr(a), repr(b). They are both equivalent to each other as well as to str(a). repr(a) just prints the contents of the variable, which is the frequency list.

I have revisited this problem quite a few times and it has always perplexed me. While there are many ways to code Huffman encoding and I have a workaround for my issue, the educational curiosity is still burning. Can someone explain this sorcery?

I'm using Python 2.7.5 on Windows 7 64bit. Here is my code. Minimum functional code posted after this.

import urllib2 as urllib
import io

import Image
import numpy as np
from collections import Counter
from heapq import heappop, heappush, heapify

def genFreq_original(data):
    if isinstance(data, (np.ndarray, np.generic)):
        freqList = Counter(data.flat)
    else:
        freqList = Counter(data)
    tree = [(freqList[key], key) for key in freqList.keys()]
    return tree

def genFreq_fix(data):
    Nx, Ny = data.shape
    freqList = np.zeros(shape=(256,), dtype=int)
    for i in xrange(Nx):
        for j in xrange(Ny):
            freqList[data[i,j]] += 1
    tree = [(freqList[i], i) for i in xrange(256) if freqList[i] != 0]
    return tree

def makeHuffTree_original(tree):
    heapify(tree)
    while len(tree) > 1:
        childL, childR = heappop(tree), heappop(tree)
        parent = (childL[0] + childR[0], childL, childR)
        heappush(tree, parent)
        #print len(tree)
    print 'Successful'

def makeHuffTree_test(tree):
    while len(tree) > 1:
        tree = sorted(tree)
        childL, childR = tree.pop(0), tree.pop(0)
        parent = (childL[0] + childR[0], childL, childR)
        tree.append(parent)
        #print len(tree)
    print 'Successful'

def sideBySide(a, b):
    heapify(a), heapify(b)
    while len(a) > 1 and len(b) > 1:
        d1, d2 = heappop(b), heappop(b)
        c1, c2 = heappop(a), heappop(a)
        p2 = (d1[0] + d2[0], d1, d2)
        p1 = (c1[0] + c2[0], c1, c2)
        heappush(b, p2)
        heappush(a, p1)
        #print len(a), all([a == b, p1 == p2, c1 == d1, c2 == d2])
    print 'Successful'

url = urllib.urlopen('http://media-cdn.tripadvisor.com/media/photo-s/02/e7/dc/22/akumal-beach-resort.jpg')
fname = io.BytesIO(url.read())

dat = np.asarray(Image.open(fname).convert('L'), dtype=int)  # image array
#dat = (np.random.random(dat.shape)*256).astype(int)          # random numpy array

a = genFreq_original(dat)                                    # This list fails
b = genFreq_fix(dat)                                         # This list works
#print 'a == b: ', a == b

#print makeHuffTree_test(b) == makeHuffTree_original(b)       # same result with both functions
makeHuffTree_original(a)                                     # heappop error
#makeHuffTree_test(a)                                         # sorted(list) error
#sideBySide(a, b)                                             # step-by-step comparison of a and b

Minimal functional code

The question boils down to this: given two variables, a and b, which seem identical, why does makeHuffTree_original() behave differently to both of them?

import urllib2 as urllib
import io

import Image
import numpy as np
from collections import Counter
from heapq import heappop, heappush, heapify

def genFreq_original(data):
    if isinstance(data, (np.ndarray, np.generic)):
        freqList = Counter(data.flat)
    else:
        freqList = Counter(data)
    tree = [(freqList[key], key) for key in freqList.keys()]
    return tree

def genFreq_fix(data):
    Nx, Ny = data.shape
    freqList = np.zeros(shape=(256,), dtype=int)
    for i in xrange(Nx):
        for j in xrange(Ny):
            freqList[data[i,j]] += 1
    tree = [(freqList[i], i) for i in xrange(256) if freqList[i] != 0]
    return tree

def makeHuffTree_original(tree):
    heapify(tree)
    while len(tree) > 1:
        childL, childR = heappop(tree), heappop(tree)
        parent = (childL[0] + childR[0], childL, childR)
        heappush(tree, parent)
        #print len(tree)
    print 'Successful'

url = urllib.urlopen('http://media-cdn.tripadvisor.com/media/photo-s/02/e7/dc/22/akumal-beach-resort.jpg')
fname = io.BytesIO(url.read())

dat = np.asarray(Image.open(fname).convert('L'), dtype=int)  # image array

a = genFreq_original(dat)           # This list fails
b = genFreq_fix(dat)                # This list works
print 'a == b:', a == b

makeHuffTree_original(b)            # Works
makeHuffTree_original(a)            # heappop error
¿Fue útil?

Solución

The problem results from a combination of the following factors. First, there are subtle type differences between the elements of a and b:

>>> map(type, a[0])
[<type 'int'>, <type 'numpy.int32'>]
>>> map(type, b[0])
[<type 'numpy.int32'>, <type 'int'>]

a is a list of (int, numpy.int32) tuples, while b is a list of (numpy.int32, int) tuples. (At least, these are the types I get when I initialized dat as dat = (np.random.random((256, 256))*256).astype(int). I couldn't replicate what you were doing with Image.)

Second, while an ordinary Python int will return a consistent but nonsense result when compared with a sequence, NumPy types attempt to broadcast the comparison:

>>> 1 < [1, 1]
True
>>> numpy.int32(1) < [1, 1]
array([False, False], dtype=bool)
>>> numpy.int32(1) < [1, [1]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.

If the sequence isn't a valid array-like object, the comparison will fail with the error you saw.

Third, your heap relies on Python's consistent nonsense comparisons for incompatible types. If two heap elements have the same frequency, Python will attempt to compare the remaining elements of the tuples. Those elements don't have a meaningful comparison. With ordinary Python types, this wouldn't be too much of a problem, but when NumPy types enter the picture, things break. (In Python 3, this would break even without NumPy, since incompatible types raise TypeError on comparison.)

The fix is to prevent comparison of the incompatible objects. Instead, heap elements with the same frequency should be ordered by a different, arbitrary order. The standard way is to insert a unique ID between the elements you want compared and the elements you don't:

def genFreq(data):
    if isinstance(data, (np.ndarray, np.generic)):
        freqList = Counter(data.flat)
    else:
        freqList = Counter(data)
    tree = [(freqList[key], i, key) for i, key in enumerate(freqList)]
    return tree

You'll need to make sure the elements you push in the Huffman tree creation routine have unique IDs, too:

import itertools

def makeHuffTree(tree):
    heapify(tree)
    ids = itertools.count(len(tree))
    while len(tree) > 1:
        childL, childR = heappop(tree), heappop(tree)
        parent = (childL[0] + childR[0], next(ids), childL, childR)
        heappush(tree, parent)
        #print len(tree)
    print 'Successful'
Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top