Datatype mismatch causing comparison failure? Python UDF in Pig

https://stackoverflow.com/questions/19211728

30-06-2022

Question

I'm having trouble with my Python UDF for use in Pig scripts. I believe the problem is that I assumed my input, deltas, arrives in a format it isn't actually in, but I'm not sure how to fix it (Python n00b).

Note: On Cloudera (cdh4.3) distro of Hadoop v.2.0.0, Pig v.0.11.0, Python 2.4.3.

import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

@outputSchema("adj:float")
def cumRelFreqAdj(deltas):

    # create bins of increment 0.01
    a = [i*-0.01 for i in range(100)]
    a = a[1:len(a)]
    b = [i*0.01 for i in range(101)]
    a.extend(b)
    a.sort()
    bins = a

    # build cumulative relative frequency distribution
    cumfreq = [0]*200
    for delta in deltas:
        for bin in range(len(bins)):
            if delta <= bins[bin]:
                cumfreq[bin] += 1

    cumrelfreq = [float(cumfreq[i]) / max(cumfreq) for i in range(len(cumfreq))]

    crf = zip(bins, cumrelfreq)

    for relfreq in crf[:]:
        if relfreq[1] > 0.11:    # 10%ile
            adj = relfreq[0] + 0.05
            break

    return adj

Do I need to convert my input to a list first?

Solution

Answered my own question. The input from Pig is a bag of tuples. In my case each tuple has one element, e.g.: {(-0.01), (-0.03), (0.00001), (-0.2383), (0.158)}.
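To make that shape concrete, here is a small plain-Python stand-in for the bag. The list-of-tuples literal is an assumption for illustration (it is not produced by Pig here), but it mirrors the example values above and shows that the loop variable is a tuple while the float sits inside it.

```python
# Hypothetical stand-in for the bag Pig passes to the UDF:
# a list of one-element tuples, using the example values above.
deltas = [(-0.01,), (-0.03,), (0.00001,), (-0.2383,), (0.158,)]

first = deltas[0]
print(type(first).__name__)      # type of what "for delta in deltas" yields
print(type(first[0]).__name__)   # type of the value stored inside the tuple

# Collect the distinct type names at both levels.
element_types = sorted(set(type(d).__name__ for d in deltas))
inner_types = sorted(set(type(d[0]).__name__ for d in deltas))
```

Python 2 generally orders mismatched types by an arbitrary but consistent rule instead of raising an error, so a tuple-versus-float comparison can go wrong silently. On Python 3 the same comparison raises a TypeError.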

So in order to compare each delta with the float elements of the bins list, I need to insert something like:

delta = list(delta)[0]

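between lines 16 & 17 of the code above, i.e. right after the "for delta in deltas:" line and before the inner loop over the bins, to pull out the float-typed data element that is the content of the tuple. Then the "delta <= bins[bin]" comparison on line 18 compares two floats and works as intended.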
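For reference, here is a minimal sketch of the function with that one-line fix in place. It is written as plain Python so it can be tried outside Pig: the @outputSchema decorator is left out, the name cum_rel_freq_adj is hypothetical, and the None returns for an empty bag or no qualifying bin are my own assumption (the original code does not handle those cases).

```python
def cum_rel_freq_adj(deltas):
    # Bin edges from -0.99 to 1.00 in steps of 0.01 (200 edges, as in the question).
    bins = [i * 0.01 for i in range(-99, 101)]

    # Count, for every bin edge, how many deltas are <= that edge.
    cumfreq = [0] * len(bins)
    for delta in deltas:
        delta = list(delta)[0]  # the fix: unwrap the single-field tuple
        for idx in range(len(bins)):
            if delta <= bins[idx]:
                cumfreq[idx] += 1

    total = max(cumfreq)
    if total == 0:
        # Assumption: an empty bag (or no value within the bins) yields None.
        return None

    # Return the first edge whose cumulative relative frequency exceeds 0.11,
    # shifted by 0.05 as in the original code.
    for edge, count in zip(bins, cumfreq):
        if float(count) / total > 0.11:
            return edge + 0.05
    return None


# Hypothetical stand-in for the bag, using the example values from above.
sample_bag = [(-0.01,), (-0.03,), (0.00001,), (-0.2383,), (0.158,)]
result = cum_rel_freq_adj(sample_bag)
print(result)
```

Using delta[0] instead of list(delta)[0] should work just as well if the element is already a tuple; the list(...) form is kept here only because it matches the fix above.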

Licensed under: CC-BY-SA with attribution
