
I am trying to write a Bloom filter builder in Pig using the built-in BuildBloom and Bloom UDFs. The syntax for calling the BuildBloom UDF is:

define bb BuildBloom('hash_type', 'vector_size', 'false_positive_rate');

where the vector size and false positive rate arguments are passed in as chararrays. Since I don't necessarily know the vector size beforehand, but it is always available within the script prior to calling the BuildBloom UDF, I want to use the built-in COUNT UDF instead of some hard-coded value. Something like:

records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE 
    (long)     $0 AS value_fld:long, 
    (chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);
define bb BuildBloom('jenkins', n, '$false_positive_rate');

The problem is that when I describe n I get: n: {count: chararray}. Predictably, the BuildBloom UDF call fails because it got a tuple as input where it expected a simple chararray. How should I pull just the chararray (i.e. the integer return from COUNT cast to a chararray) and assign that to n for use in the call to BuildBloom(...)?

EDIT: Here is the resulting error when I attempted to pass N::count into the BuildBloom(...) UDF. describe N yields: N {count: chararray}. The offending line (line 40) reads: define bb BuildBloom('jenkins', N::count, '$fpr');

ERROR 1200: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:604)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: Failed to parse: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
    ... 14 more


If you're using the grunt shell, the obvious way to do this is to run DUMP n;, wait for the job to finish, and then copy the value into your define bb BuildBloom(...) call.

That's not a very satisfying answer, I'm guessing. Most likely you'll want to run this in a script. Here is a very hacky way to do it. You'll need 3 files:

  1. 'n_start.txt' which contains:

  2. 'n_end.txt' which contains the single character:

  3. 'bloom_build.pig' which contains:

    define bb BuildBloom('jenkins', '$n', '0.0001');

Once you have those you can run this script:

records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE 
    (long)     $0 AS value_fld:long, 
    (chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') 
    AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE 
    (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);

--the new part
STORE records_count INTO 'n' USING PigStorage(',');
--this will copy what you just stored into a local directory
fs -copyToLocal n n
--this will cat the two static files we created prior to running pig
--with the count we just generated.  it will pass it through tr which will
--strip out the newlines and then store it into a file called 'n.txt' which we
--will use as a parameter file
sh cat -s n_start.txt n/part-r-00000 n_end.txt | tr -d '\n' > n.txt
--RUN makes pig call one script within another.  Be forewarned that if pig returns 
--a message with an error on a certain line, it is the line number of the expanded script
RUN -param_file n.txt bloom_build.pig;

After this, you can call bb as you had originally intended. It's ugly, and someone better versed in Unix could possibly get rid of the n_start.txt and n_end.txt files.
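
On getting rid of the two static files: here is a minimal sketch, assuming their only job is to wrap the stored count so that n.txt ends up as a single parameter-file line of the form n=<count>. The function name make_param_file is made up for illustration, and the check that the count is purely numeric is an extra safeguard that the original steps do not have.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig): build a parameter file holding
# a single line "n=<count>" from the count that Pig stored.
#   $1: local file with the count (e.g. n/part-r-00000 after copyToLocal)
#   $2: parameter file to create (e.g. n.txt)
make_param_file() {
    # tr strips newlines, as in the original cat | tr pipeline
    count=$(tr -d '\n' < "$1") || return 1
    # refuse to write the file if the count is empty or not all digits
    case "$count" in
        ''|*[!0-9]*) echo "unexpected count: '$count'" >&2; return 1 ;;
    esac
    printf 'n=%s\n' "$count" > "$2"
}
```

It would be called as make_param_file n/part-r-00000 n.txt, most likely from a wrapper shell script that runs between the counting script and the script that defines bb, since the function has to be defined in the shell that calls it.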

The other option, which is cleaner but more involved, is to write a new UDF that (like BuildBloom) extends BuildBloomBase but has an empty constructor and handles everything in the exec() method.

Other tips

In the BuildBloom UDF call you are passing "n" as an argument, and it is a tuple. Referencing the field as "n::columnname" might work. Try that.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow