Question

I am trying to write a bloom filter builder in PIG making use of the builtin BuildBloom and Bloom UDFs. The syntax for calling the BuildBloom UDF is:

define bb BuildBloom('hash_type', 'vector_size', 'false_positive_rate');

where the vector size and false positive rate arguments are passed in as chararrays. Since I don't necessarily know the vector size beforehand, but it is always available within the script prior to calling the BuildBloom UDF, I want to use the builtin COUNT UDF instead of some hard-coded value. Something like:

records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE 
    (long)     $0 AS value_fld:long, 
    (chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);
define bb BuildBloom('jenkins', n, '$false_positive_rate');

The problem is that when I describe n I get: n: {count: chararray}. Predictably, the BuildBloom UDF call fails because it got a tuple as input where it expected a simple chararray. How should I pull just the chararray (i.e. the integer return from COUNT cast to a chararray) and assign that to n for use in the call to BuildBloom(...)?

EDIT: Here is the resulting error when I attempted to pass N::count into the BuildBloom(...) UDF. describe N yields: N {count: chararray}. The offending line (line 40) reads: define bb BuildBloom('jenkins', N::count, '$fpr');

ERROR 1200: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1607)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1546)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:991)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:604)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: Failed to parse: <file buildBloomFilter.pig, line 40, column 32>  mismatched input 'N::count' expecting set null
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:235)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:177)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1599)
    ... 14 more

Solution

If you're using the grunt shell then the obvious way to do this is to call DUMP n;, wait for the job to finish running and then copy the value into your define bloom... call.

That's not a very satisfying answer, I'm guessing. Most likely you'll want to run this in a script. Here is a very hacky way to do it. You'll need 3 files:

  1. 'n_start.txt' which contains:

    n='
    
  2. 'n_end.txt' which contains the single character:

    '
    
  3. 'bloom_build.pig' which contains:

    define bb BuildBloom('jenkins', '$n', '0.0001');
    

Once you have those you can run this script:

records = LOAD '$input' using PigStorage();
records = FOREACH records GENERATE 
    (long)     $0 AS value_fld:long, 
    (chararray)$1 AS filter_fld:chararray;
records_fltr = FILTER records by (filter_fld=='$filter_value') 
    AND (value_fld is not null);
records_grp = GROUP records_fltr all;
records_count = FOREACH records_grp GENERATE 
    (chararray) COUNT(records_fltr.value_fld) AS count:chararray;
n = FOREACH records_count GENERATE flatten(count);

--the new part
STORE records_count INTO 'n' USING PigStorage(',');
--this will copy what you just stored into a local directory
fs -copyToLocal n n
--this will cat the two static files we created prior to running pig
--with the count we just generated.  it will pass it through tr which will
--strip out the newlines and then store it into a file called 'n.txt' which we
--will use as a parameter file
sh cat -s n_start.txt n/part-r-00000 n_end.txt | tr -d '\n' > n.txt
--RUN makes pig call one script within another.  Be forewarned that if pig returns 
--a message with an error on a certain line, it is the line number of the expanded script
RUN -param_file n.txt bloom_build.pig;

After this, you can call bb as you had previously intended to do. It's ugly and possibly someone better versed in unix could get rid of the n_start.txt and n_end.txt files.
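On the point about dropping n_start.txt and n_end.txt: a minimal sketch of one way to build the same parameter file with printf alone, written as a standalone shell snippet. The temporary directory and the count value 12345 are stand-ins for what fs -copyToLocal would leave behind, and I have not checked whether Pig's sh command passes this quoting through unchanged, so it may need to live in a small wrapper script instead.

```shell
# Stand-in for the directory produced by `fs -copyToLocal n n`
# (the count 12345 is a made-up example value).
workdir=$(mktemp -d)
mkdir -p "$workdir/n"
printf '12345\n' > "$workdir/n/part-r-00000"

# Command substitution strips the trailing newline, so tr is not needed.
count=$(cat "$workdir/n/part-r-00000")

# Write n='<count>' with no trailing newline, matching what the
# cat | tr pipeline produces from the two static files.
printf "n='%s'" "$count" > "$workdir/n.txt"

cat "$workdir/n.txt"
```

The resulting n.txt has the same contents as the one built from the two static files, so the RUN -param_file step stays unchanged.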

The other option that is cleaner but more involved is to write a new UDF that (like BuildBloom) extends BuildBloomBase.java but has an empty constructor and can handle everything in the exec() method.

Other tips

In the BuildBloom UDF call you are passing "n" as an argument, and "n" is a tuple. Perhaps "n::columnname" will work; try that.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow