Use fields in one tuplestream as part of regex in RegexParser on second tuplestream

https://stackoverflow.com/questions/22752394

24-06-2023

Question

I'm trying to read a CSV file from HDFS, parse it with Cascading, and then use fields from the resulting tuple stream to build the regular expressions that RegexParser applies to a second tuple stream. As far as I can tell, the only way to do this is to write a custom Function of my own, and I was wondering whether anybody knows a way to do it with the existing Java API instead.

Pointers on how to write my own function to do this inside the Cascading framework would be welcome, too.

I'm running Cascading 2.5.1.

Solution

The best resource for this question is the Palo Alto Cascading example tutorial (CoPA). It's written in Java and provides examples of many use cases, including writing custom functions.

https://github.com/Cascading/CoPA/wiki

And yes, your best option is to write a custom function that takes the regex as one of its arguments, alongside the text to parse:

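import java.util.regex.Matcher;
import java.util.regex.Pattern;
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class SampleFunction extends BaseOperation implements Function
{
     public void operate( FlowProcess flowProcess, FunctionCall functionCall )
     {
         TupleEntry arguments = functionCall.getArguments();
         String regex = arguments.getString( 0 ); // first argument field: the pattern
         String input = arguments.getString( 1 ); // second argument field: the text to parse
         // Placeholder for the real regex step: emit the first match, or null if none.
         Matcher matcher = Pattern.compile( regex ).matcher( input );
         String parsed = matcher.find() ? matcher.group() : null;

         Tuple result = new Tuple();
         result.add( parsed );
         functionCall.getOutputCollector().add( result );
     }
}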
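
The regex step inside operate() is plain java.util.regex, so it can be tried outside of a flow first. Below is a minimal sketch of that step only. The class name RegexStep and the method firstMatch are hypothetical helpers invented for illustration (they are not part of Cascading), and "return the first match" is an assumption about what the parsing should do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the regex step of the function; not a Cascading class.
class RegexStep
{
    // Returns the first substring of input matched by regex, or null when
    // either value is missing or nothing matches.
    static String firstMatch( String regex, String input )
    {
        if( regex == null || input == null )
            return null;

        Matcher matcher = Pattern.compile( regex ).matcher( input );
        return matcher.find() ? matcher.group() : null;
    }

    public static void main( String[] args )
    {
        // In a flow, these two values would come from argument fields 0 and 1.
        System.out.println( firstMatch( "[0-9]+", "order 42 shipped" ) ); // prints 42
        System.out.println( firstMatch( "[0-9]+", "no digits here" ) );   // prints null
    }
}
```

For the function to see both values, the regex and the text have to arrive in the same tuple, for example by joining the two streams (Cascading provides HashJoin and CoGroup pipes for this) before applying the function with an Each pipe. Compiling the pattern for every tuple can be costly, so caching compiled patterns by regex string may be worth considering if the same expressions repeat.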

Licensed under: CC-BY-SA with attribution
