There's no need to use streaming for this; Pig can handle it already. Use the built-in UDF REGEX_EXTRACT:
$ cat record.txt
f1 f2 M2534896R402Qnew f4
f1 f2 M2534896R987Qxyz f4
f1 f2 M2534897R421Qabc f4
f1 f2 M47Rzxcvzxcv f4
f1 f2 12345M000R f4
f1 f2 M23551Qnew f4
f1 f2 M298793R133R23Qnew f4
$ cat test.pig
raw = load 'record.txt' using PigStorage('\t') as (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
ext = FOREACH raw GENERATE REGEX_EXTRACT(f3, 'M(\\d+)R', 1) AS num;
DUMP ext;
$ pig -x local test.pig
(2534896)
(2534896)
(2534897)
(47)
(000)
()
(298793)
Note that the result of REGEX_EXTRACT is a chararray. If you want an int, you'll have to cast it.
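As a minimal sketch of that cast, assuming the same record.txt and schema as above (the alias names ext_int and matched are only illustrative), the extracted value can be wrapped in an explicit (int) cast. Rows where the pattern does not match should come through as null, so the optional FILTER drops them:

```pig
raw = LOAD 'record.txt' USING PigStorage('\t') AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
-- REGEX_EXTRACT yields a chararray; the explicit (int) cast converts it.
ext_int = FOREACH raw GENERATE (int) REGEX_EXTRACT(f3, 'M(\\d+)R', 1) AS num;
-- Optional: drop rows where the pattern did not match (num is null).
matched = FILTER ext_int BY num IS NOT NULL;
DESCRIBE matched;
DUMP matched;
```

DESCRIBE is included only to confirm that num is now typed as int rather than chararray. If the numbers can exceed the int range, casting to long instead would be the safer choice.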