Question

I am new to Pig and to shell pattern matching.

I have a file and the 3rd column has content like "M2534896R402Qnew". I need to extract the number between 'M' and 'R'.

In the Pig script, I have:

raw = load 'record.txt' using PigStorage('\t') as (chararray, chararray,chararray,chararray);
data = stream raw through `shell command`;

How can I replace the 3rd column of every row in data with the number extracted from the corresponding value in raw?

Thanks.


Solution

There's no need to use streaming for this. Pig can handle it already. Use the built-in UDF REGEX_EXTRACT:

$ cat record.txt
f1      f2      M2534896R402Qnew        f4
f1      f2      M2534896R987Qxyz        f4
f1      f2      M2534897R421Qabc        f4
f1      f2      M47Rzxcvzxcv    f4
f1      f2      12345M000R      f4
f1      f2      M23551Qnew      f4
f1      f2      M298793R133R23Qnew      f4

$ cat test.pig
raw = load 'record.txt' using PigStorage('\t') as (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
ext = FOREACH raw GENERATE REGEX_EXTRACT(f3, 'M(\\d+)R', 1) AS num;
DUMP ext;

$ pig -x local test.pig
(2534896)
(2534896)
(2534897)
(47)
(000)
()
(298793)

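Note that REGEX_EXTRACT returns a chararray, and null when the pattern does not match (the empty tuple in the output above). If you want an int, cast the result, e.g. (int)REGEX_EXTRACT(f3, 'M(\\d+)R', 1). To replace the 3rd column while keeping the others, list them in the GENERATE clause: FOREACH raw GENERATE f1, f2, REGEX_EXTRACT(f3, 'M(\\d+)R', 1) AS f3, f4;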
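
For comparison, if streaming were still preferred as in the question, the command behind `stream` might look roughly like the sketch below. It is only an illustration, not part of the answer above: the function name extract_m_r is hypothetical, and it assumes rows arrive on stdin as tab-separated text with the value in column 3. It uses awk's match() with a POSIX character class in place of the \d from the Pig regex.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): reads tab-separated
# rows on stdin and rewrites column 3 to the digits inside the first
# M<digits>R run; the other columns are passed through unchanged.
extract_m_r() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    # match() sets RSTART and RLENGTH for the leftmost M<digits>R run
    if (match($3, /M[0-9]+R/)) {
      $3 = substr($3, RSTART + 1, RLENGTH - 2)   # drop the M and the R
    } else {
      $3 = ""                                    # no match: leave the field empty
    }
    print
  }'
}

# Example use against the sample file: extract_m_r < record.txt
```

Unlike REGEX_EXTRACT, which yields null for a non-matching row, this sketch emits an empty field. The built-in remains the simpler choice, since it avoids launching an external process for the relation.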

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow