I am new to Pig and to shell pattern matching.

I have a file and the 3rd column has content like "M2534896R402Qnew". I need to extract the number between 'M' and 'R'.

In the PIG script,

raw = load 'record.txt' using PigStorage('\t') as (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
data = stream raw through `shell command`;

How can I rewrite the 3rd column so that, in every row of data, it holds only the number extracted from the corresponding row of raw?

Thanks.

Solution

There's no need to use streaming for this. Pig can handle it already. Use the built-in UDF REGEX_EXTRACT:

$ cat record.txt
f1      f2      M2534896R402Qnew        f4
f1      f2      M2534896R987Qxyz        f4
f1      f2      M2534897R421Qabc        f4
f1      f2      M47Rzxcvzxcv    f4
f1      f2      12345M000R      f4
f1      f2      M23551Qnew      f4
f1      f2      M298793R133R23Qnew      f4

$ cat test.pig
raw = load 'record.txt' using PigStorage('\t') as (f1:chararray, f2:chararray, f3:chararray, f4:chararray);
ext = FOREACH raw GENERATE REGEX_EXTRACT(f3, 'M(\\d+)R', 1) AS num;
DUMP ext;

$ pig -x local test.pig
(2534896)
(2534896)
(2534897)
(47)
(000)
()
(298793)

Note that the result of REGEX_EXTRACT is a chararray. If you want an int you'll have to cast it.
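
The example above emits only the extracted number. Since the question asks to keep the other columns and replace just the 3rd one, here is a minimal sketch of that variant, assuming the same record.txt and the same field names (f1 to f4) used above. The relation name data and the cast to int are illustrative choices, not something the original answer prescribes.

```pig
-- Sketch: keep every column, replacing f3 with the number between 'M' and 'R'.
-- Assumes the tab-separated record.txt shown above.
raw = LOAD 'record.txt' USING PigStorage('\t')
      AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray);

-- REGEX_EXTRACT returns a chararray (or null when the pattern does not match);
-- the (int) cast converts it to a number and is optional.
data = FOREACH raw GENERATE
           f1,
           f2,
           (int) REGEX_EXTRACT(f3, 'M(\\d+)R', 1) AS f3,
           f4;

DUMP data;
```

Rows whose 3rd column does not match the pattern (such as M23551Qnew) should carry a null in that position. Casting to int also drops leading zeros, so drop the cast if a value like 000 must be preserved as text.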

Licensed under: CC-BY-SA with attribution