Problem

How do I extract the last element of a tuple/bag in Pig?

I have a string field in a relation in Pig.

I want to extract the last token of this string as a new field. How should I do that?

Example:

Our relation is

(id:int, description:chararray)

The description field is a long string, and its last token is the last name of the person with that id, e.g.

(123,' here is the description for John Edwards');

What I want is to extract the last name from this string as a separate field and end up with the following relation

(id:int, lastname:chararray)

i.e.

(123,'Edwards')

Solution

For the solution, let us assume that your input relation is called data:

data = LOAD 'data' AS (id:int, description:chararray);

lastName = FOREACH data GENERATE id,REGEX_EXTRACT(description,'\\s([A-Za-z]+)$',1) as lastname:chararray;

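This extracts the last word of the string: the pattern matches a run of letters that is preceded by whitespace and anchored at the end of the string, and capture group 1 is returned as lastname. Names containing characters other than A-Z and a-z (a hyphen or an apostrophe, for example) will not match.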
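To see what the pattern does and does not match before running a Pig job, it can be tried outside Pig. The following is a minimal sketch in Python, not part of the original answer; the helper name last_name is hypothetical, and it assumes that REGEX_EXTRACT searches for the pattern anywhere in the string, which the pattern above already relies on.

```python
import re

# Same pattern as in the Pig script. Pig needs the backslash doubled inside
# its string literal; a Python raw string does not.
LAST_WORD = re.compile(r"\s([A-Za-z]+)$")


def last_name(description):
    """Return the trailing alphabetic word, or None when nothing matches."""
    match = LAST_WORD.search(description)  # search, not a full-string match
    return match.group(1) if match else None


# Sample row from the question.
row = (123, " here is the description for John Edwards")
print((row[0], last_name(row[1])))
```

A description that is a single word, that ends in trailing whitespace, or whose last word contains an apostrophe gives no match here. If any of those cases can occur in the real data, the pattern needs to be loosened accordingly.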

Other tips

Since the question title asks about the last element of a bag, here is code for that case, applied to a slightly different data set:

{"uid":"23423423423","payments":[{"timestamp":"2014-11-12 10:21","payment_id":1,"data":"payment 1 data"},{"timestamp":"2014-12-12 07:20","payment_id":2,"data":"payment 2 data"}]}

Pig script would look like this:

data = LOAD '$INPUT' 
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json: map[]);

data = FOREACH data GENERATE 
    json#'uid' as uid:chararray,
    json#'payments' as payments:bag{};

row = FOREACH data {
    item = ORDER payments BY * DESC;
    item = LIMIT item 1;
    item = FOREACH item GENERATE $0 as arr:map[];
    item = FOREACH item GENERATE 
        arr#'timestamp' as timestamp:chararray,
        arr#'payment_id' as payment_id:int,
        arr#'data' as data:chararray;
    GENERATE uid, FLATTEN(item) as (timestamp, payment_id, data);
};

DUMP row;
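A Pig bag is an unordered collection, so "last" only has a meaning relative to a sort key; the script above gets it by sorting in descending order and keeping one row. The same idea can be sketched in Python on the sample record. This is an illustration only: the helper name latest_payment is hypothetical, and it assumes that timestamp is the intended sort key, whereas the Pig script sorts on the whole tuple.

```python
import json

# Sample record from the answer above.
record = json.loads(
    '{"uid":"23423423423","payments":['
    '{"timestamp":"2014-11-12 10:21","payment_id":1,"data":"payment 1 data"},'
    '{"timestamp":"2014-12-12 07:20","payment_id":2,"data":"payment 2 data"}]}'
)


def latest_payment(payments):
    """Sort descending by timestamp and keep the first entry (ORDER ... DESC, LIMIT 1)."""
    # "YYYY-MM-DD HH:MM" strings sort chronologically when compared as text.
    ordered = sorted(payments, key=lambda p: p["timestamp"], reverse=True)
    return ordered[0] if ordered else None


last = latest_payment(record["payments"])
print((record["uid"], last["timestamp"], last["payment_id"], last["data"]))
```

Sorting on an explicit field makes it clear which element counts as the last one, and it does not depend on the order in which the loader happened to place the elements in the bag.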
License: CC-BY-SA with attribution
Not affiliated with StackOverflow