Question

How do I extract the last element of a tuple/bag in Pig?

I have a string field in a relation in Pig.

I want to extract the last token of this string as a new field. How should I do that?

Example:

Our relation is:

(id:int, description:chararray)

The description field is a long string, and its last token is the last name of the person with that id, e.g.

(123,' here is the description for John Edwards');

What I want is to extract the last name from this string as a separate field, producing the following relation:

(id:int, lastname:chararray)

i.e.

(123,'Edwards')

Solution

For the solution, let us assume that your input relation is called data:

data = LOAD 'data' AS (id:int, description:chararray);

lastName = FOREACH data GENERATE id,REGEX_EXTRACT(description,'\\s([A-Za-z]+)$',1) as lastname:chararray;

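This extracts the last word of the string: the pattern matches a run of ASCII letters that is preceded by whitespace and ends the string, and capture group 1 returns those letters.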
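If last names may contain characters other than plain letters (a hyphen or an apostrophe, say), or if the description may end in stray whitespace, a looser pattern is one option. The sketch below is a variant, not part of the original answer: it assumes that "last token" means the final run of non-whitespace characters, and it reuses the same relation and field names as above.

```pig
-- Assumed variant: capture the final run of non-whitespace characters,
-- tolerating trailing whitespace after the last token.
data = LOAD 'data' AS (id:int, description:chararray);

lastName = FOREACH data GENERATE id, REGEX_EXTRACT(description, '(\\S+)\\s*$', 1) as lastname:chararray;
```

The trade-off is that any trailing punctuation attached to the last token is captured as well, so this only fits data where the final token really is the name.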

Other tips

Since the question title asks about finding the last element in a bag, here is code that does that for a slightly different data set:

{"uid":"23423423423","payments":[{"timestamp":"2014-11-12 10:21","payment_id":1,"data":"payment 1 data"},{"timestamp":"2014-12-12 07:20","payment_id":2,"data":"payment 2 data"}]}

The Pig script would look like this:

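-- Requires the elephant-bird jars to be REGISTERed beforehand.
-- Load each JSON record as a map; '-nestedLoad' also parses nested objects and arrays.
data = LOAD '$INPUT'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json: map[]);

-- Pull out the user id and the bag of payments.
data = FOREACH data GENERATE 
    json#'uid' as uid:chararray,
    json#'payments' as payments:bag{};

row = FOREACH data {
    -- Sort the payments in descending order and keep only the first one.
    item = ORDER payments BY * DESC;
    item = LIMIT item 1;
    -- Each bag element is a map; project its entries into typed fields.
    item = FOREACH item GENERATE $0 as arr:map[];
    item = FOREACH item GENERATE 
        arr#'timestamp' as timestamp:chararray,
        arr#'payment_id' as payment_id:int,
        arr#'data' as data:chararray;
    GENERATE uid, FLATTEN(item) as (timestamp, payment_id, data);
};

DUMP row;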
Licensed under: CC-BY-SA with attribution