Question

I have a string in field 'product' in the following form:

 ";TT_RAV;44;22;" 

I want to split it first on ';' and then on '_', so that what is returned is

  "RAV" 

I know that I can do something like this:

    -- 'data' is a placeholder name for the relation that holds 'product'
    parse_1 = FOREACH data {
        splitup = STRSPLIT(product, ';', 3);
        GENERATE splitup.$1 AS depiction;
    };

This returns the string 'TT_RAV', and I can then do another split and project out 'RAV'. However, this seems like it would pass the data through multiple map jobs. Is it possible to parse out the desired field in one pass?

This example does NOT work, because the inner STRSPLIT returns a tuple, but it shows the logic:

    -- 'data' is a placeholder name for the relation that holds 'product'
    parse_1 = FOREACH data {
        splitup = STRSPLIT(STRSPLIT(product, ';', 3), '_', 1);
        GENERATE splitup.$1 AS depiction;
    };

Is it possible to do this in pure Pig Latin without multiple map phases?


Solution

Don't use STRSPLIT. You are looking for REGEX_EXTRACT:

    REGEX_EXTRACT(product, '_([^;]*);', 1) AS depiction

If it's important to be able to precisely pick out the second semicolon-delimited field and then the second underscore-delimited subfield, you can make your regex more complicated:

    REGEX_EXTRACT(product, '^[^;]*;[^_;]*_([^_;]*)', 1) AS depiction

Here's a breakdown of how that regex works:

    ^      // Start at the beginning
    [^;]*  // Match as many non-semicolons as possible, if any (first field)
    ;      // Match the semicolon; now we'll start the second field
    [^_;]* // Match any characters in the first subfield
    _      // Match the underscore; now we'll start the second subfield (what we want)
    (      // Start capturing!
    [^_;]* // Match any characters in the second subfield
    )      // End capturing
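
To show the expression in context, here is a minimal sketch of a complete script. The file name `products.txt` and the relation name `data` are assumptions made for illustration; the input is assumed to hold one product string per line with no tab characters, so the default loader reads each line as a single field.

```pig
-- Hypothetical input file: one string such as ;TT_RAV;44;22; per line
data = LOAD 'products.txt' AS (product:chararray);

-- A single FOREACH: the regex skips the first ';'-delimited field,
-- skips the part before '_' in the second field, and captures the rest
parse_1 = FOREACH data GENERATE
    REGEX_EXTRACT(product, '^[^;]*;[^_;]*_([^_;]*)', 1) AS depiction;

DUMP parse_1;
```

For the sample string from the question, this should print `(RAV)`. Rows that do not match the pattern come back with a null `depiction`.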

Other tips

Multiple map phases only occur when the script contains an operator that triggers a reduce (JOIN, GROUP, etc.). If you run EXPLAIN on the script, you can see whether there is more than one reduce phase.
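
As a rough illustration, the sketch below does the two-step split with two separate FOREACH statements and then asks Pig for the execution plan. The file name `products.txt` and the relation and field names (`data`, `step_1`, `step_2`, `f0`, `f1`, `rest`, `prefix`) are assumptions, not part of the original question.

```pig
-- Hypothetical input file: one string such as ;TT_RAV;44;22; per line
data = LOAD 'products.txt' AS (product:chararray);

-- First split on ';' (limit 3): empty leading field, 'TT_RAV', remainder
step_1 = FOREACH data GENERATE
    FLATTEN(STRSPLIT(product, ';', 3))
    AS (f0:chararray, f1:chararray, rest:chararray);

-- Second split on '_' (limit 2): 'TT' and 'RAV'
step_2 = FOREACH step_1 GENERATE
    FLATTEN(STRSPLIT(f1, '_', 2))
    AS (prefix:chararray, depiction:chararray);

result = FOREACH step_2 GENERATE depiction;

-- Prints the logical, physical, and MapReduce plans instead of running the job
EXPLAIN result;
```

Since nothing here triggers a reduce, the MapReduce plan printed by EXPLAIN would be expected to show a single map-only job, even though the script contains several FOREACH statements.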

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow