Question

I need to strip out the third and subsequent values in the 'bracketed' component of the user agent string.

In order to get

Mozilla/4.0 (compatible; MSIE 8.0)

from

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)

I can do this successfully with the following sed command:

 sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/'

I need to get the same result in Apache Pig with a Java regex. Could anybody help me rewrite the above sed regular expression as a Java regex?

Something like:

new = FOREACH userAgent GENERATE FLATTEN(EXTRACT(userAgent, 'JAVA REGEX?')) AS (term:chararray);

Solution

I don't use Pig, but a look through the docs reveals a REPLACE function which wraps Java's replaceAll() method. Try this:

REPLACE(userAgent, '\(([^;]+; [^;]+)[^)]*\)', '($1)')

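That should match the whole parenthesized portion of the user agent string and replace its contents with just the first two semicolon-separated terms, just like your sed command does. If Pig rejects or mangles the pattern, try doubling the backslashes ('\\(' and '\\)'): the Pig documentation for REPLACE says special characters in the regex need a double-backslash prefix.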
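Since REPLACE is described as a wrapper around replaceAll(), the expression can be checked in plain Java before trying it in Pig. The following is only a minimal sketch: the class name UserAgentTrim and the helper trimUserAgent are made up for illustration, and it says nothing about how Pig itself parses the string literal.

```java
import java.util.regex.Pattern;

public class UserAgentTrim {
    // Same expression as in the REPLACE call above. In Java source each
    // backslash is doubled so that the regex engine sees \( and \).
    private static final Pattern BRACKETED =
            Pattern.compile("\\(([^;]+; [^;]+)[^)]*\\)");

    // Hypothetical helper (not part of Pig or the JDK): keeps only the first
    // two semicolon-separated terms inside the parentheses.
    static String trimUserAgent(String userAgent) {
        // $1 refers to capturing group 1; the parentheses in the
        // replacement string are literal characters.
        return BRACKETED.matcher(userAgent).replaceAll("($1)");
    }

    public static void main(String[] args) {
        String ua = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6)";
        // prints: Mozilla/4.0 (compatible; MSIE 8.0)
        System.out.println(trimUserAgent(ua));
    }
}
```

A string with fewer than two terms in the parentheses does not match and is returned unchanged, which is the same behavior as the sed command.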

Other tips

In Java, you can use the Matcher class to extract the capturing group. The following appears to do what you want, at least for the test case you provided.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        String str = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; GTB6; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; WinTSI 06.12.2009; .NET CLR 3.0.30729; .NET4.0C)";
        // Group 1 captures everything up to, but not including, the second semicolon.
        Pattern pat = Pattern.compile("(.*?\\([^;]*;[^;]*);.*\\)");
        Matcher m = pat.matcher(str);
        // group() throws IllegalStateException unless a match attempt has succeeded.
        if (m.lookingAt()) {
            System.out.println(m.group(1) + ")");
        } else {
            System.out.println(str); // no match: leave the string unchanged
        }
    }
}

Hmm... I seem to have answered the wrong question, since you were asking how to do this in Pig rather than in plain Java.

Since neither of the two suggested solutions seems to work in Pig, I will post a workaround that pipes the data through sed using STREAM:

user_agent_mangled = STREAM logs THROUGH `sed 's/(\\([^;]\\+; [^;]\\+\\)[^)]*)/(\\1)/'`;

This works well; however, I would still prefer a native Pig solution (using the EXTRACT or REPLACE function).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow