Pergunta

I have a XML blob (as shown below) store in a hive log table.

<user>
    <uid>1424324325</uid>
    <attribs>
        <field>
        ...
        </field>
        <field>
            <name>first</name>
            <value>John</value>
        </field>
        <field>
        ...
        </field>
        <field>
            <name>last</name>
            <value>Doe</value>
        </field>
        <field>
        ...
        </field>
    </attribs>
</user>

Each row in the hive table would have information about different users and I want to extract the values of uid, first name and last name.

1424324325  John    Doe
1424435463  Jane    Smith

Extracting uid value is pretty simple. However, I am getting stumped trying to extract the first name & last name. The problem is about identifying the first name & last name pair and extracting the value.

I tried to extract the first name & last name as shown below but I got an error saying this is an invalid expression.

SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field/name/text()="first"/../value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field/name/text()="last"/../value/text()')) lns as ln;

I thought of using a hard-coded expression for the field node as shown below but the problem is that different records will have the first name and last name values in different locations.

LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[5]/value/text()')) fns as fn

When I try to extract first name as shown below, the result is empty.

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field/name/[text()="last"]/following-sibling::value[1]/text()')) fns as fn

How can I extract the information I want like below?

1424324325  John    Doe
1424435463  Jane    Smith

Thanks in advance.

Foi útil?

Solução

The following XPath should give you the correct result. Your syntax was incorrect (a predicate (i.e. everything in brackets) needs to be with an element, but you just did a child step using /).

/users/attribs/field[name = "first"]/value/string()

Furthermore, some points to improve your query:

  • You do not have to use text() when comparing node values, it will be done automatically and it is quite likely faster
  • You almost always want to use string() over text()
  • I rewrote the query, I think this is much simpler and cleaner, because it basically says "Give me the value node, which has a name node with the value first". Your attempt using following-sibling or the parent are also valid, I just think this one is simpler to read.
Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top