Question

I have an XML blob (as shown below) stored in a Hive log table.

<user>
    <uid>1424324325</uid>
    <attribs>
        <field>
        ...
        </field>
        <field>
            <name>first</name>
            <value>Joh,n</value>
        </field>
        <field>
        ...
        </field>
        <field>
            <name>last</name>
            <value>D,oe</value>
        </field>
        <field>
        ...
        </field>
    </attribs>
</user>

Each row in the Hive table holds information about a different user, and I want to extract the values of uid, first name and last name (removing any commas from within the names).

1424324325  John    Doe
1424435463  Jane    Smith

I am able to extract the values from the XML.

SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;

However, I am getting stumped trying to remove the unnecessary commas (if they exist) from within the first name & last name.

When I try to extract first name using any of the methods shown below, the result is empty.

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/replace(text(),",","")')) fns as fn

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/translate(text(),",","")')) fns as fn

When I try it as shown below, replace complains about invalid function whereas translate pulls the data without removing the extra commas.

LATERAL VIEW explode(xpath(logs['users_updates'], replace('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn

LATERAL VIEW explode(xpath(logs['users_updates'], translate('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn

How can I extract the information without the commas in the name values?

1424324325  John    Doe
1424435463  Jane    Smith

Final Solution: Here is the final working query after Jens's suggestion

SELECT uid, regexp_replace(fn,",","") as fname, regexp_replace(ln,",","") as lname
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;

Solution

There is no support for XPath 2.0 in Hive. This affects your question in two ways:

  • Function calls in axis steps are not allowed. While //value/translate(text(), ',', '') (which calls translate for each <value/> element) is valid XPath 2.0, you cannot do this in XPath 1.0. translate(//value, ',', ''), on the other hand, is valid XPath 1.0, but it first converts the node set to a single string, which in XPath 1.0 is the string value of the first <value/> element only. You get one string back instead of one result per element.
  • There is no replace function in XPath 1.0.
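
As a sketch of what XPath 1.0 does allow: translate can be the outermost call as long as the path selects a single field. This assumes Hive's xpath_string UDF, which returns the result of the expression as a string, and reuses the element names from the question's query. The inline XML literal in the first statement is made up for illustration, and a SELECT without FROM needs a Hive version that permits it.

```sql
-- Self-contained check on a literal (the sample XML is illustrative only).
-- XPath 1.0 translate() drops characters of its second argument that have
-- no counterpart in the third, so an empty third argument removes commas.
SELECT xpath_string(
  '<user><attribs><field><name>first_name</name><value>Joh,n</value></field></attribs></user>',
  'translate(/user/attribs/field[name="first_name"]/value, ",", "")'
) AS fname;

-- Same idea against the table from the question (table, column and key
-- names are taken from the question's query).
SELECT
  xpath_string(logs['users_updates'], '/user/uid') AS uid,
  xpath_string(logs['users_updates'],
    'translate(/user/attribs/field[name="first_name"]/value, ",", "")') AS fname,
  xpath_string(logs['users_updates'],
    'translate(/user/attribs/field[name="last_name"]/value, ",", "")') AS lname
FROM log_table;
```

This only fits when each predicate matches exactly one field per row. If several fields matched, only the first one would be used.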

It might be easier to just pass the comma-containing values through unchanged and do the string manipulation in Hive instead.
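
A minimal sketch of the Hive-side approach, assuming the table and column names from the question. The key point is that the string function wraps the extracted column, not the XPath expression string. Wrapping the path string, as in the question's last two attempts, can only strip commas from the path itself.

```sql
-- Quick check on literals: both built-ins operate on plain Hive strings.
-- With an empty third argument, translate() is expected to delete the
-- listed characters; regexp_replace() substitutes the empty string.
SELECT regexp_replace('Joh,n', ',', '') AS via_regexp,
       translate('D,oe', ',', '')       AS via_translate;

-- Applied to the columns produced by the LATERAL VIEWs from the question.
SELECT uid,
       translate(fn, ',', '') AS fname,
       translate(ln, ',', '') AS lname
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids AS uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns AS fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns AS ln;
```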

Additional note, although it is moot since you haven't got XPath 2.0 anyway: in XPath 2.0, translate expects a single string as its first argument. A sequence of several text nodes would have to be combined with string-join first.

Licensed under: CC-BY-SA with attribution