HiveQL & XPath - how to extract value and replace some characters
Question
I have an XML blob (as shown below) stored in a Hive log table.
<user>
<uid>1424324325</uid>
<attribs>
<field>
...
</field>
<field>
<name>first</name>
<value>Joh,n</value>
</field>
<field>
...
</field>
<field>
<name>last</name>
<value>D,oe</value>
</field>
<field>
...
</field>
</attribs>
</user>
Each row in the Hive table holds information about a different user, and I want to extract the values of uid, first name and last name (removing any commas from within the names).
1424324325 John Doe
1424435463 Jane Smith
I am able to extract the values from the XML.
SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;
However, I am getting stumped trying to remove the unnecessary commas (if they exist) from within the first name & last name.
When I try to extract first name using any of the methods shown below, the result is empty.
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/replace(text(),",","")')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/translate(text(),",","")')) fns as fn
When I try it as shown below, replace complains about invalid function whereas translate pulls the data without removing the extra commas.
LATERAL VIEW explode(xpath(logs['users_updates'], replace('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], translate('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn
How can I extract the information without the commas in the name values?
1424324325 John Doe
1424435463 Jane Smith
Final Solution: Here is the final working query after Jens's suggestion
SELECT uid, regexp_replace(fn,",","") as fname, regexp_replace(ln,",","") as lname
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;
Solution
There is no support for XPath 2.0 in Hive. This affects your question twice:
- Function calls in axis steps are not allowed. While //value/translate(text(), ',', '') (which calls translate for each <value/> element) is valid XPath 2.0, you cannot do this in XPath 1.0. translate(//value, ',', ''), on the other hand, works on a single string: XPath 1.0 converts the node set to the string value of its first <value/> element, so you do not get one result per element.
- There is no replace function in XPath 1.0.
It might be easier to just extract the values with their commas intact and do the string manipulation in Hive instead.
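A minimal sketch of that approach, assuming each row holds exactly one <user> element, so that xpath_string (which returns a single string) can stand in for the explode and LATERAL VIEW combination. The table, map key and field names are taken from the queries in the question; adjust them to your data.

```sql
-- Extract each value as a plain string, then strip commas on the Hive side.
-- Assumption: one <user> element per row; names follow the question's queries.
SELECT
  xpath_string(logs['users_updates'], '/user/uid') AS uid,
  regexp_replace(
    xpath_string(logs['users_updates'],
                 '/user/attribs/field[name = "first_name"]/value'),
    ',', '') AS fname,
  regexp_replace(
    xpath_string(logs['users_updates'],
                 '/user/attribs/field[name = "last_name"]/value'),
    ',', '') AS lname
FROM log_table;
```

Passing an empty string as the replacement removes the commas outright rather than swapping them for another character.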
Additional note, moot here as you haven't got XPath 2.0 anyway: translate only expects a single string as its first argument. You would need to string-join the values before passing them in.
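For what it is worth, XPath 1.0 does provide translate, and it can be used when it wraps the whole path expression instead of appearing as a step inside it. The following is only a sketch, assuming Hive's xpath_string UDF and an inline XML literal made up for illustration:

```sql
-- translate wraps the entire path, so it receives one string argument.
-- The XML literal is illustrative only; field names follow the question's queries.
SELECT xpath_string(
  '<user><attribs><field><name>first_name</name><value>Joh,n</value></field></attribs></user>',
  'translate(/user/attribs/field[name = "first_name"]/value, ",", "")'
) AS fname;
```

If the path matched several nodes, XPath 1.0 would use only the first one in document order, so this form only suits paths that select a single value per row.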