extract only certain tags in xml file using pig latin

https://stackoverflow.com/questions/20894951

apache-pig

23-09-2022

Question

I want to extract only the states from the below xml file.

<Table>
 <State>Florida</State>
 <id>123</id>
</Table>
<Table>
 <State>Texas</State>
 <id>456</id>
</Table>

Expected output :

(Florida)

(Texas)

But with the Pig statements below, I get the following output:

()

()

A = LOAD 'hdfs:/user.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Table') AS (x:chararray);

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
    '<Table>\\n\\s*<State>(.*)</State>\\n\\s*\\n\\s*</Table>'))
    AS (state:chararray);

Please help me understand where I have gone wrong, or how I can skip a particular tag line.

No correct solution

OTHER TIPS

That looks like a buggy regex: after the closing </State> you use \\n\\s*\\n\\s*</Table>, which does not account for the <id>...</id> element that sits between </State> and </Table>. Have you looked at using an XML parsing library in a UDF? It might be easier than trying to build a bunch of regexes by hand.
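As a rough illustration of the XML-parsing idea, here is a minimal sketch in plain Python using the standard library's xml.etree.ElementTree. The function name extract_state is hypothetical, and the sketch assumes each record handed over by XMLLoader is one complete <Table>...</Table> string. Wiring it into Pig as a UDF (registration, output schema) is not shown here and would need to follow the Pig UDF documentation.

```python
import xml.etree.ElementTree as ET


def extract_state(record):
    """Return the text of the <State> child of one <Table> record, or None.

    Hypothetical helper: assumes `record` is a single <Table>...</Table> string.
    """
    if record is None:
        return None
    try:
        root = ET.fromstring(record)
    except ET.ParseError:
        # Malformed record: return None instead of failing the whole job.
        return None
    state = root.find('State')
    if state is None or state.text is None:
        return None
    return state.text.strip()


sample = "<Table>\n <State>Florida</State>\n <id>123</id>\n</Table>"
print(extract_state(sample))  # prints: Florida
```

Because the parser looks the element up by name, the <id> line and any whitespace or line-ending differences no longer matter.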

EDIT: One other suggestion. Are you sure that the line separators in your file are just \n? You may have \r\n as the separator, in which case [\r\n]+ should help; see this post for more details.
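To make both regex points concrete, the sketch below uses Python's re module (not Pig itself) to compare the pattern from the question with a more tolerant one. The names STATE_IN_TABLE and state_from_record are hypothetical. The tolerant pattern uses \s* in place of a literal \n, which covers both \n and \r\n, and a lazy .*? to skip whatever sits between </State> and </Table>. These constructs also exist in Java regexes, which Pig's regex functions rely on, but the exact Pig statement (including the doubled backslashes) is left as an assumption to verify against your own data.

```python
import re

# Pattern from the question, written without Pig's doubled backslashes.
ORIGINAL = re.compile(r'<Table>\n\s*<State>(.*)</State>\n\s*\n\s*</Table>')

# More tolerant pattern: \s* absorbs \n or \r\n, and with DOTALL the lazy
# '.*?' skips the <id> element (and any line breaks) before </Table>.
STATE_IN_TABLE = re.compile(r'<Table>\s*<State>(.*?)</State>.*?</Table>', re.DOTALL)


def state_from_record(record):
    """Return the captured state name, or None if the record does not match."""
    match = STATE_IN_TABLE.search(record)
    return match.group(1) if match else None


unix_record = "<Table>\n <State>Florida</State>\n <id>123</id>\n</Table>"
windows_record = "<Table>\r\n <State>Texas</State>\r\n <id>456</id>\r\n</Table>"

print(ORIGINAL.search(unix_record))       # prints: None
print(state_from_record(unix_record))     # prints: Florida
print(state_from_record(windows_record))  # prints: Texas
```

The original pattern fails even with plain \n separators, because nothing in it can consume the <id> element.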

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow