Question

I'm looking into whether a Hadoop/Impala combination will meet my archiving, batch processing and real time ad hoc query requirements.

We will be persisting XML files (which are well formed and conform to our own XSD schema) into Hadoop and using MapReduce for end-of-day batch queries. For ad hoc user queries and application queries that need low latency and relatively high performance, we're considering Impala.

What I can't figure out is how Impala would understand the structure of the XML files so that it could query effectively. Can Impala be used to query across XML documents in a meaningful way?

Thanks in advance.

Solution

Neither Hive nor Impala really has a built-in way to treat XML files as tables (which is odd, considering the XML support in most databases).

That being said, if I were faced with this problem, I would use Pig to import the data into HCatalog. At that point, it's fully usable by Hive and Impala.

Here's a quick and dirty example of getting some XML data into HCatalog using Pig:

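--rss.pig
-- Typically run with HCatalog on the classpath: pig -useHCatalog rss.pig
-- The target table rss_items must already exist in HCatalog; HCatStorer does not create it.
-- In newer Hive releases the HCatalog classes live under org.apache.hive.hcatalog.pig.

-- XMLLoader ships in piggybank.
REGISTER piggybank.jar;

-- Load each <item>...</item> element as one chararray.
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull the individual fields out of each item with regular expressions.
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;

-- Write the extracted rows to the HCatalog table.
STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();

-- Read the table back to validate the load.
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
DUMP validate;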



--Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)



--Impala query
--(if Impala does not see the new table yet, run INVALIDATE METADATA first)

select * from rss_items;



--Impala results

    | link                                      | title       | description                      | pubdate
0   | http://www.hannonhill.com/news/item1.html | News Item 1 | Description of news item 1 here. | 03 Jun 2003 09:39:21
1   | http://www.hannonhill.com/news/item2.html | News Item 2 | Description of news item 2 here. | 30 May 2003 11:06:42
2   | http://www.hannonhill.com/news/item3.html | News Item 3 | Description of news item 3 here. | 20 May 2003 08:56:02



--rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>webmaster@hannonhill.com</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>

OTHER TIPS

It doesn't look like you'll have much luck with Impala and XML at the moment. Impala uses the Hive metastore, but doesn't support custom InputFormats and SerDes. You can see the formats they support natively here.

You can use Hive instead; the newer versions (0.12+) are supposed to be significantly faster.

An alternative approach would be to convert the XML files to Avro up front and use the Avro files to back the tables defined in Hive or Impala.
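
As a rough illustration of that convert-first idea, here is a minimal Python sketch that flattens the <item> elements of the RSS sample from the accepted answer into rows. To stay standard-library only, it writes tab-delimited text rather than Avro (an Avro writer would need a third-party library), so treat the output format as a stand-in. The names FIELDS, clean, items_to_rows and rows_to_tsv are made up for this example, and the column list is an assumption based on the rss_items table above.

```python
import xml.etree.ElementTree as ET

# Child elements to pull from each <item>; mirrors the rss_items columns (an assumption).
FIELDS = ("link", "title", "description", "pubDate")


def clean(value):
    """Collapse tabs, newlines and repeated spaces so a value fits in one delimited field."""
    return " ".join(value.split())


def items_to_rows(xml_text):
    """Yield one tuple per <item> element; missing children become empty strings."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield tuple(clean(item.findtext(name) or "") for name in FIELDS)


def rows_to_tsv(rows):
    """Serialize rows as tab-delimited text, one record per line."""
    return "".join("\t".join(row) + "\n" for row in rows)


SAMPLE = """<rss version="2.0"><channel>
  <title>News</title>
  <item>
    <title>News Item 1</title>
    <link>http://www.hannonhill.com/news/item1.html</link>
    <description>Description of news item 1 here.</description>
    <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
  </item>
</channel></rss>"""

if __name__ == "__main__":
    print(rows_to_tsv(items_to_rows(SAMPLE)), end="")
```

Files produced this way could be copied into HDFS and exposed through a table declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'. Using a real XML parser instead of regular expressions also means entities and attribute order are handled for you, at the cost of an extra conversion step before the data is queryable.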

Groovy's XmlSlurper can be used to parse the records in the XML files.

You may try XML SerDe for Hive here

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow