Can Impala query XML files stored in Hadoop/HDFS?

Question 1

Neither Hive nor Impala has a built-in mechanism for working with XML files (which is odd, considering that most databases support XML).

That being said, if I were faced with this problem, I would use Pig to import the data into HCatalog. At that point, it's fully usable by Hive and Impala.

Here's a quick and dirty example of getting some XML data into HCatalog using Pig:

--rss.pig

REGISTER piggybank.jar;

-- XMLLoader returns each <item>...</item> element as one chararray
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull the individual fields out of each item with regular expressions
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;

-- The rss_items table must already exist in HCatalog
STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;

--Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)

--Impala query

select * from rss_items;

--Impala results

    link                                       title        description                       pubdate
0   http://www.hannonhill.com/news/item1.html  News Item 1  Description of news item 1 here.  03 Jun 2003 09:39:21
1   http://www.hannonhill.com/news/item2.html  News Item 2  Description of news item 2 here.  30 May 2003 11:06:42
2   http://www.hannonhill.com/news/item3.html  News Item 3  Description of news item 3 here.  20 May 2003 08:56:02

--rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>webmaster@hannonhill.com</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>
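
If you want to check what the REGEX_EXTRACT patterns in rss.pig capture before running the job, the same patterns can be tried against one item in Python. This is only a rough sketch: Pig evaluates these as Java regular expressions, so Python's re module is a stand-in here, and the names ITEM, PATTERNS and extract_fields are made up for the illustration.

```python
import re

# One <item> element from rss.txt, roughly as XMLLoader('item') hands it to the script
ITEM = """<item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>"""

# Same patterns as in rss.pig; raw strings, so one backslash instead of Pig's two
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>",
}


def extract_fields(item):
    """Return the first capture group of each pattern, or None when it does not match."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, item)
        row[name] = match.group(1) if match else None
    return row


row = extract_fields(ITEM)
for name, value in row.items():
    print(name, "=", value)
```

Note that "." does not match a newline by default, so these patterns only work when a tag's content sits on a single line, as it does in rss.txt.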

Question 2

It doesn't look like you'll have much luck with Impala and XML at the moment. Impala uses the Hive metastore, but it does not support custom InputFormats or SerDes. The file formats it supports natively are listed in the Impala documentation.

You can use Hive instead; the newer versions (0.12+) are supposed to be significantly faster.

Question 3

An alternative approach would be to convert the XML files to Avro up front and use the Avro files to back the tables defined in Hive or Impala.

Groovy's XmlSlurper can be used to parse the records in the XML files.
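
The parsing half of that conversion might look something like the sketch below. It uses Python's standard xml.etree.ElementTree rather than XmlSlurper, purely to keep the example self-contained, and it stops at plain dict records; writing them out as Avro would need an Avro library and a schema, which are not shown. The field list and the items_to_records helper are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed-down RSS document in the same shape as the rss.txt sample
RSS = """<rss version="2.0">
   <channel>
      <title>News</title>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
      </item>
   </channel>
</rss>"""

# Child elements to keep from each <item>; these would become the record fields
FIELDS = ("link", "title", "description", "pubDate")


def items_to_records(xml_text):
    """Parse the document and return one dict per <item> element."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing
        records.append({field: item.findtext(field) for field in FIELDS})
    return records


records = items_to_records(RSS)
for record in records:
    print(record)
```

Each dict has one key per column, which is the flat shape an Avro-backed table would expect.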

Question 4

You could try an XML SerDe for Hive.