Question

I have a very simply formatted XML document that I would like to translate into TSV suitable for import into Hive. The format of the document is straightforward:

<root>
   <row>
      <ID>0</ID>
      <ParentID>0</ParentID>
      <Url></Url>
      <Title></Title>
      <Text></Text>
      <Username></Username>
      <Points>0</Points>
      <Type>0</Type>
      <Timestamp></Timestamp>
      <CommentCount>0</CommentCount>
   </row>
</root>

I have a working Ruby script that correctly translates a document formatted as above into TSV. That's here:

require "rubygems"
require "crack"

xml = Crack::XML.parse(File.read("sample.xml"))

xml['root']['row'].each{ |i|
  puts "#{i['ID']}      #{i['ParentID']}        #{i['Url']}     #{i['Title']}..." 
}

Unfortunately, the files I need to translate are substantially larger than this script can handle (> 1 GB).

Which is where Hadoop comes in. The simplest solution is probably to write a MapReduce job in Java, but that's not an option given that I lack Java skills. So I wanted to write a mapper script in either Python or Ruby, languages I am far from expert in but can at least navigate.

My plan then was to do the following:

  1. use StreamXmlRecordReader to parse the file record by record
  2. map the deserialization using crack
  3. reduce it with a simple regurgitation of the elements separated by tabs (roughly sketched below)
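
Just to illustrate the intent (this is only a sketch, not one of the scripts I actually ran), steps 2 and 3 would amount to something like the following, assuming StreamXmlRecordReader hands each complete <row>...</row> record to the mapper as a single chunk on stdin:

#!/usr/bin/env ruby

# Sketch only: assumes each line on stdin is one complete <row>...</row> record.
require 'rubygems'
require 'crack'

STDIN.each_line do |record|
  record.strip!
  next unless record.include?('<row')

  row = Crack::XML.parse(record)['row']   # a single record has no <root> wrapper
  next unless row

  # Remaining columns omitted for brevity; the real script would list all ten.
  puts [row['ID'], row['ParentID'], row['Url'], row['Title']].join("\t")
end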

This approach has failed consistently, however. I've used a variety of Ruby/Wukong scripts with no success. Here's one based off the article here:

#!/usr/bin/env ruby

require 'rubygems'
require 'crack'

xml = nil
STDIN.each_line do |line|
  puts line
  line.strip!

  if line.include?("<row")
    xml = Crack::XML.parse(line)
    xml['root']['row'].each{ |i|
      puts "#{i['ID']}      #{i['ParentID']}        #{i['Url']}..."     
  else
    puts 'no line'
  end

  if line.include?("</root>")
    puts 'EOF'
  end
end

This and other jobs fail as follows:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+737.jar -input /hackernews/Datasets/sample.xml -output out -mapper mapper.rb -inputreader "StreamXmlRecordReader,begin=<row,end=</row>"
packageJobJar: [/var/lib/hadoop-0.20/cache/sog/hadoop-unjar1519776523448982201/] [] /tmp/streamjob2858887307771024146.jar tmpDir=null
11/01/14 17:29:17 INFO mapred.FileInputFormat: Total input paths to process : 1
11/01/14 17:29:17 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-0.20/cache/sog/mapred/local]
11/01/14 17:29:17 INFO streaming.StreamJob: Running job: job_201101141647_0001
11/01/14 17:29:17 INFO streaming.StreamJob: To kill this job, run:
11/01/14 17:29:17 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201101141647_0001
11/01/14 17:29:17 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201101141647_0001
11/01/14 17:29:18 INFO streaming.StreamJob:  map 0%  reduce 0%
11/01/14 17:30:05 INFO streaming.StreamJob:  map 100%  reduce 100%
11/01/14 17:30:05 INFO streaming.StreamJob: To kill this job, run:
11/01/14 17:30:05 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201101141647_0001
11/01/14 17:30:05 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201101141647_0001
11/01/14 17:30:05 ERROR streaming.StreamJob: Job not Successful!
11/01/14 17:30:05 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

The first problem is that I can't tell where I'm failing: my script, or StreamXmlRecordReader.

The second problem is that I'm told by a gracious and helpful expert that because StreamXmlRecordReader doesn't produce an additional record delimiter, this approach probably isn't going to work, and that I'll need to read in single lines, grep for <row>, stack up everything until I hit </row>, and then parse it.
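
If I understand that suggestion correctly, the mapper would look roughly like the sketch below: buffer lines from <row> to </row>, then parse the buffered record with crack and print the fields tab-separated (the field list is just taken from the sample document above).

#!/usr/bin/env ruby

# Rough sketch of the suggested buffering approach.
require 'rubygems'
require 'crack'

FIELDS = %w[ID ParentID Url Title Text Username Points Type Timestamp CommentCount]

buffer = nil

STDIN.each_line do |line|
  line.strip!

  buffer = '' if line.include?('<row')   # start collecting a new record
  buffer << line if buffer               # keep collecting while inside a record

  if buffer && line.include?('</row>')
    row = Crack::XML.parse(buffer)['row']
    puts FIELDS.map { |f| row[f] }.join("\t") if row
    buffer = nil
  end
end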

Is this the simplest approach, and if so, how might I best accomplish it?

Performance isn't a huge issue, because these files are batch processed every few weeks or so, just in case that helps.

Solution

If you have this problem, the folks from Infochimps have solved it. Here's the necessary Wukong script:

http://thedatachef.blogspot.com/2011/01/processing-xml-records-with-hadoop-and.html
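
The linked post has the actual working script. For orientation only, a Wukong 1.x mapper generally follows roughly this shape (the class names and the yield-an-array convention are how I remember the Wukong 1.x API, so treat this as an outline rather than the script from the post):

#!/usr/bin/env ruby

# Outline only: the real, working script is in the post linked above.
require 'rubygems'
require 'wukong'
require 'crack'

class RowMapper < Wukong::Streamer::LineStreamer
  # Called once per input line. Assumes each line already holds a complete
  # <row>...</row> record (e.g. delivered that way by the input reader).
  # Arrays passed to yield are emitted joined by tabs.
  def process(line)
    return unless line.include?('<row')
    row = Crack::XML.parse(line)['row']
    yield [row['ID'], row['ParentID'], row['Url'], row['Title']] if row
  end
end

# Map-only job, so no reducer class is given.
Wukong::Script.new(RowMapper, nil).run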

OTHER TIPS

One common mistake is not having execute permissions on your script; try "chmod a+x mapper.rb".

Take a look at your jobtracker logs to find the specific error. You can also get that information from http://namenode:50030/jobtracker.jsp: click on the failed job, then on the "Failed" link under "Failed/Killed Task Attempts" for the map.

Also, when you run your streaming job, add "-verbose" to the options; that might give you some more information.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow