I am working on a project that computes PageRank and builds an inverted index over the Wikipedia dataset using Apache Hadoop. I downloaded the whole wiki dump, http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 , which decompresses to a single 42 GB .xml file. I want to process this file into data suitable as input for the PageRank and inverted-indexing MapReduce algorithms. Please help! Any leads will be helpful.

Solution

You need to write your own InputFormat to process XML. You also need to implement a RecordReader that makes sure each record read from an input split is a fully formed XML chunk and not just a single line. See http://www.undercloud.org/?p=408 .
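As a rough illustration of the scanning logic such a RecordReader needs, here is a minimal sketch in plain Python rather than Hadoop's Java API. It assumes each article in the dump is wrapped in literal `<page>` ... `</page>` tags; the function name `iter_page_chunks` and the `block_size` parameter are made up for this example. It reads fixed-size blocks and emits only complete chunks, even when a tag straddles a block boundary.

```python
import io

START_TAG = b"<page>"   # assumed opening delimiter of one article
END_TAG = b"</page>"    # assumed closing delimiter of one article

def iter_page_chunks(stream, block_size=1 << 20):
    """Yield each complete <page>...</page> byte chunk from a binary stream."""
    buf = b""
    while True:
        block = stream.read(block_size)
        if not block:
            break  # end of input; an unfinished trailing page is dropped
        buf += block
        while True:
            start = buf.find(START_TAG)
            if start == -1:
                # No start tag yet: keep only a tail that could hold
                # the beginning of a start tag split across blocks.
                buf = buf[-(len(START_TAG) - 1):]
                break
            end = buf.find(END_TAG, start)
            if end == -1:
                # Page is incomplete: discard text before it, read more.
                buf = buf[start:]
                break
            end += len(END_TAG)
            yield buf[start:end]
            buf = buf[end:]

# Tiny in-memory demo with a deliberately small block size so that
# tags are split across reads.
sample = (b"<mediawiki><page><title>A</title></page>\n"
          b"<page><title>B</title></page></mediawiki>")
chunks = list(iter_page_chunks(io.BytesIO(sample), block_size=7))
print(chunks)
```

A real Hadoop RecordReader would additionally have to start scanning at its split's offset and keep reading past the split's end until the current page closes, so that every page is emitted by exactly one split. For a local run, the same generator should work on the compressed dump via `bz2.open(path, "rb")`.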

Other tips

Your question is not very clear to me. What kind of ideas are you looking for?

The very first thing that is going to hit you is how to process this XML file in your MR job. The MR framework doesn't provide any built-in InputFormat for XML files. For this, you might want to have a look at this.
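Once pages arrive as whole XML records, a mapper still has to turn each one into something PageRank and an inverted index can consume. The following is a minimal sketch of that step in Python, under several assumptions: each record is a standalone `<page>` element with a `<title>` child and the wikitext under `<revision><text>`, outgoing links are written as `[[Target]]` or `[[Target|label]]`, and terms are simply lowercase word tokens. The helper name `parse_page` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Capture the link target: everything after "[[" up to "|", "#" or "]".
LINK_RE = re.compile(r"\[\[([^\]\[|#]+)")
WORD_RE = re.compile(r"\w+")

def parse_page(chunk):
    """Return (title, outgoing link targets, terms) for one <page> chunk."""
    page = ET.fromstring(chunk)
    title = page.findtext("title", default="")
    text = page.findtext("revision/text", default="") or ""
    # Deduplicated link targets: candidate edges for PageRank.
    links = sorted({m.strip() for m in LINK_RE.findall(text) if m.strip()})
    # Deduplicated lowercase tokens: candidate keys for an inverted index.
    terms = sorted(set(WORD_RE.findall(text.lower())))
    return title, links, terms

sample_page = (b"<page><title>A</title><revision><text>"
               b"See [[B|bee]] and [[C#sec]] and [[B]]."
               b"</text></revision></page>")
title, links, terms = parse_page(sample_page)
print(title, links, terms)
```

In a MapReduce job, the PageRank mapper could emit `(title, links)` and the indexing mapper `(term, title)` for each term. This sketch does not filter out links into namespaces such as `File:` or `Category:`, nor does it resolve redirects, both of which a real pipeline would probably want to handle.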

Licensed under: CC-BY-SA with attribution