Question

I will be doing a project on PageRank and inverted indexing of the Wikipedia dataset using Apache Hadoop. I downloaded the whole wiki dump - http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 - which decompresses to a single 42 GB .xml file. I want to somehow process this file into data suitable as input for the PageRank and inverted-indexing MapReduce algorithms. Please help! Any leads will be helpful.

Solution

You need to write your own InputFormat to process XML. You will also need to implement a RecordReader to ensure that each record your mappers receive is a fully formed XML chunk, not just a single line. See http://www.undercloud.org/?p=408 .

OTHER TIPS

Your question is not very clear to me. What kind of ideas do you need?

The very first thing that is going to hit you is how to process this XML file in your MR job. The MR framework doesn't provide any built-in InputFormat for XML files. For this you might want to have a look at this.
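If you would rather not write a custom InputFormat in Java, one option is Hadoop Streaming, whose built-in StreamXmlRecordReader can hand each mapper a complete element between a begin and end tag. A rough sketch of such an invocation (the jar path, HDFS paths, and mapper/reducer script names are placeholders for your setup, not from the original post):

```shell
# Sketch: use Hadoop Streaming's StreamXmlRecordReader so each mapper
# receives a whole <page>...</page> element as one record.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -inputreader "StreamXmlRecord,begin=<page>,end=</page>" \
  -input /wiki/enwiki-latest-pages-articles.xml \
  -output /wiki/pages-out \
  -mapper parse_page.py \
  -reducer build_index.py \
  -file parse_page.py -file build_index.py
```

StreamXmlRecordReader is simple but does a naive textual match on the begin/end strings, so for a 42 GB dump a purpose-built XmlInputFormat (such as the one in Apache Mahout) is usually the more robust choice.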

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow