How to improve the performance when working with wikipedia data and huge no. of webpages?

Question 1

Based on your question:

Number of Documents = 110000

Dictionary => List of [TermID, Title Terms] = 40million entries

Size of documents = 11000 * 1KB per document on an average =  26.9GB
(1KB per document on an average)

Size of dictionary = 40million * 256bytes = 9.5GB of raw data
(256bytes per entry on an average)

How did you arrive at the 75 days estimate?

There are number of performance targets:

How are you storing the Documents?
How are you storing/retrieving the Dictionary? ( assuming not all of it in memory unless you can afford to)
How many machines are you running it on?
Are you performing the dictionary lookups in parallel? ( of-course assuming dictionary is immutable once you have already processed whole of wikipedia )

Here is an outline of what I believe you are doing:

dictionary = read wikipedia dictionary
document = a sequence of documents
documents.map { doc =>
  var docTermFreq = Map[String, Int]()
  for(term <- doc.terms.map if(dictionary.contains(term)) ) {
     docTermFreq = docTermFreq + (term -> docTermFreq.getOrElse(term, 0) + 1)
  }
  // store docTermFreq map
}

What this is essentially doing is breaking up each document into tokens and then performing a lookup in wikipedia dictionary for its token's existence.

This is exactly what a Lucene Analyzer does.

A Lucene Tokenizer will convert document into tokens. This happens before the terms are indexed into lucene. So all you have to do is implement a Analyzer which can lookup the Wikipedia Dictionary, for whether or not a token is in dictionary.

I would do it like this:

Take every document and prepare a token stream ( using an Analyzer described above )
Index the document terms.
At this point you will have wikipedia terms only, in the Lucene Index.

When you do this, you will have ready-made statistics from the Lucene Index such as:

Document Frequency of a Term
TermFrequencyVector ( exactly what you need )
and a ready to use inverted index! ( for a quick introduction to Inverted Index and Retrieval )

There are lot of things you can do to improve the performance. For example:

Parallelize the document stream processing.
You can store the dictionary in key-value database such as BerkeylyDB or Kyoto Cabinet, or even an in-memory key-value storage such as Redis or Memcache.

I hope that helps.

Question 2

One of the ways that use only MR is to:

Assuming you already have N dictionaries of smaller size that fit to memory you can: Launch N "map only" jobs that will be scanning all your data (each one with only one dictionary) and output smth like {pageId, termId, occurence, etc} to folder /your_tmp_folder/N/ As a result you will have N*M files where M is amount of mappers on each stage(should be the same).

Then second job will simply analyze your {pageId, termId, occurence, etc} objects and build stats per page id.

Map only jobs should be very fast in your case. If not - please paste your code.