Question

I have the following code:

import collections

def index(self):
    """
    Build an index of the documents.
    """
    print "Indexing..."
    # ------------------------------------------------------------------
    # TODO: Create an inverted index.
    #       Granted this may not be a linked list as in a proper
    #       implementation.

    inv_index = collections.defaultdict(lambda: [])

    tam = len(self.docs)
    for word in self.vocab:
        for i in xrange(tam):
            if word in self.docs[i]:
                inv_index[word].append(i)

    print "End indexing..."
    # ------------------------------------------------------------------
    self.inv_index = inv_index

It indexes successfully, but it is too slow (around 20 minutes). How can I do it in less than 10 seconds?

  • self.vocab: a list of all distinct (stemmed) words
  • self.docs: a list of lists, where the i-th document is
    self.docs[i] => ['word1', 'word2', ..., 'wordN']

self.vocab holds the words of my own vocabulary, and for each word I need to index the numbers of the documents in which it appears.
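For example, with made-up document numbers, the index I want looks like this:

inv_index['word1'] => [0, 3, 7]    # 'word1' appears in documents 0, 3 and 7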

Solution

Use a dictionary and sets:

import collections

inv_index = collections.defaultdict(set)

vocabulary = set(self.vocab)
for i, document in enumerate(self.docs):
    # Set intersection picks out the vocabulary words in this document in one pass.
    in_document = vocabulary & set(document)
    for word in in_document:
        inv_index[word].add(i)
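If you need each posting list as a sorted list of document indices, as the original code produced, you can convert the sets once at the end; a minimal sketch:

self.inv_index = {word: sorted(doc_ids) for word, doc_ids in inv_index.items()}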

OTHER TIPS

I can see two issues with your implementation:

For each word in the vocabulary, you check whether that word is in each document and, if it is, append the document's index.

That's correct, but it means you basically read the whole document once per word. The time complexity is O(W x D x L), where W is the number of words, D is the number of documents, and L is the average length of a document.

We can assume that your vocabulary is composed of unique words (otherwise it wouldn't make any sense).

One improvement would be to create a set of all the vocabulary words. This can be done in O(W) amortized time.

Then, for every word in each document, you check whether it's in the set and, if it is, add the document's index to that word's entry. Both of these operations take O(1) per word.

In total, the algorithm now runs in O(W + D x L) time.

Also, if your docs can be compressed by removing duplicate words, you can speed up the process by the compression factor.
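Putting these ideas together, a minimal sketch (assuming self.vocab and self.docs as described in the question):

import collections

vocabulary = set(self.vocab)        # built once, O(W) amortized
inv_index = collections.defaultdict(list)
for i, doc in enumerate(self.docs):
    # set(doc) drops duplicate words, so each word is handled once per document
    for word in set(doc):
        if word in vocabulary:      # O(1) set membership test
            inv_index[word].append(i)
self.inv_index = inv_index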

You should definitely convert the elements of self.docs from lists to sets. The check if word in self.docs[i] is an O(n) operation for a list but O(1) for a set. By the way, you can initialize your defaultdict with list directly, i.e. defaultdict(list).
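For illustration, here is the question's loop with that one-time conversion added:

import collections

docs_as_sets = [set(doc) for doc in self.docs]   # one-time O(D x L) conversion

inv_index = collections.defaultdict(list)
for word in self.vocab:
    for i, doc in enumerate(docs_as_sets):
        if word in doc:    # O(1) on a set instead of O(L) on a list
            inv_index[word].append(i)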

This is basically what you are trying to do:

for word in vocabulary:
    for doc_index, doc in enumerate(documents):
        if word in doc:
            index[word].append(doc_index)

Let's say you have 1000 words in the vocab and 1000 docs. This means you will run if word in doc: 1000 * 1000 = 1,000,000 times. The word in doc test reads through the whole document, which is not cheap, especially if the document is large.

More straightforward logic:

for doc_index, doc in enumerate(documents):
    for word in doc:
        index[word].append(doc_index)

This way you eliminate the expensive word in doc test entirely.
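A runnable version of that logic, with set(doc) added so that a document's index is appended only once per word even when the word repeats in the document:

import collections

index = collections.defaultdict(list)
for doc_index, doc in enumerate(documents):
    for word in set(doc):
        index[word].append(doc_index)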

Some remarks on the line for word in doc: you'll need to tokenize your document to be able to iterate over its individual words. Think about something like splitting on whitespace, or, if you want a more robust solution, use the nltk tokenizer module (see http://text-processing.com/demo/tokenize/ for a demo). Example:

import nltk

sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
# tokens == ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#            'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
Licensed under: CC-BY-SA with attribution