If you want to use NLTK's functionality, you can use NLTK's ConcordanceIndex. In order to base the width of the display on the number of words instead of the number of characters (the latter being the default for ConcordanceIndex.print_concordance), you can simply create a subclass of ConcordanceIndex with something like this:
from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                # Clamp the left edge at 0 so matches near the start of
                # the corpus don't produce a negative slice index.
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts
Then you can obtain your results like this:
>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.' # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley') # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]
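Since create_concordance returns lists of tokens rather than formatted strings, you may want to render each context as a readable line yourself. Here is a minimal sketch (render_context is a hypothetical helper, not part of NLTK; a real detokenizer would handle punctuation spacing more robustly):

```python
def render_context(tokens):
    """Join tokens into a string, attaching common punctuation
    to the preceding word instead of inserting a space."""
    out = ""
    for tok in tokens:
        if tok in {",", ".", ";", ":", "!", "?"}:
            out += tok
        else:
            out += (" " if out else "") + tok
    return out

contexts = [
    ["gerenuk", "fled", "frantically", "across", "the", "vast",
     "valley", ",", "whereas", "the", "giraffe", "merely", "turned"],
    ["and", "clumsily", "loped", "away", "from", "the",
     "valley", "into", "the", "nearby", "ravine", "."],
]
for ctx in contexts:
    print(render_context(ctx))
```

This prints one line per occurrence, e.g. "gerenuk fled frantically across the vast valley, whereas the giraffe merely turned".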
The create_concordance method I created above is based upon NLTK's ConcordanceIndex.print_concordance method, which works like this:
>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
valley , whereas the giraffe merely turn
and clumsily loped away from the valley into the nearby ravine .
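If you would rather avoid the NLTK dependency altogether, the idea behind ConcordanceIndex is easy to sketch by hand: build a token-to-positions index once, then slice around each position. This is a standalone illustration of the technique, not NLTK's actual implementation (build_index and concordance are hypothetical names):

```python
from collections import defaultdict

def build_index(tokens):
    # Map each token to the list of positions where it occurs,
    # mirroring the lookup table a concordance index keeps internally.
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        index[tok].append(i)
    return index

def concordance(tokens, index, word, token_width=13):
    # For each occurrence, take up to token_width // 2 tokens on each side,
    # clamping the left edge at 0 (slicing already clamps the right edge).
    half = token_width // 2
    return [tokens[max(0, i - half):i + half + 1] for i in index[word]]

tokens = "the cat sat on the mat near the cat".split()
idx = build_index(tokens)
print(idx["cat"])                              # → [1, 8]
print(concordance(tokens, idx, "cat", token_width=4))
# → [['the', 'cat', 'sat', 'on'], ['near', 'the', 'cat']]
```

Building the index once makes repeated lookups cheap, which is the main advantage of ConcordanceIndex over rescanning the token list for every query.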