Question

In the NLTK introduction book they show how to use concordance to get the context around a given word. But I want something a little more complex. Can I get the text around a certain pattern? Something like this:

text.concordances(", [A-Za-z]+ , ")  # all words surrounded by spaces and commas


Solution

In short, nltk is not able to create a concordance from a regex in its present state. The difficulty with nltk's ConcordanceIndex class (or a subclass of it), which is what you are using, is that it accepts a list of tokens as its argument and is built around those tokens, rather than around the full text string. A regex that spans several tokens, such as one including the commas and spaces around a word, has nothing to match against.
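To make the token-versus-string point concrete, here is a small sketch using only the standard library. The tokenizer below is a hypothetical stand-in (an assumption, not NLTK's own tokenizer), and the sample sentence is just for illustration. It shows that a pattern spanning a comma, a space, and a word finds nothing when applied token by token, but works on the raw string, where match offsets give access to the surrounding text.

```python
import re

text = "Emma Woodhouse, handsome, clever, and rich"

# Hypothetical simple tokenizer (an assumption, standing in for NLTK's):
# words and punctuation marks become separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)

pattern = re.compile(r", [A-Za-z]+,")

# No single token contains a comma, a space, and a word together,
# so matching token by token finds nothing.
token_hits = [tok for tok in tokens if pattern.search(tok)]

# On the raw string the same pattern matches, and the match offsets
# point straight at the surrounding context.
string_hits = [(m.start(), m.group()) for m in pattern.finditer(text)]

print(tokens)
print(token_hits)
print(string_hits)
```

This is why working from the raw string, rather than from a token list, is probably the simpler route for regex-based concordances.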

I guess my suggestion would be to create your own class, which accepts a string as an argument instead of tokens. Here is a class loosely based on nltk's ConcordanceIndex class that might serve as a starting point:

import re


class RegExConcordanceIndex(object):
    "Class to mimic nltk's ConcordanceIndex.print_concordance."

    def __init__(self, text):
        self._text = text

    def print_concordance(self, regex, width=80, lines=25, demarcation=''):
        """
        Prints n <= @lines contexts for @regex with a context <= @width.
        Make @lines 0 to display all matches.
        Designate @demarcation to enclose matches in demarcating characters.
        """
        concordance = []
        # re.finditer returns an iterator, which is always truthy, so the
        # "no matches" case is decided from the collected results instead
        for match in re.finditer(regex, self._text, flags=re.M):
            start, end = match.start(), match.end()
            match_width = end - start
            # characters of context to show on each side of the match
            remaining = max((width - match_width) // 2, 0)
            left = max(start - remaining, 0)
            # cut the context short if it contains a newline character
            context_start = self._text[left:start].split('\n')[-1]
            context_end = self._text[end:end + remaining].split('\n')[0]
            concordance.append(context_start + demarcation +
                               self._text[start:end] + demarcation +
                               context_end)
            if lines and len(concordance) >= lines:
                break
        if concordance:
            print("Displaying %s matches:" % len(concordance))
            print('\n'.join(concordance))
        else:
            print("No matches")

Now you can test the class like this:

>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.raw(fileids='austen-emma.txt')
>>> comma_separated = RegExConcordanceIndex(emma)
>>> comma_separated.print_concordance(r"(?<=, )[A-Za-z]+(?=,)", demarcation='**')  # matches are enclosed in double asterisks

Displaying 25 matches:
Emma Woodhouse, **handsome**, clever, and rich, with a comfortab
Emma Woodhouse, handsome, **clever**, and rich, with a comfortable home
The real evils, **indeed**, of Emma's situation were the power 
o her many enjoyments.  The danger, **however**, was at present
well-informed, **useful**, gentle, knowing all the ways of the
well-informed, useful, **gentle**, knowing all the ways of the family,
a good-humoured, **pleasant**, excellent man, that he thoroughly 
"No, **papa**, nobody thought of your walking.  We 
"I believe it is very true, my dear, **indeed**," said Mr. Woodhouse,
should not like her so well as we do, **sir**,
e none for myself, papa; but I must, **indeed**,
met with him in Broadway Lane, **when**, because it began to drizzle,
like Mr. Elton, **papa**,--I must look about for a wife for hi
"With a great deal of pleasure, **sir**, at any time," said Mr. Knightley,
better thing.  Invite him to dinner, **Emma**, and help him to the best
y.  He had received a good education, **but**,
Miss Churchill, **however**, being of age, and with the full co
From the expense of the child, **however**, he was soon relieved.
It was most unlikely, **therefore**, that he should ever want his
 strong enough to affect one so dear, **and**, as he believed,
It was, **indeed**, a highly prized letter.  Mrs. Westo
and he had, **therefore**, earnestly tried to dissuade them 
Fortunately for him, **Highbury**, including Randalls in the same par
handsome, **rich**, nor married.  Miss Bates stood in th
a real, **honest**, old-fashioned Boarding-school, wher
Licensed under: CC-BY-SA with attribution