Question

Some background

I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and Natural Language Processing knowledge come only from teaching myself things through the internet. I've been working with this stuff for about a year, so I'm not helpless, but at various points I've had trouble moving forward in this project. Currently, I am entering the final phases of development, and have hit a little roadblock.

I need to implement some form of grammatical normalization, so that the output doesn't come out as unconjugated/uninflected caveman-speak. About a month ago some friendly folks on SO gave me advice on how I might solve this issue using an n-gram language model, basically -- but I'm looking for other solutions, as it seems that NLTK's NgramModel is not fit for my needs. (The possibilities of POS tagging were also mentioned, but my text may be too fragmentary and strange for such an implementation to come easily, given my amateurishness.)

Perhaps I need something like AtD, but hopefully less complex

I think I need something that works like After the Deadline or Queequeg, but neither of these seems exactly right. Queequeg is probably not a good fit -- it was written in 2003 for Unix, and I can't get it working on Windows for the life of me (I've tried everything). But I like that all it checks for is proper verb conjugation and number agreement.

On the other hand, AtD is much more rigorous, offering more capabilities than I need. But I can't seem to get the Python bindings for it working. (I get 502 errors from the AtD server, which I'm sure are easy to fix, but my application is going to be online, and I'd rather avoid depending on another server. I can't afford to run an AtD server myself, because the number of "services" my application will require of my web host is already threatening to cause problems in getting it hosted cheaply.)

Things I'd like to avoid

Building n-gram language models myself doesn't seem right for the task. My application throws out a lot of unknown vocabulary, which skews all the results. (Unless I use a corpus so large that it runs far too slowly for my application -- the application needs to be pretty snappy.)

Strict grammar checking isn't right for the task either. The grammar doesn't need to be perfect, and the sentences don't have to be any more sensible than the kind of English-like gibberish you can generate using n-grams. Even if it's gibberish, I just need to enforce verb conjugation and number agreement, and do things like remove extra articles.

In fact, I don't even need suggestions for corrections. I think all I need is something that tallies up how many errors seem to occur in each sentence in a group of possible sentences, so that I can sort by score and pick the one with the fewest grammatical issues.

A simple solution? Scoring fluency by detecting obvious errors

If a script exists that takes care of all this, I'd be overjoyed (I haven't found one yet). I can write code for what I can't find, of course; I'm looking for advice on how to optimize my approach.

Let's say we have a tiny bit of text already laid out:

existing_text = "The old river"

Now let's say my script needs to figure out which inflection of the verb "to bear" could come next. I'm open to suggestions about this routine. But I need help mostly with step #2, rating fluency by tallying grammatical errors:

  1. Use the Verb Conjugation methods in NodeBox Linguistics to come up with all conjugations of this verb: ['bear', 'bears', 'bearing', 'bore', 'borne'].
  2. Iterate over the possibilities, (shallowly) checking the grammar of the string resulting from existing_text + " " + possibility ("The old river bear", "The old river bears", etc). Tally the error count for each construction. In this case the only construction to raise an error, seemingly, would be "The old river bear".
  3. Wrapping up should be easy: of the possibilities with the lowest error count, select one at random. (A sketch of this routine follows the list.)
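
Here's a minimal sketch of what I have in mind, assuming the NodeBox Linguistics en module (method names per my reading of its docs) and a hypothetical count_errors() function, which is exactly the grammar-checking piece I'm asking about:

import random
import en  # NodeBox Linguistics

def count_errors(sentence):
    # Placeholder for the shallow grammar check I'm looking for.
    # Returns 0 so the sketch runs; a real checker goes here.
    return 0

def all_conjugations(infinitive):
    # Step #1: collect every surface form NodeBox can produce for the verb.
    forms = set()
    for tense in en.verb.tenses():
        try:
            forms.add(en.verb.conjugate(infinitive, tense))
        except Exception:
            pass  # not every verb has every tense
    return forms

def pick_continuation(existing_text, infinitive):
    # Step #2: score each candidate string by its tallied error count.
    candidates = [existing_text + " " + f for f in all_conjugations(infinitive)]
    scored = [(count_errors(c), c) for c in candidates]
    # Step #3: select randomly among the possibilities with the fewest errors.
    best = min(score for score, c in scored)
    return random.choice([c for score, c in scored if score == best])

print pick_continuation("The old river", "bear")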

Solution

Grammar Checking with Link Grammar

Intro to Link Grammar

Link Grammar, developed by Davy Temperley, Daniel Sleator, and John Lafferty, is a syntactic parser of English: "Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.)." You can read more about Link Grammar and interact with an online parser here.

A few years ago, AbiWord took over the project; they explain that AbiWord uses Link Grammar for grammar checking. I don't know the intricacies of how AbiWord actually performs its grammar checking, but I read about the basic approach in a Q&A about grammar checking (the link to which I have since lost). Unlike some other parsers I've worked with, Link Grammar produces a very different result when a sentence is not grammatically well-formed: it cannot find a complete linkage for grammatically improper sentences.

You can see for yourself with the online parser: entering the sentence "This is the man whose dog I bought" produces 1 linkage, whereas "This are the man whose dog I bought" produces no complete linkages.

This does not "count" the number of errors, as I asked for. However, it does fulfill the original request for a way to rule out grammatically implausible (i.e. improperly conjugated) possibilities.

Python bindings: they exist!

Link Grammar is written in C. This presented a problem when I was first researching, as I am only a year into Python coding and would be hard-pressed to create bindings myself. I was also worried about my process/service count, so I didn't want to run the Link Grammar program on top of my Python process. But a day or two after posting this question on Jan. 13th, I came across Jeff Elmore's (enzondio) contribution of pylinkgrammar to PyPI -- which had happened only a day prior.

As the pylinkgrammar page explains, you still have to build and install Link Grammar itself first. Instructions on how to use it are on that page. But here are some cautions about installing pylinkgrammar:

  1. I was not able to get pylinkgrammar working with Python 2.7 on Windows 7, which I think is due to issues getting CMake working with that combination.
  2. So I moved my entire project to Ubuntu (10.10), because I needed this that badly. But when I set up Ubuntu, I tried to install everything for Python 2.7 (I even removed 2.6). I still could not get pylinkgrammar working with Python 2.7, which I think was again due to issues between CMake and Python 2.7.
  3. I started over with my Ubuntu install because things had gotten messy, and instead set everything up with Python 2.6. I've now gotten pylinkgrammar working with Python 2.6. (But I do have to type from pylinkgrammar.linkgrammar import Parser, which differs slightly from the PyPI page's instructions.) A minimal usage sketch follows this list.
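
For reference, basic usage ends up looking something like this once everything is installed (a sketch based on my install; parse_sent() is the method shown in the pylinkgrammar documentation):

from pylinkgrammar.linkgrammar import Parser

parser = Parser()

def is_well_formed(sentence):
    # parse_sent() returns the complete linkages Link Grammar can find;
    # an empty result means the sentence is likely ungrammatical.
    return len(parser.parse_sent(sentence)) > 0

print is_well_formed("This is the man whose dog I bought")   # 1 linkage -> True
print is_well_formed("This are the man whose dog I bought")  # no linkages -> False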

NodeBox Linguistics: the other part of my solution

In my question I stated the need to generate all inflections/conjugations of a given sentence, in order to check all of these variations and eliminate grammatically implausible ones. (I'm using WordNet to change certain pieces of the user's input before outputting, and the WordNet results are uninflected; they need to be inflected to make the outputs more intelligible.)

A very informative blog post led me to the NodeBox Linguistics library, a set of tools with which "you can do grammar inflection and semantic operations on English content." Indeed, the library can be used to conjugate verbs and to singularize and pluralize nouns, among many other operations. This is just what I needed. My application knows which words in an input it has swapped out for new, uninflected language; these are the pieces it generates variations for, using the methods in NodeBox Linguistics.
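
The kinds of operations I rely on look roughly like this (method names per my reading of the NodeBox documentation; treat this as a sketch, and the exact outputs may differ):

import en  # NodeBox Linguistics

print en.verb.past("bear")                # e.g. 'bore'
print en.verb.past_participle("bear")     # e.g. 'borne'
print en.verb.present_participle("bear")  # e.g. 'bearing'
print en.verb.present("bear", person=3)   # e.g. 'bears'
print en.noun.plural("river")             # 'rivers'
print en.noun.singular("rivers")          # 'river'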

I feed these variations into pylinkgrammar and drop any variation for which no complete linkage can be found. Sometimes this produces no results at all, but more often than not it produces useful results. Note that Link Grammar will not find complete linkages for most incomplete sentences. If you want to check the conjugations in fragmented sentences like I do, try extending the fragments with filler before checking, then drop the filler before outputting. I get this "filler" by taking the last word of the fragment, looking it up in the Brown corpus, and appending the rest of that sentence from the corpus.
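
Put together, the filtering step looks roughly like this (a sketch: the Brown-corpus lookup is simplified here using NLTK, and a linear scan like this would be too slow in practice without an index):

from pylinkgrammar.linkgrammar import Parser
from nltk.corpus import brown

parser = Parser()

def filler_for(word):
    # Find a Brown sentence containing the word and return what follows it,
    # so a fragment like "The old river bears" can be padded into a sentence.
    for sent in brown.sents():
        if word in sent:
            i = sent.index(word)
            return " ".join(sent[i + 1:])
    return ""

def plausible_variations(variations):
    keep = []
    for v in variations:
        padded = (v + " " + filler_for(v.split()[-1])).strip()
        if len(parser.parse_sent(padded)) > 0:  # complete linkage found
            keep.append(v)  # keep (and later output) the fragment, not the padding
    return keep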

I don't have any statistics to report on how accurate this approach is, but it has worked for my (peculiar) purposes most of the time. I am still fleshing out this implementation, handling exceptional cases, and finding ways to clean up the input data. Hopefully this information helps someone else too! Please don't hesitate to ask for clarification.

Other Tips

Very cool project, first of all.

I found a Java grammar checker. I've never used it, but the docs claim it can run as a server. Both Java and listening on a port should be supported basically anywhere.

I'm just getting into NLP with a CS background, so I wouldn't mind going into more detail to help you integrate whatever you decide to use. Feel free to ask.

Another option is what's called an overgenerate-and-rank approach. First, have your poetry generator produce multiple candidate generations. Then use a service like Amazon's Mechanical Turk to collect human judgments of fluency. I would actually suggest collecting simultaneous judgments for a number of sentences generated from the same seed conditions. Lastly, extract features from the generated sentences (presumably using some form of syntactic parser) and train a model to rate or classify sentence quality. You could even throw in the heuristics listed above.
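
To make that concrete, the ranking step could look something like this sketch (the features and the training pairs here are toy stand-ins for real parser features and real Turker judgments; I'm assuming scikit-learn for the model):

from sklearn.linear_model import LinearRegression

def features(sentence):
    # Toy stand-in: real features would come from a syntactic parser
    # (linkage counts, parse scores, n-gram probabilities, and so on).
    words = sentence.split()
    return [len(words), sum(len(w) for w in words) / float(len(words))]

# (sentence, mean Mechanical Turk fluency judgment) -- illustrative values only
judged = [("The old river bears its name well", 4.2),
          ("The old river bear its name well", 1.3)]

model = LinearRegression()
model.fit([features(s) for s, _ in judged], [score for _, score in judged])

def rank(candidates):
    # Highest predicted fluency first.
    return sorted(candidates, key=lambda s: -model.predict([features(s)])[0])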

Michael Heilman uses this approach for question generation. For more details, read these papers: Good Question! Statistical Ranking for Question Generation and Rating Computer-Generated Questions with Mechanical Turk.

The pylinkgrammar link provided above is a bit out of date. It points to version 0.1.9, and the code samples for that version no longer work. If you go down this path, be sure to use the latest version, which can be found at:

https://pypi.python.org/pypi/pylinkgrammar

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow