Question

I'm creating a program which follows certain rules to result in a count of the words, syllables, and sentences in a given text file.

A sentence is a collection of words separated by whitespace that ends in a period (.), an exclamation mark (!), or a question mark (?). However, this is also a sentence:

Greetings, earthlings..

The way I've approached this program is to scan through the text file one character at a time using getchar(). I am prohibited from holding the entire text file in memory; it must be processed one character or word at a time.

Here is my dilemma: using getchar() I can find out what the current character is. I just keep calling getchar() in a loop until it returns EOF. But if the sentence has multiple periods at the end, it is still a single sentence, which means I need to know the character before the one I'm analyzing and the one after it. As far as I can tell, this would mean another getchar() call, but that would cause problems when I go to read the next character (it has now skipped a character).

Does anyone have a suggestion as to how I could determine that the above sentence is indeed a single sentence?

Thanks, and if you need clarification or anything else, let me know.

Solution

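You just need to implement a very simple state machine. Once you've found the end of a sentence, you remain in that state until you find the start of a new sentence (normally this would be a non-whitespace character other than a terminator such as ., ! or ?). Because the state itself records what came before, you never need to look back at the previous character or read ahead to the next one.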
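A minimal sketch of such a state machine in C follows. The names (struct counter, feed, count_sentences) are made up for illustration, and it assumes that text with no closing terminator does not count as a sentence; adjust that if your rules differ.

```c
#include <ctype.h>
#include <stdio.h>

/* The scanner is either inside a sentence or in the gap between two. */
enum state { IN_SENTENCE, BETWEEN_SENTENCES };

/* Hypothetical helper type: current state plus the running count. */
struct counter {
    enum state state;
    long sentences;
};

static int is_terminator(int c)
{
    return c == '.' || c == '!' || c == '?';
}

/* Feed one character to the machine. No look-ahead or look-behind needed. */
static void feed(struct counter *sc, int c)
{
    if (sc->state == IN_SENTENCE) {
        if (is_terminator(c)) {
            /* First terminator closes the sentence. */
            sc->sentences++;
            sc->state = BETWEEN_SENTENCES;
        }
    } else if (!is_terminator(c) && !isspace((unsigned char)c)) {
        /* Extra terminators and whitespace are skipped; anything else
           starts a new sentence. */
        sc->state = IN_SENTENCE;
    }
}

/* Count sentences in a stream, one character at a time. */
long count_sentences(FILE *in)
{
    /* Start between sentences so leading dots are not counted. */
    struct counter sc = { BETWEEN_SENTENCES, 0 };
    int c; /* int, not char, so EOF can be told apart from real data */

    while ((c = getc(in)) != EOF)
        feed(&sc, c);
    return sc.sentences;
}
```

Calling count_sentences(stdin) from main gives the same one-character-at-a-time loop as getchar(). For the input "Greetings, earthlings.." the first period ends the sentence and the second is absorbed while the machine waits for the next sentence to start, so the count is 1.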

OTHER TIPS

You need an extensible grammar. Look, for example, at regular expressions and try to build one.
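One way to read this advice: a pattern such as [^.!?[:space:]][^.!?]*[.!?]+ describes one sentence, and a regular expression like that corresponds to a small finite automaton. The sketch below writes that automaton as a transition table, so the rules can be changed by editing the table or the classifier. All names here (classify, step, the enum values) are hypothetical, and the table encodes only the simple rule from the question, not real-world punctuation such as abbreviations.

```c
#include <ctype.h>

/* Character classes the grammar distinguishes. */
enum cls { C_SPACE, C_TERM, C_OTHER, N_CLS };

/* States: in the gap between sentences, or inside sentence text. */
enum st { S_GAP, S_TEXT, N_ST };

/* One table cell: the next state and whether a sentence just ended. */
struct rule {
    enum st next;
    int ends_sentence;
};

/* Rows are states, columns are character classes. */
static const struct rule table[N_ST][N_CLS] = {
    /* S_GAP  */ { { S_GAP, 0 },  { S_GAP, 0 }, { S_TEXT, 0 } },
    /* S_TEXT */ { { S_TEXT, 0 }, { S_GAP, 1 }, { S_TEXT, 0 } },
};

/* Map a character to its class; extend this to add new terminators. */
static enum cls classify(int c)
{
    if (c == '.' || c == '!' || c == '?')
        return C_TERM;
    if (isspace((unsigned char)c))
        return C_SPACE;
    return C_OTHER;
}

/* Advance the automaton by one character, updating the sentence count. */
static enum st step(enum st s, int c, long *sentences)
{
    const struct rule *r = &table[s][classify(c)];

    *sentences += r->ends_sentence;
    return r->next;
}
```

Starting from S_GAP and calling step once per character read keeps memory use constant. Supporting a new rule, for example treating a closing quote after a terminator as part of the same sentence end, would mean adding a class or a state to the table without rewriting the loop.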

Generally, human language is diverse and not easily parseable, especially if you have colloquial speech or several different languages to analyze. In some languages it may not even be clear what the distinction between a word and a sentence is.

Licensed under: CC-BY-SA with attribution