To answer the first part, I will simply restate my old answer to the Super User question "How does Truecrypt know it has the correct password?":
It knows the correct password because within that encrypted container there is a known header.
When TrueCrypt decrypts a blob of data and the header matches what it was expecting, it reports back that the decryption was successful. If you use an incorrect password it will still "decrypt" the data, but the header will decrypt into gibberish and fail the decryption check.
If you look at the specification, you can see there are many things that must be true for it to be a valid header (bytes 64-67 after decryption should always be the ASCII value TRUE, bytes 132-251 must all be 0's, etc.). If you decrypt a blob of data and it does not match that header format, you know the decryption failed.
So they already do what you were suggesting about "checking the grammar": they attempt to decrypt the message, and if the message "has proper grammar" (the data follows the spec of the encrypted file format), the message was successfully decrypted.
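To make the header check concrete, here is a minimal sketch of the idea in Python. It is not TrueCrypt's actual code: the "cipher" is a toy XOR keystream derived with PBKDF2, and the names `make_header` and `try_password`, the iteration count, and the 512-byte layout with a 64-byte unencrypted salt are my assumptions for illustration. Only the two checks (TRUE at bytes 64-67, zeros at bytes 132-251) come from the answer above.

```python
import hashlib
import os

HEADER_SIZE = 512   # assumed total header size for this sketch
SALT_SIZE = 64      # assumed unencrypted salt stored at the front


def _keystream(password: bytes, salt: bytes) -> bytes:
    # Toy stand-in for a real cipher: stretch the password into a keystream.
    return hashlib.pbkdf2_hmac(
        "sha256", password, salt, 1000, dklen=HEADER_SIZE - SALT_SIZE
    )


def _xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, key))


def make_header(password: bytes) -> bytes:
    """Build an 'encrypted' header containing the known marker fields."""
    salt = os.urandom(SALT_SIZE)
    body = bytearray(os.urandom(HEADER_SIZE - SALT_SIZE))
    body[0:4] = b"TRUE"        # header bytes 64-67
    body[68:188] = bytes(120)  # header bytes 132-251, all zero
    return salt + _xor(bytes(body), _keystream(password, salt))


def try_password(header: bytes, password: bytes) -> bool:
    """'Decrypt' with the candidate password and check the known fields."""
    salt, encrypted = header[:SALT_SIZE], header[SALT_SIZE:]
    plain = salt + _xor(encrypted, _keystream(password, salt))
    return plain[64:68] == b"TRUE" and plain[132:252] == bytes(120)


header = make_header(b"correct horse")
print(try_password(header, b"correct horse"))
print(try_password(header, b"wrong guess"))
```

A wrong password still produces output of the right length, it just does not have the expected structure, which is exactly how the program can tell the difference.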
For your second part, "using a dictionary", there are a few important issues.
First, this would only work on plain, unformatted text; no binary data or text metadata would be allowed. Second, and more importantly, how do you "create" this dictionary? If you create the dictionary on the fly using the words in the document, tell me what the dictionary would be for the following message:
We attack tomorrow!
You could pad the dictionary with extra words, but how do you choose the padding? If you used an existing fixed dictionary, what do you do when a word is not in the dictionary? What about misspellings?
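As a rough sketch of the on-the-fly problem, assume the scheme replaces each word with its index in a word list built from the message itself. The functions `build_dictionary` and `encode` below are hypothetical names of mine, not part of any real scheme:

```python
import re


def build_dictionary(message: str) -> list[str]:
    """Build a sorted word list from the words that appear in the message."""
    words = re.findall(r"[a-z']+", message.lower())
    return sorted(set(words))


def encode(message: str, dictionary: list[str]) -> list[int]:
    """Replace each word with its position in the dictionary."""
    words = re.findall(r"[a-z']+", message.lower())
    return [dictionary.index(word) for word in words]


message = "We attack tomorrow!"
dictionary = build_dictionary(message)
print(dictionary)                   # ['attack', 'tomorrow', 'we']
print(encode(message, dictionary))  # [2, 0, 1]
```

The dictionary has three entries, so anyone who sees it has essentially read the message. Every possible "decryption" is just a reordering of those same three words.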
I have not even begun to touch on how likely this method is to leak information. Like you said, English has a set of rules for grammar, and some words come more often near the end of sentences while others come more often near the start. By looking at the numbers used as the indexes, you could potentially do a statistical analysis and rule out a portion of the dictionary as "unlikely" to be used.
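Here is a small sketch of the kind of leak I mean, assuming a fixed dictionary and an attacker who can observe or recover the index values. The five-word `FIXED_DICTIONARY` and the sample sentence are made up for illustration:

```python
import re
from collections import Counter

# Hypothetical fixed dictionary shared by both parties.
FIXED_DICTIONARY = ["and", "cat", "dog", "saw", "the"]


def encode(message: str, dictionary: list[str]) -> list[int]:
    """Replace each word with its position in the dictionary."""
    words = re.findall(r"[a-z']+", message.lower())
    return [dictionary.index(word) for word in words]


indexes = encode("The cat saw the dog and the dog saw the cat", FIXED_DICTIONARY)

# The same word always maps to the same index, so index frequencies
# mirror word frequencies in the underlying text.
index_counts = Counter(indexes)
most_common_index, count = index_counts.most_common(1)[0]
print(most_common_index, count)             # 4 4
print(FIXED_DICTIONARY[most_common_index])  # the
```

The most frequent index lines up with the most frequent word, which is the starting point for the same frequency analysis that breaks simple substitution ciphers.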
I am sure there are many other problems with this, but I am only a beginner in crypto and I cannot think of any others off the top of my head.
There is an adage in cryptography: "It is easy for you to create a cypher that you yourself cannot break; it is quite hard for you to make a cypher that other people cannot break."