You need to use a speech recognizer.
Create a language model directly from the book text. A model built this way strongly favors the word sequences that actually occur in the book, which makes recognition of the book being read very accurate, both for the original reading and for the user's reading.
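To make the idea of a book-specific language model concrete, here is a minimal sketch that estimates bigram probabilities from raw text in plain Python. It is only an illustration: the function names `tokenize` and `build_bigram_model` are made up for this example, there is no smoothing, and a real recognizer would normally load a model produced by a dedicated language-modeling toolkit rather than a Python dictionary.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_bigram_model(text):
    """Count word bigrams and turn the counts into conditional probabilities.

    Returns a dict: previous word -> {next word: P(next | previous)}.
    No smoothing is applied, so unseen bigrams simply have no entry.
    """
    words = ["<s>"] + tokenize(text) + ["</s>"]
    counts = defaultdict(Counter)
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1

    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


# Toy stand-in for the full book text (hypothetical data).
book_text = "The cat sat on the mat. The cat ran."
model = build_bigram_model(book_text)
print(model["cat"])
```

Because the model only knows word pairs from the book, anything the reader says that follows the text gets a high probability, while off-text word sequences are unlikely.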
Use this language model to recognize the original recording of the book and assign timestamps to the words, or use a more advanced algorithm to perform text-to-audio alignment.
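One simple way to attach timestamps to the book words is to align the recognizer's output with the book text and copy the times over wherever the words agree. The sketch below uses Python's standard `difflib.SequenceMatcher` for that. It assumes the recognizer gives you a list of `(word, start_sec, end_sec)` tuples; that shape, the function name `align_timestamps`, and the sample data are assumptions for illustration, not the output format of any particular recognizer.

```python
from difflib import SequenceMatcher


def align_timestamps(book_words, recognized):
    """Map book word indices to (start, end) times using matching word runs.

    `recognized` is a list of (word, start_sec, end_sec) tuples
    (a hypothetical shape for the recognizer output).
    Book words with no matching recognized word get no entry.
    """
    hyp_words = [word for word, _, _ in recognized]
    matcher = SequenceMatcher(a=book_words, b=hyp_words, autojunk=False)

    timestamps = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            _, start, end = recognized[block.b + offset]
            timestamps[block.a + offset] = (start, end)
    return timestamps


# Toy data: the recognizer misheard "on" as "in".
book_words = ["the", "cat", "sat", "on", "the", "mat"]
recognized = [
    ("the", 0.0, 0.2), ("cat", 0.2, 0.5), ("sat", 0.5, 0.9),
    ("in", 0.9, 1.0), ("the", 1.0, 1.2), ("mat", 1.2, 1.6),
]
timestamps = align_timestamps(book_words, recognized)
print(timestamps)
```

Words that were misrecognized end up without a timestamp; their times can be interpolated from the neighbouring words if you need one for every word.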
Recognize the user's speech with the same book-specific language model and use the recognized text to display the current position in the book.
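Turning the recognized text into a position can be as simple as sliding the last few recognized words over the book and picking the window that agrees best. This is a rough sketch under that assumption; `find_position` and its `start_hint` parameter are hypothetical names, and a real application would probably search only near the last known position and tolerate inserted or dropped words.

```python
def find_position(book_words, spoken_words, start_hint=0):
    """Return the book index just after the best match for spoken_words.

    Slides a window of len(spoken_words) over the book, starting at
    start_hint, and scores each window by the number of words that agree.
    Returns None when no window shares a single word with the input.
    """
    n = len(spoken_words)
    best_index, best_score = None, 0
    for i in range(start_hint, len(book_words) - n + 1):
        window = book_words[i:i + n]
        score = sum(1 for a, b in zip(window, spoken_words) if a == b)
        if score > best_score:
            best_index, best_score = i, score
    if best_index is None:
        return None
    return best_index + n


# Toy data standing in for the book and for the recognizer output.
book_words = "it was the best of times it was the worst of times".split()
spoken = ["was", "the", "worst", "of"]
position = find_position(book_words, spoken)
print(position)
```

The returned index points at the next word the reader is expected to say, which is the position to highlight in the displayed text.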
You can use CMUSphinx for all of these tasks.