Question

I am trying to build a program that finds which page or sentence of a book is being read into a microphone. I have the book's text and an audio recording of it. The user will start reading from a random page, and the program is supposed to sync to the user and show the section of the book being read. It might seem like a useless program, but please bear with me.

Would an approach similar to Shazam-like programs work? I am not sure how effective those algorithms are for speech. Also, the speaker will be different and might have an accent and a different reading speed.

Another approach would be to convert the speech to text and search for that text in the book. The problem is that the book's language is a rare one for which no language model is available. In addition, the script does not use Latin characters, which makes programming difficult (for me at least).

Are there any solutions that anyone can recommend? Would extracting features from the audio file and comparing them with features extracted in real time from the microphone work? If so, which features?

Is there any implementation or code that I can start with? Any language is OK, but I prefer C.

Solution

You need to use a speech recognizer.

  1. Create a language model directly from the book text. Because the recognizer then only has to choose among word sequences that occur in the book, recognition of the reading becomes very accurate, both for the original recording and for the user's reading.

  2. Use this language model to recognize the book's audio and assign timestamps to the words, or use a more advanced algorithm to perform text-to-audio alignment.

  3. Recognize the user's speech with the book-specific language model and use the recognized text to display the position in the book.
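For step 3, the recognizer output still has to be mapped back to a place in the book. Here is a minimal sketch of one way to do that, assuming the recognizer hands you a list of words that may contain a few errors. The names `build_index` and `locate` are made up for this example and are not part of any recognizer. It is in Python for brevity, but the same idea ports to C with a hash table.

```python
from collections import Counter, defaultdict


def build_index(book_words, n=3):
    """Map every run of n consecutive book words to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(book_words) - n + 1):
        index[tuple(book_words[i:i + n])].append(i)
    return index


def locate(recognized_words, index, n=3):
    """Return the book position of the first recognized word, or None if nothing matches."""
    votes = Counter()
    for j in range(len(recognized_words) - n + 1):
        gram = tuple(recognized_words[j:j + n])
        for pos in index.get(gram, ()):
            # A match at book position pos for the j-th recognized word
            # means the utterance would have started at pos - j.
            votes[pos - j] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data (assumption: words are already tokenized the same way on both sides).
book = "the quick brown fox jumps over the lazy dog while the quick red fox sleeps".split()
index = build_index(book)

# "a" stands in for a misrecognized word; the other n-grams still agree.
heard = "fox jumps over a lazy dog while the quick red".split()
offset = locate(heard, index)
print(offset, book[offset:offset + 3])
```

Voting on the start offset means a few misrecognized words only remove a few votes instead of breaking the match. If you did step 2, the word position can then be turned into a timestamp or page number with a plain lookup table.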

You can use CMUSphinx for the mentioned tasks.
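Before training the language model, the book text has to be turned into a plain training corpus. As far as I know, the CMUSphinx language-model tools take one sentence per line wrapped in `<s>` and `</s>`, so check the tutorial for your version. Below is a rough sketch of that preparation step which also works for non-Latin scripts, since it relies on Unicode character categories rather than ASCII ranges. `normalize`, `corpus_lines` and the `terminators` default are assumptions for illustration; set `terminators` to whatever ends a sentence in your script.

```python
import unicodedata


def normalize(sentence):
    """Keep letters, combining marks and digits; turn everything else into spaces."""
    chars = []
    for ch in sentence:
        category = unicodedata.category(ch)
        chars.append(ch if category[0] in "LMN" else " ")
    # lower() is a no-op for scripts without case; split/join collapses whitespace.
    return " ".join("".join(chars).lower().split())


def corpus_lines(text, terminators=".!?"):
    """Yield one '<s> ... </s>' line per sentence of the book text."""
    current = []
    for ch in text:
        if ch in terminators:
            words = normalize("".join(current))
            if words:
                yield f"<s> {words} </s>"
            current = []
        else:
            current.append(ch)
    # Flush trailing text that has no final terminator.
    words = normalize("".join(current))
    if words:
        yield f"<s> {words} </s>"


sample = "Hello, world! How are you?"
lines = list(corpus_lines(sample))
for line in lines:
    print(line)
```

Use the same `normalize` on the recognizer output before searching the book, so both sides are tokenized identically.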

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow