Speech to text in C#

Question 1

This cannot be done without involving linguistic resources. Let me explain what I mean by this.

As you may have noticed, your C# program only recognizes pre-recorded phrases, and only if you say exactly the same words. (As an aside, this is quite an achievement in itself, because you can hardly say a sentence twice without altering it a bit. Small changes, e.g. in sound frequency or length, might not matter to your colleagues, but they matter to your program.)

Therefore, you need to incorporate some kind of linguistic resource into your program. In other words, make it "understand" facts about human language. Two suggestions of increasing complexity follow. Both approaches assume that your tool is capable of tokenizing an audio input stream in a sensible way, i.e. extracting words from it.

Pattern matching

To avoid hard-coding sentences like

Tell me about the weather.
What's the weather tomorrow?
Weather report!

you can instead define a pattern that matches any of those sentences:

if a sentence contains "weather", then output a weather report

This can be further refined in many ways, e.g.:

if a sentence contains "weather" and "tomorrow", output tomorrow's forecast.
if a sentence contains "weather" and "Bristol", output a forecast for Bristol.

This kind of knowledge must be put into your program explicitly, for instance in the form of a dictionary or lookup table.
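The rules above could be sketched in C# roughly as follows. This is only a minimal illustration: the class name `WeatherRules`, the reply strings and the list of separator characters are assumptions made for the example, and the input is assumed to be text already produced by your recognizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rule-based matcher; names and replies are illustrative only.
public static class WeatherRules
{
    // Each rule lists the keywords that must ALL be present, plus the reply.
    // Rules are ordered from most specific to most general.
    private static readonly (string[] Keywords, string Reply)[] Rules =
    {
        (new[] { "weather", "tomorrow" }, "tomorrow's forecast"),
        (new[] { "weather", "bristol" }, "forecast for Bristol"),
        (new[] { "weather" }, "weather report"),
    };

    public static string Match(string sentence)
    {
        // Lower-case the input and split it on spaces and basic punctuation.
        var words = new HashSet<string>(
            sentence.ToLowerInvariant().Split(
                new[] { ' ', '.', ',', '!', '?' },
                StringSplitOptions.RemoveEmptyEntries));

        foreach (var (keywords, reply) in Rules)
        {
            if (keywords.All(k => words.Contains(k)))
                return reply;
        }
        return "no match";
    }
}
```

With these rules, `WeatherRules.Match("Weather report!")` falls through to the general rule, while `WeatherRules.Match("What's the weather tomorrow?")` hits the more specific "tomorrow" rule first. Keeping the rules in a table like this means new patterns can be added without touching the matching logic.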

Measuring Similarity

If you plan to spend more time on this, you could implement a means of measuring the similarity between input sentences. There are many approaches to this as well, but a prominent one is the bag-of-words model, in which a sentence is represented as a vector.

In this model, each sentence is represented as a vector in which every word of the vocabulary is one dimension. For example, the sentence "I hate green apples" could be represented as

I = 1
hate = 1
green = 1
apples = 1
red = 0
you = 0

Note that words that do not occur in this particular sentence, but do occur in other phrases the program is likely to encounter, also represent dimensions (for example, red = 0).

The big advantage of this approach is that the similarity of two vectors can be computed easily, no matter how many dimensions they have. There are several techniques for estimating similarity; one of them is cosine similarity (see for example http://en.wikipedia.org/wiki/Cosine_similarity).
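As a rough sketch of how this could look in C#, the following turns a sentence into a word-count vector over a fixed vocabulary and compares two such vectors with cosine similarity. The class name `BagOfWords`, the six-word vocabulary and the simple punctuation-based splitting are assumptions for illustration; both vectors are assumed to be built from the same vocabulary and therefore to have the same length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; a sketch of the bag-of-words idea, not a full implementation.
public static class BagOfWords
{
    // Counts how often each vocabulary word occurs in the sentence.
    // The vocabulary is expected to be lower-case.
    public static double[] ToVector(string sentence, IReadOnlyList<string> vocabulary)
    {
        var words = sentence.ToLowerInvariant().Split(
            new[] { ' ', '.', ',', '!', '?' },
            StringSplitOptions.RemoveEmptyEntries);
        return vocabulary.Select(v => (double)words.Count(w => w == v)).ToArray();
    }

    // cos(a, b) = (a . b) / (|a| * |b|); returns 0 if either vector is all zeros.
    public static double Cosine(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

With the vocabulary `{ "i", "hate", "green", "apples", "red", "you" }`, the sentences "I hate green apples" and "I hate red apples" share three of their four words, so their cosine similarity comes out as 3 / (2 * 2) = 0.75. Sentences with no vocabulary words in common score 0.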

On a more general note, there are of course many other considerations.

For example, some words might be utterly irrelevant to the message you want to convey, as in the following sentence:

I want you to output a weather report.

Here, at least "I", "you", "to" and "a" could be done away with without damaging the basic semantics of the sentence. Such words are called stop words and are discarded early in many tools that perform speech-to-text analysis.
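Filtering stop words could be sketched like this. The class name `StopWords` and the four-word list are assumptions taken from the example sentence; real stop-word lists are much longer and language-specific.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the stop-word list is deliberately tiny.
public static class StopWords
{
    private static readonly HashSet<string> Stop =
        new HashSet<string> { "i", "you", "to", "a" };

    // Returns the lower-cased words of the sentence with stop words removed.
    public static string[] Remove(string sentence)
    {
        return sentence.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !Stop.Contains(w))
            .ToArray();
    }
}
```

For "I want you to output a weather report." this leaves the words want, output, weather and report, which is enough for keyword rules or similarity measures to work with.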

Also note that we started out assuming that your program reliably identifies sound input. In reality, no tool is capable of infallibly identifying speech.

Humans tend to forget that sound actually comes without cues as to where word or sentence boundaries are. This makes so-called disambiguation of input a gargantuan task that is easily underestimated, and ambiguity is one of the hardest problems of computational linguistics in general.

Question 2

The code won't be able to judge that by itself. You need to split the command into an array of words, such as

Tomorrow
Weather
What

This way, you can compare each word with the text that is stored in your program: say, the command (what), the type (weather) and the time (tomorrow).
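A minimal sketch of that idea in C#, assuming the recognizer hands you the command as a string. The class name `CommandParser`, the slot names and the small lookup table are hypothetical and only serve the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parser; the lookup table stands in for "the text in your computer".
public static class CommandParser
{
    // Maps each known word to the slot it fills.
    private static readonly Dictionary<string, string> Slots =
        new Dictionary<string, string>
        {
            { "what", "command" },
            { "weather", "type" },
            { "tomorrow", "time" },
            { "today", "time" },
        };

    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();

        // Split on spaces, punctuation and apostrophes, so "what's" yields "what".
        var words = text.ToLowerInvariant().Split(
            new[] { ' ', '.', ',', '!', '?', '\'' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (var word in words)
        {
            // Keep the first word found for each slot; ignore unknown words.
            if (Slots.TryGetValue(word, out var slot) && !result.ContainsKey(slot))
                result[slot] = word;
        }
        return result;
    }
}
```

Calling `CommandParser.Parse("What's the weather tomorrow?")` fills all three slots (command = what, type = weather, time = tomorrow), and your program can then decide what to do based on which slots were filled.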

It is better to read and understand each word than to guess. It will work like Google: Google does the same, breaking the string down and comparing the pieces.