Building Voice Over IP from zero [closed]

Question 1

I don't think I'd do this. The simple fact is that collecting two seconds (or even one second) of voice data before you transmit loses you quite a lot and gains you nothing.

There are a lot of places you can simplify your protocol compared to normal open protocols. A typical protocol has all sorts of options for multiple transmission rates, presence detection, NAT traversal, multiple codecs, etc. These make a normal voice chat program relatively complex. By eliminating the majority of them and just pre-selecting one set of options, you can simplify your code quite a bit.

Sending packets every few milliseconds, however, is not difficult. Sending packets every few seconds instead isn't going to make your code any simpler. If anything, it's likely to make the code more complex, because you'll have to deal with storing quite a bit more data. In a typical case, you're dealing with only a few kilobytes of data at a time, so storage is almost completely a non-issue. If you store a lot of data before transmitting, storage becomes a more substantial problem (though, in fairness, it still won't be terribly difficult).

Personally, I think I'd still use some standard codecs and such so the code and protocol would be easy (or easier, anyway) to expand out to something more complete if you decide to do that. For example, if I wanted to keep things as simple as possible, I'd probably start by using the G.711 codec. Even that supports two forms of compression (mu-law and A-law), so I'd probably choose one of those (probably A-law) and just use it.

Using that, the actual codec (the compression/expansion code) should be well under 100 lines of code (probably closer to 50 lines, depending somewhat on how you prefer to format your code). If you want, you can download the ITU's reference implementation, which is distributed as part of G.191 (Note: G.191 also includes code for a number of other codecs).
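To give a feel for how small the codec is, here is a rough sketch of A-law compression and expansion in Python. It follows the widely circulated table-search approach rather than the G.191 reference code itself, and the function names (linear_to_alaw, alaw_to_linear, encode_frame, decode_frame) are my own, not part of any standard. It assumes signed 16-bit linear PCM input.

```python
# Segment end points for the 13-bit magnitude used by A-law.
SEG_END = (0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF)


def linear_to_alaw(sample):
    """Compress one signed 16-bit PCM sample to one A-law byte."""
    sample >>= 3                      # keep the 13 most significant bits
    if sample >= 0:
        mask = 0xD5                   # 0x55 toggles alternate bits, 0x80 marks non-negative
    else:
        mask = 0x55
        sample = -sample - 1
    # Find the first segment whose end point covers the magnitude.
    seg = next((i for i, end in enumerate(SEG_END) if sample <= end), 8)
    if seg >= 8:                      # out of range: clip to the largest code
        return 0x7F ^ mask
    shift = 1 if seg < 2 else seg
    return ((seg << 4) | ((sample >> shift) & 0x0F)) ^ mask


def alaw_to_linear(code):
    """Expand one A-law byte back to a signed 16-bit PCM sample."""
    code ^= 0x55
    value = (code & 0x0F) << 4        # 4-bit quantization step
    seg = (code & 0x70) >> 4          # 3-bit segment number
    if seg == 0:
        value += 8
    else:
        value += 0x108
        if seg > 1:
            value <<= seg - 1
    return value if code & 0x80 else -value


def encode_frame(samples):
    """Encode an iterable of 16-bit samples to A-law bytes (one byte each)."""
    return bytes(linear_to_alaw(s) for s in samples)


def decode_frame(data):
    """Decode A-law bytes back to a list of 16-bit samples."""
    return [alaw_to_linear(b) for b in data]
```

Each 16-bit sample comes out as one byte, so the payload is half the size of the raw audio. Decoding is lossy: decode_frame(encode_frame(x)) returns values close to x, not identical ones.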

That gives you some degree of compression almost for free. About equally important, it means you'll structure your code to have a place to call the coder to encode the data before you send it, and decode data after you receive it. If you ever decide to enhance the code, you end up choosing a different encoder/decoder, not trying to add one where none existed before (in which case, you're a lot more likely to need a complete rewrite).

G.711 itself encodes one sample at a time (8 kHz sampling, one byte per encoded sample), so the buffer size is a packetization choice rather than something the codec dictates. Sizes such as 40, 80, 160 and 320 samples (5, 10, 20 and 40 ms of audio) all work. If you don't care much about latency, 320 samples would be the obvious choice. Using that, you read 320 samples from your input (microphone), send them to the compressor, put the result into a UDP packet, and ship it over the wire. Repeat as needed. You probably want to include a sequence number in the UDP packet, so the receiving end can play back packets in order. Again, I'd probably follow a standard. RTP is trivial enough that it probably adds only another few dozen lines of code or so (maybe even less than that).
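As a rough illustration of how little RTP adds, here is a sketch that packs the fixed 12-byte RTP header in front of an encoded payload. The function name build_rtp_packet and the example SSRC value are made up for illustration. Payload type 8 is the static assignment for G.711 A-law (PCMA). The sketch assumes no padding, no header extension and no CSRC entries.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE_PCMA = 8        # static RTP payload type for G.711 A-law
SAMPLES_PER_PACKET = 320     # 40 ms of audio at 8 kHz


def build_rtp_packet(seq, timestamp, ssrc, payload,
                     payload_type=PAYLOAD_TYPE_PCMA):
    """Return the fixed 12-byte RTP header followed by the payload."""
    byte0 = RTP_VERSION << 6             # V=2, P=0, X=0, CC=0
    byte1 = payload_type & 0x7F          # marker bit left clear
    header = struct.pack(
        "!BBHII",                        # network byte order
        byte0,
        byte1,
        seq & 0xFFFF,                    # 16-bit sequence number, wraps around
        timestamp & 0xFFFFFFFF,          # 32-bit timestamp in sample units
        ssrc & 0xFFFFFFFF,               # 32-bit stream identifier
    )
    return header + payload


# Example: two consecutive packets of 320 encoded bytes each.
ssrc = 0x12345678                        # hypothetical; normally chosen at random
payload = bytes([0xD5]) * SAMPLES_PER_PACKET
first = build_rtp_packet(0, 0, ssrc, payload)
second = build_rtp_packet(1, SAMPLES_PER_PACKET, ssrc, payload)
```

For each packet, the sequence number goes up by one and the timestamp goes up by the number of samples in the packet. Sending is then a matter of calling sendto on a UDP socket (socket.SOCK_DGRAM) with the returned bytes.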

To simplify as much as possible, the receiving code might initially ignore the entire RTP header, and just receive a packet, decode the payload, play it back, and repeat. Later, when/if you find that packet loss and reordering are a problem, you can add code to look at the sequence number and/or timestamp, and act accordingly.
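When you do get around to looking at the header, the receiving side might start with something like the sketch below. The names parse_rtp and is_newer are my own, and the parser assumes the sender does not use RTP header extensions. It splits off the sequence number and timestamp, and compares sequence numbers in a way that survives the 16-bit wraparound.

```python
import struct

RTP_HEADER_LEN = 12


def parse_rtp(packet):
    """Split an RTP packet into (sequence number, timestamp, payload).

    Assumes no header extension; CSRC entries, if any, are skipped.
    """
    if len(packet) < RTP_HEADER_LEN:
        raise ValueError("packet too short to be RTP")
    byte0, _byte1, seq, timestamp, _ssrc = struct.unpack(
        "!BBHII", packet[:RTP_HEADER_LEN])
    csrc_count = byte0 & 0x0F            # each CSRC entry is 4 bytes
    payload = packet[RTP_HEADER_LEN + 4 * csrc_count:]
    return seq, timestamp, payload


def is_newer(seq, last_seq):
    """True if seq comes after last_seq, allowing for 16-bit wraparound."""
    return 0 < ((seq - last_seq) & 0xFFFF) < 0x8000
```

The simplest policy is to play a packet only if is_newer(seq, last_played_seq) holds and drop anything that arrives late. A proper jitter buffer can come later.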

The big point here is that waiting 2 seconds (or whatever) isn't going to make your code simpler. If anything, working with a fixed (and fairly small) number of samples at a time is likely to make the code simpler. You can pre-allocate a couple of buffers of the size you care about, and just use them, instead of dealing with dynamic allocation as you'd probably do for buffering a couple of seconds of data at a time.
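To make the pre-allocation point concrete, here is a small sketch that reads fixed-size frames into two buffers that are allocated once and then reused in turn. The name read_frames is made up, and io.BytesIO stands in for the microphone, since the real capture API depends on your platform. It assumes 16-bit samples and a blocking source.

```python
import io

SAMPLES_PER_FRAME = 320
BYTES_PER_SAMPLE = 2                                  # 16-bit linear PCM (assumption)
FRAME_BYTES = SAMPLES_PER_FRAME * BYTES_PER_SAMPLE    # 640 bytes per frame


def read_frames(source):
    """Yield full frames from source, alternating between two fixed buffers.

    The caller must finish with a frame before asking for the one after
    next, because each buffer is overwritten on reuse.
    """
    buffers = [bytearray(FRAME_BYTES), bytearray(FRAME_BYTES)]
    index = 0
    while True:
        buf = buffers[index]
        count = source.readinto(buf)
        if not count or count < FRAME_BYTES:          # end of input or partial frame
            return
        yield buf
        index ^= 1                                    # switch to the other buffer


# Stand-in for a capture device: three full frames plus a partial one.
fake_microphone = io.BytesIO(bytes(FRAME_BYTES * 3 + 100))
frame_count = sum(1 for _ in read_frames(fake_microphone))
```

For comparison, two seconds of the same audio (8000 samples per second, 2 bytes each) is 32,000 bytes, against 640 bytes for one 320-sample frame.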

Question 2

If you wait 2 seconds before transmitting audio, it won't feel like a conversation. People get annoyed once the one-way delay goes over roughly 100 ms, maybe 200 ms.

Secondly, VoIP applications are usually meant for talking to others. Until your application achieves world dominance, it's probably a good idea to follow one of the established VoIP protocols (H.323 or SIP), so you can talk to others. Just a thought.