Question

When I was studying the Cocoa Audio Queue documentation, I came across several audio codec terms. They are defined in a structure named AudioStreamBasicDescription.

Here are the terms:
1. Sample rate
2. Packet
3. Frame
4. Channel

I know about sample rate and channel, but I was confused by the other two. What do the other two terms mean?

You can also answer this question with an example. For example, I have a dual-channel 16-bit PCM source with a sample rate of 44.1 kHz, which means there are 2 × 44100 = 88200 samples of PCM data per second. But what about packet and frame?

Thank you in advance!


Solution

You are already familiar with the sample rate definition. The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T, where T is the sampling period. So for a sampling rate of 44100 Hz, you have 44100 samples per second (per audio channel).

The number of frames per second in video is a similar concept to the number of samples per second in audio: frames for our eyes, samples for our ears.

If you have 16-bit stereo PCM, it means you have 16 × 44100 × 2 = 1,411,200 bits per second, which is about 172 kB per second, or around 10 MB per minute.
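To make that arithmetic concrete, here is a small Swift sketch (the function name and parameters are made up for illustration, not part of any API):

```swift
// Illustrative helper: raw data rate of uncompressed PCM audio.
func pcmBytesPerSecond(sampleRate: Int, channels: Int, bitsPerSample: Int) -> Int {
    // bits per second = bit depth * sample rate * channel count
    let bitsPerSecond = bitsPerSample * sampleRate * channels
    return bitsPerSecond / 8
}

let rate = pcmBytesPerSecond(sampleRate: 44_100, channels: 2, bitsPerSample: 16)
print(rate)        // 176400 bytes per second (~172 kB)
print(rate * 60)   // 10584000 bytes per minute (~10 MB)
```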

Here are the definitions, in reworded terms, from Apple:

Sample: a single number representing the value of one audio channel at one point in time.
Frame: a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.
Packet: a group of one or more frames, representing the audio format's smallest encoding unit and the audio for all channels across a short span of time.

As you can see, there is a subtle difference between the audio and video notions of a frame. In one second of stereo audio at 44.1 kHz, you have 88200 samples and thus 44100 frames.
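Here is a sketch of how those numbers land in the struct you asked about, for 16-bit stereo linear PCM at 44.1 kHz. The field names come from the real Core Audio struct; the flag choice assumes one common layout (packed, interleaved, signed integer):

```swift
import AudioToolbox

// 16-bit stereo linear PCM at 44.1 kHz:
// one frame = one sample per channel = 2 channels * 2 bytes = 4 bytes.
var pcmDescription = AudioStreamBasicDescription(
    mSampleRate: 44_100,                 // samples per second, per channel
    mFormatID: kAudioFormatLinearPCM,
    mFormatFlags: kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked,
    mBytesPerPacket: 4,                  // same as bytes per frame: 1 frame per packet
    mFramesPerPacket: 1,                 // uncompressed audio: always 1
    mBytesPerFrame: 4,                   // 2 channels * 16 bits / 8
    mChannelsPerFrame: 2,
    mBitsPerChannel: 16,
    mReserved: 0
)
```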

Compressed formats like MP3 and AAC pack multiple frames into packets (these packets can then be written into an MP4 file, for example, where they can be efficiently interleaved with video content). Encoding larger groups of frames at once lets the codec find patterns across samples, which improves coding efficiency.

MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all.
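As a quick back-of-the-envelope sketch, this is how much time one such packet covers, assuming a 44.1 kHz stream:

```swift
// One MP3 packet holds 1152 frames; at 44.1 kHz that is roughly 26 ms of audio.
let framesPerPacket = 1152.0
let sampleRate = 44_100.0
let packetDuration = framesPerPacket / sampleRate
print(packetDuration)   // ~0.0261 seconds per packet
```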

For AAC, you can have 1024 (or 960) frames per packet. This is described in the Apple documentation you pointed to:

The number of frames in a packet of audio data. For uncompressed audio, the value is 1. For variable bit-rate formats, the value is a larger fixed number, such as 1024 for AAC. For formats with a variable number of frames per packet, such as Ogg Vorbis, set this field to 0.
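Following that quoted rule, a plausible AudioStreamBasicDescription for AAC might look like this sketch (the zeroed fields reflect that byte sizes are not fixed for compressed, variable bit-rate formats):

```swift
import AudioToolbox

// AAC at 44.1 kHz: a fixed 1024 frames per packet, but packet byte size varies.
var aacDescription = AudioStreamBasicDescription(
    mSampleRate: 44_100,
    mFormatID: kAudioFormatMPEG4AAC,
    mFormatFlags: 0,
    mBytesPerPacket: 0,        // 0 = variable packet size
    mFramesPerPacket: 1024,    // fixed for AAC, per the quote above
    mBytesPerFrame: 0,         // undefined for compressed formats
    mChannelsPerFrame: 2,
    mBitsPerChannel: 0,        // undefined for compressed formats
    mReserved: 0
)
```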

In MPEG-based file formats, a packet is referred to as a data frame (not to be confused with the audio frame notion above). See Brad's comment for more information on the subject.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow