Audio Compression Techniques
by Rusty Nejdl


      Many different compression techniques exist for various forms of data.  Video compression is simpler because many pixels are repeated in groups.  Techniques for still pictures include horizontal repeated-pixel compression (PCX format), data conversion (GIF format), and fractal path repeated pixels.  For motion video, compression is relatively easy because large portions of the screen don't change between frames; therefore, only the changes between images need to be stored.  Text compression is extremely simple compared to video and audio.  One method counts the probability of each character and then assigns shorter bit codes to the most common characters and longer bit codes to the least common characters.
      However, digital samples of audio data have proven to be very difficult to compress; these techniques do not work well at all for audio data.  The data change often, and no values are common enough to save sufficient space.  Currently, five methods are used to compress audio data with varying degrees of complexity, compressed audio quality, and amount of data compression.
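      The character-frequency method described for text can be sketched as a Huffman-style code-length calculation.  This is a minimal illustration, not a complete compressor; the function name huffman_code_lengths and the sample string are assumptions made for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {character: code length in bits} for a Huffman-style code."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Each heap entry: (total count, tie-breaker, {character: depth so far}).
    heap = [(count, i, {ch: 0}) for i, (ch, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)   # two least frequent groups
        c2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

sample_text = "aaaaaaaabbbc"              # hypothetical input
lengths = huffman_code_lengths(sample_text)
total_bits = sum(lengths[ch] for ch in sample_text)
print(lengths, total_bits, "bits instead of", 8 * len(sample_text))
```

      Common characters end up with short codes and rare ones with long codes.  Audio samples rarely show such a skewed distribution, which is why this approach saves little space on raw audio.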
 

Sampling Basics

     The digital representation of audio data offers many advantages: high noise immunity, stability, and reproducibility.  Audio in digital form also allows for efficient implementation of many audio processing functions on a computer.
     Converting audio from analog to digital begins by sampling the audio input at regular, discrete intervals of time and quantizing the sampled values into a discrete number of evenly spaced levels.  According to the Nyquist theorem, a time-sampled signal can faithfully represent frequencies up to half the sampling rate.  Frequencies above that threshold alias to lower frequencies and appear as distortion in the sampled signal.
     The sampling frequencies in use today range from 8 kHz for basic speech to 48 kHz for commercial DAT machines.  The number of quantizer levels is typically a power of 2 to make full use of a fixed number of bits per audio sample.  The typical range for bits per sample is between 8 and 16 bits.  This allows for a range of 256 to 65,536 levels of quantization per sample.  With each additional bit of quantizer resolution, the signal-to-noise ratio increases by roughly 6 decibels (dB).  Thus, the dynamic range capability of these representations is 48 dB and 96 dB, respectively.
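     The half-sampling-rate limit can be checked numerically.  The sketch below assumes an 8 kHz sampling rate and a 5 kHz test tone (both chosen only for illustration) and shows that the samples of the 5 kHz tone match those of a 3 kHz tone up to a sign flip, so the two cannot be told apart after sampling.

```python
import math

def sample_tone(freq_hz, rate_hz, count):
    """Sample a unit-amplitude sine of freq_hz at rate_hz, returning count values."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

rate = 8000                               # assumed sampling rate; the limit is 4000 Hz
too_high = sample_tone(5000, rate, 16)    # 5 kHz lies above the limit
alias = sample_tone(3000, rate, 16)       # 8000 - 5000 = 3000 Hz

# The 5 kHz samples equal the negated 3 kHz samples, apart from rounding error.
max_diff = max(abs(a + b) for a, b in zip(too_high, alias))
print(f"largest difference: {max_diff:.2e}")
```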
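     The level counts and dynamic range figures above follow from simple arithmetic.  A short sketch, with quantizer_stats as a name invented for this example:

```python
import math

def quantizer_stats(bits):
    """Return (number of levels, approximate dynamic range in dB) for a bit depth."""
    levels = 2 ** bits
    dynamic_range_db = 20 * math.log10(levels)   # about 6.02 dB per bit
    return levels, dynamic_range_db

for bits in (8, 16):
    levels, db = quantizer_stats(bits)
    print(f"{bits} bits: {levels} levels, about {db:.1f} dB")
```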
     The data rates associated with uncompressed digital audio are substantial.  For audio data on a CD, for example, which is sampled at 44.1 kHz with 16 bits per channel for two channels, about 1.4 megabits per second are processed.  A clear need exists for some form of compression to enable the more efficient storage and transmission of digital audio data.
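     The CD figure can be reproduced directly from the sampling parameters.  The helper name pcm_bit_rate is an assumption for this example.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bits per second for uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels

cd_rate = pcm_bit_rate(44_100, 16, 2)
megabytes_per_minute = cd_rate * 60 / 8 / 1_000_000
print(cd_rate, "bits per second")
print(round(megabytes_per_minute, 1), "megabytes per minute")
```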
 

Voc File Compression

     The simplest compression technique simply removes silence from the sample.  Creative Labs introduced this form of compression with its Sound Blaster line of sound cards.  This method analyzes the whole sample and then replaces each stretch of silence with byte codes that record its length.  It is very similar to run-length coding.
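     The idea of coding silence as a run length can be sketched as follows.  This is not the actual VOC block layout; the block tuples, the threshold, and the minimum run length are all assumptions chosen for illustration, and the input is taken to be unsigned 8-bit samples centered on 128.

```python
SILENCE_THRESHOLD = 2   # assumed: |sample - 128| at or below this counts as silence
MIN_RUN = 4             # assumed: shorter quiet runs are not worth a marker

def pack_silence(samples):
    """Replace long quiet runs with ("silence", length); keep the rest as data."""
    blocks = []
    i = 0
    while i < len(samples):
        run = 0
        while (i + run < len(samples)
               and abs(samples[i + run] - 128) <= SILENCE_THRESHOLD):
            run += 1
        if run >= MIN_RUN:
            blocks.append(("silence", run))
            i += run
        else:
            take = max(run, 1)   # a short quiet run or one loud sample stays as data
            if blocks and blocks[-1][0] == "data":
                blocks[-1][1].extend(samples[i:i + take])
            else:
                blocks.append(("data", list(samples[i:i + take])))
            i += take
    return blocks

def unpack_silence(blocks):
    """Rebuild the sample stream, filling silence with the midpoint value 128."""
    out = []
    for kind, payload in blocks:
        out.extend([128] * payload if kind == "silence" else payload)
    return out

samples = [128, 128, 129, 127, 128, 200, 50, 128, 128, 90]
blocks = pack_silence(samples)
print(blocks)
```

     Note that the quiet samples are rebuilt as exact silence, so small variations inside a silent run are lost.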
 

Linear Predictive Coding and Code Excited Linear Predictor

     Linear Predictive Coding (LPC) was an early development in audio compression that was used primarily for speech.  An LPC encoder compares speech to an analytical model of the vocal tract, then throws away the speech and stores the parameters of the best-fit model.  The output quality was poor and was often compared to computer speech, so LPC is not used much today.
     A later development, Code Excited Linear Predictor (CELP), increased the complexity of the speech model and, as faster computers became available, produced much better results.  Sound quality improved, while the compression ratio increased.  The algorithm compares speech with an analytical model of the vocal tract and computes the errors between the original speech and the model.  It transmits both the model parameters and a very compressed representation of the errors.
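     The "best-fit model" step can be illustrated with a small linear predictor.  The sketch below fits predictor coefficients to one frame using autocorrelation and the Levinson-Durbin recursion, a common way to do this; it is not a full LPC or CELP codec, and the test frame (a plain sine standing in for voiced speech) and the order of 2 are assumptions for the example.

```python
import math

def autocorrelation(x, max_lag):
    """r[k] = sum of x[n] * x[n - k] over the frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Find a[1..order] so that x[n] is approximated by sum(a[k] * x[n - k]).

    Returns the coefficients and the remaining prediction-error energy.
    """
    a = [0.0] * (order + 1)        # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error

frame = [math.sin(0.3 * n) for n in range(200)]   # stand-in for a speech frame
r = autocorrelation(frame, 2)
coeffs, residual_energy = levinson_durbin(r, 2)
print(coeffs, residual_energy / r[0])
```

     An LPC coder would store only the coefficients (and a little excitation information), while a CELP coder would also send a compact code for the residual that the model fails to predict.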
 

Mu-law and A-law compression

     Logarithmic compression is a good method because it matches the way the human ear works.  It mostly loses information that the ear would not hear anyway, and gives good quality results for both speech and music.  Although the compression ratio is not very high, it requires very little processing power to achieve.  It is the international standard telephony encoding format, defined by ITU (formerly CCITT) Recommendation G.711.  Mu-law is commonly used in North America and Japan for ISDN 8 kHz sampled, voice grade, digital telephone service; A-law is the variant used in Europe and most other regions.
     It packs each 16-bit sample into 8 bits by using a logarithmic table to encode a 13-bit dynamic range, dropping the least significant 3 bits of precision.  The quantization levels are spaced unevenly instead of linearly to mimic the roughly logarithmic way that the human ear perceives loudness.  Unlike linear quantization, the logarithmic step spacings represent low-amplitude samples with greater accuracy than higher-amplitude samples.  This method is fast and compresses data into half the size of the original sample.  It is used quite widely because it has been adopted almost universally.
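     The uneven step spacing can be seen with the continuous mu-law companding formula.  This sketch works on samples scaled to the range -1.0 to 1.0 and uses a plain 256-level quantizer; it is not the bit-exact table used by telephone equipment, and the function names are assumptions for the example.

```python
import math

MU = 255   # companding constant used by mu-law

def mulaw_compress(x):
    """Map a sample in [-1.0, 1.0] onto a logarithmic scale in [-1.0, 1.0]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def encode_8bit(x):
    """Quantize the companded value to one of 256 codes (0..255)."""
    return round((mulaw_compress(x) + 1) / 2 * 255)

def decode_8bit(code):
    return mulaw_expand(code / 255 * 2 - 1)

for x in (0.01, 0.9):
    restored = decode_8bit(encode_8bit(x))
    print(f"{x}: code {encode_8bit(x)}, error {abs(restored - x):.5f}")
```

     The quiet sample is reproduced with a much smaller error than the loud one, which is the trade the logarithmic spacing makes.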
 

Adaptive Differential Pulse Code Modulation (ADPCM)

     The Interactive Multimedia Association (IMA) is a consortium of computer hardware and software vendors cooperating to develop a standard for multimedia data.  Their goal was to select a public-domain audio compression algorithm that is able to provide a good compression ratio while maintaining good audio quality.  In addition, the coding had to be simple enough to enable software-only decoding of 44.1 kHz samples on a 20 MHz, 386-class computer.
     ADPCM is a simple conversion based on the assumption that the changes between samples will not be very large.  The first sample value is stored in its entirety, and each successive value describes the amount, within +/- 8 levels, by which the wave changes, which uses only 4 bits instead of 16.  Therefore, a 4:1 compression ratio is achieved, with less loss as the sampling frequency increases.  At 44.1 kHz, the compressed signal is an accurate representation of the uncompressed sample and is difficult to distinguish from the original.  This method is used widely today because of its simplicity, wide acceptance, and high level of compression.
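     The difference-coding idea can be sketched as below.  This is a simplified stand-in and not the IMA algorithm: the step adaptation rule, the step limits, and the function names are assumptions made for illustration, whereas the real standard uses fixed lookup tables.

```python
import math

MIN_STEP, MAX_STEP = 4, 2048      # assumed limits for the adaptive step size

def _clamp(value, low, high):
    return max(low, min(high, value))

def _next_step(step, code):
    """Hypothetical rule: widen the step after large codes, narrow it after small ones."""
    if abs(code) >= 4:
        step *= 2
    elif abs(code) <= 1:
        step //= 2
    return _clamp(step, MIN_STEP, MAX_STEP)

def adpcm_like_encode(samples, start_step=16):
    """Return the first 16-bit sample plus one 4-bit code (-8..7) per later sample."""
    predicted, step, codes = samples[0], start_step, []
    for sample in samples[1:]:
        code = _clamp(round((sample - predicted) / step), -8, 7)
        codes.append(code)
        # Follow the decoder's reconstruction so that errors do not accumulate.
        predicted = _clamp(predicted + code * step, -32768, 32767)
        step = _next_step(step, code)
    return samples[0], codes

def adpcm_like_decode(first, codes, start_step=16):
    predicted, step, out = first, start_step, [first]
    for code in codes:
        predicted = _clamp(predicted + code * step, -32768, 32767)
        out.append(predicted)
        step = _next_step(step, code)
    return out

tone = [round(10000 * math.sin(2 * math.pi * 440 * n / 44100)) for n in range(400)]
first, codes = adpcm_like_encode(tone)
restored = adpcm_like_decode(first, codes)
worst = max(abs(a - b) for a, b in zip(tone, restored))
print(f"{len(codes)} four-bit codes, worst-case error {worst}")
```

     Because the encoder tracks the same reconstructed value the decoder will compute, both sides stay in step without any side information beyond the first sample.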

MPEG

     The Moving Picture Experts Group (MPEG) audio compression algorithm is an International Organization for Standardization (ISO) standard for high-fidelity audio compression.  It is one part of a three-part compression standard, the other two being video and system.  MPEG compression is lossy, but it can nonetheless achieve transparent, perceptually lossless compression.
     MPEG compression is firmly founded in psychoacoustic theory.  The premise behind this technique is simple: if the sound cannot be heard by the listener, then it does not need to be coded.  Human hearing is quite sensitive, but discerning differences in a collage of sounds is quite difficult.  Masking is the phenomenon where a strong signal "covers" the sound of another signal such that the softer one cannot be heard by the human ear.  An extension of this is temporal masking, which describes the masking of a soft sound after a loud sound has stopped.  The time, measured under scientific conditions, that it takes to hear the softer sound is about 5 ms.  Because the sensitivity of the ear is not linear but is instead dependent upon the frequency, masking effects differ depending on the frequency of the sounds.
     MPEG compression uses masking as the basis for compressing the audio data.  Those sounds that cannot be heard by the human ear do not need to be encoded.  The audio spectrum is divided into 32 frequency bands because sound masking occurs over a range of frequencies for each loud sound.  Then the volume levels are measured in each band to detect any masking.  Masking effects are taken into account, and the signal is then encoded.
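     The band-by-band masking decision can be pictured with a toy model.  The sketch below takes one level in dB for each of the 32 bands and marks a band as inaudible when a louder band nearby masks it.  The masking rule (a fixed offset plus a fixed fall-off per band) and every constant in it are assumptions made for illustration; the psychoacoustic models used by real encoders are considerably more detailed.

```python
def audible_bands(levels_db, offset_db=15.0, slope_db_per_band=10.0, floor_db=0.0):
    """Return one True/False flag per band: True if the band would be heard.

    Assumed rule: a masker hides anything more than offset_db below it,
    and its reach drops by slope_db_per_band for each band of distance.
    """
    flags = []
    for i, level in enumerate(levels_db):
        threshold = floor_db               # assumed threshold of hearing
        for j, masker in enumerate(levels_db):
            if j != i:
                reach = masker - offset_db - slope_db_per_band * abs(i - j)
                threshold = max(threshold, reach)
        flags.append(level > threshold)
    return flags

levels = [-10.0] * 32      # hypothetical frame: mostly quiet
levels[10] = 80.0          # a loud tone in band 10
levels[11] = 40.0          # a softer tone right next to it
levels[20] = 40.0          # an equally soft tone far away
flags = audible_bands(levels)
print([band for band, heard in enumerate(flags) if heard])
```

     Only the bands flagged as audible would need bits; the soft tone beside the loud one is dropped, while the equally soft tone ten bands away is kept.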
     In addition to encoding a single signal, the MPEG compression supports one or two audio channels in one of four modes:
 1) Monophonic
 2) Dual Monophonic -- two independent channels
 3) Stereo -- for stereo channels that share bits, but not using joint-stereo coding
 4) Joint stereo -- takes advantage of the correlations between stereo channels
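
     One common way to exploit the correlation between stereo channels is mid/side coding, sketched below.  Whether a given MPEG layer uses this or a different joint-stereo method is not stated here, so treat it as an illustration of the general idea; the function names are assumptions for the example.

```python
def to_mid_side(left, right):
    """Convert integer left/right samples to a mid (average) and side (difference) pair."""
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Exactly invert to_mid_side, restoring the bit lost when halving the sum."""
    left = [m + ((s + (s & 1)) >> 1) for m, s in zip(mid, side)]
    right = [l - s for l, s in zip(left, side)]
    return left, right

left = [1000, 1010, 1025, -300, -4]
right = [998, 1011, 1020, -298, 1]
mid, side = to_mid_side(left, right)
print(side)
```

     When the two channels are similar, the side values stay small and need far fewer bits than a full second channel.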
 
     The MPEG method allows for a compression ratio of up to 6:1.  Under optimal listening conditions, expert listeners could not distinguish the coded and original audio clips.  Thus, although this technique is lossy, it still produces accurate representations of the original audio signal.