Audio Compression Techniques
by Rusty Nejdl
Many different compression techniques
exist for various forms of data. Video compression is comparatively simple
because many pixels are repeated in groups. Different techniques
for still pictures include horizontal repeated pixel compression (pcx format),
data conversion (gif format), and fractal path repeated pixels. For
motion video, compression is relatively easy because large portions of
the screen don't change between each frame; therefore, only the changes
between images need to be stored. Text compression is extremely
simple compared to video and audio. One method counts the frequency
of each character and then assigns shorter bit codes to the most common
characters and longer bit codes to the least common characters.
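The character-frequency method described above can be sketched with a Huffman code. The text does not name a specific algorithm, so the choice of Huffman coding is an assumption, and the function name huffman_code_lengths is hypothetical. The sketch only computes how many bits each character would receive.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {character: code length in bits} for a Huffman code of text."""
    counts = Counter(text)
    if len(counts) == 1:
        # A single distinct character still needs one bit per symbol.
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {char: depth so far}).
    heap = [(n, i, {ch: 0}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every character in them one level deeper.
        merged = {ch: depth + 1 for ch, depth in {**left, **right}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaabbc")
```

In this small input the most common character, "a", ends up with a shorter code than "b" or "c".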
However, digital samples of audio data
have proven to be very difficult to compress; these techniques do not work
well at all for audio data. The data change often, and no values
are common enough to save sufficient space. Currently, five methods
are used to compress audio data with varying degrees of complexity, compressed
audio quality, and amount of data compression.
Sampling Basics
The digital representation of audio data offers
many advantages: high noise immunity, stability, and reproducibility.
Audio in digital form also allows for efficient implementation of many
audio processing functions through the computer.
Converting audio from analog to digital begins
by sampling the audio input at regular, discrete intervals of time and
quantizing the sampled values into a discrete number of evenly spaced levels.
According to the Nyquist sampling theorem, a time-sampled signal can faithfully represent
frequencies up to half the sampling rate. Frequencies above that threshold
are aliased to lower frequencies and appear as audible distortion.
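A minimal Python sketch of the two steps described above, sampling at regular intervals and quantizing to evenly spaced levels, may help. It also shows how a tone above half the sampling rate folds back to a lower frequency (aliasing). The function names and the choice of a sine wave as input are illustrative assumptions.

```python
import math

def sample_and_quantize(freq_hz, rate_hz, bits, n_samples):
    """Sample a unit-amplitude sine and quantize it to 2**bits even levels."""
    levels = 2 ** bits
    out = []
    for n in range(n_samples):
        x = math.sin(2 * math.pi * freq_hz * n / rate_hz)  # value in [-1, 1]
        # Map [-1, 1] onto integer levels 0 .. levels - 1.
        q = min(levels - 1, int((x + 1.0) / 2.0 * levels))
        out.append(q)
    return out

def alias_frequency(freq_hz, rate_hz):
    """Apparent frequency of a pure tone after sampling at rate_hz."""
    f = freq_hz % rate_hz
    return min(f, rate_hz - f)

speech = sample_and_quantize(1000, 8000, 8, 16)
```

Under these assumptions, a 3 kHz tone sampled at 8 kHz keeps its frequency, while a 5 kHz tone is indistinguishable from a 3 kHz tone once sampled.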
The sampling frequencies in use today range
from 8 kHz for basic speech to 48 kHz for commercial DAT machines.
The number of quantizer levels is typically a power of 2 to make full use
of a fixed number of bits per audio sample. The typical range for
bits per sample is between 8 and 16 bits. This allows for a range
of 256 to 65,536 levels of quantization per sample. With each additional
bit of quantizer resolution, the signal-to-noise ratio increases by roughly
6 decibels (dB). Thus, the dynamic range capability of these representations
is from 48 to 96 dB, respectively.
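The figures above follow from simple arithmetic, sketched here in Python. The function names are illustrative; the formula 20*log10(levels) is the usual way to express a ratio of amplitudes in decibels.

```python
import math

def quantizer_levels(bits):
    """Number of quantization levels available with the given bits per sample."""
    return 2 ** bits

def dynamic_range_db(bits):
    """Approximate dynamic range in dB, about 6.02 dB for each bit."""
    return 20 * math.log10(quantizer_levels(bits))

range_8 = dynamic_range_db(8)
range_16 = dynamic_range_db(16)
```

The results are slightly above 48 dB and 96 dB, which matches the rounded values quoted in the text.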
The data rates associated with uncompressed
digital audio are substantial. For audio data on a CD, for example,
which is sampled at 44.1 kHz with 16 bits per sample on each of two channels,
about 1.4 megabits per second are processed. A clear need exists
for some form of compression to enable the more efficient storage and transmission
of digital audio data.
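The CD figure can be checked with a short calculation. This is only a sketch of the arithmetic; the function name pcm_bit_rate is hypothetical.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels

cd_bits_per_second = pcm_bit_rate(44_100, 16, 2)
cd_bytes_per_minute = cd_bits_per_second // 8 * 60
```

This gives 1,411,200 bits per second, or roughly 10.6 million bytes for each minute of stereo audio.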
Voc File Compression
The simplest compression technique simply removes
silence from the sample. Creative Labs introduced this
form of compression with the Sound Blaster line of
sound cards. This method analyzes the whole sample and then replaces
each stretch of silence with a short byte code. It is very similar
to run-length coding.
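The idea can be sketched as a run-length pass over 8-bit unsigned samples. This is not the actual VOC byte layout: the SILENCE marker, the threshold, the minimum run length, and the function names are all assumptions made for illustration.

```python
SILENCE = "silence"  # hypothetical marker, not a real VOC block code

def pack_silence(samples, center=128, threshold=2, min_run=4):
    """Replace long runs of near-silent samples with (SILENCE, run_length)."""
    packed, quiet = [], []

    def flush():
        # Long quiet runs become one marker; short ones are kept verbatim.
        if len(quiet) >= min_run:
            packed.append((SILENCE, len(quiet)))
        else:
            packed.extend(quiet)
        quiet.clear()

    for s in samples:
        if abs(s - center) <= threshold:
            quiet.append(s)
        else:
            flush()
            packed.append(s)
    flush()
    return packed

def unpack_silence(packed, center=128):
    """Expand each marker back into a run of center-valued samples."""
    out = []
    for item in packed:
        if isinstance(item, tuple):
            out.extend([center] * item[1])
        else:
            out.append(item)
    return out

packed = pack_silence([200, 128, 129, 127, 128, 128, 90, 128, 60])
restored = unpack_silence(packed)
```

The scheme is slightly lossy in this form, since small fluctuations inside a silent run are replaced by the center value when the run is expanded.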
Linear Predictive Coding and Code Excited Linear Predictor
Linear Predictive Coding (LPC) was an early development in audio compression
that was used primarily for speech. An LPC
encoder compares speech to an analytical model of the vocal tract, then
throws away the speech and stores the parameters of the best-fit model.
The output quality was poor, often compared to synthesized computer speech, so
LPC is not used much today.
A later development, Code Excited Linear Predictor (CELP),
increased the complexity of the speech model, which faster computers made
practical, and produced much better results.
Sound quality improved, while the compression ratio increased. The
algorithm compares speech with an analytical model of the vocal tract and
computes the errors between the original speech and the model. It
transmits both the model parameters and a very compressed representation of
the errors.
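As a rough sketch of the model-fitting step, the Python below estimates linear prediction coefficients for one frame and computes the prediction errors. It assumes the autocorrelation method with the Levinson-Durbin recursion, a common textbook approach that the text does not itself specify. The function names are illustrative, and the compressed coding of the errors that CELP performs is not shown.

```python
import math

def autocorrelation(x, max_lag):
    """r[lag] = sum of x[n] * x[n - lag] over the frame."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return error-filter coefficients a[0..order] (a[0] = 1) and error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def prediction_error(x, a):
    """e[n] = x[n] + a[1]*x[n-1] + ... with samples before the frame taken as 0."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

frame = [math.sin(0.3 * n) for n in range(400)]   # stand-in for a speech frame
coeffs, _ = levinson_durbin(autocorrelation(frame, 2), 2)
errors = prediction_error(frame, coeffs)
```

For this very predictable test frame, the errors carry only a small fraction of the energy of the original signal, which is why storing the model parameters plus a coarse description of the errors can take far fewer bits than storing the samples.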
Mu-law and A-law Compression
Logarithmic compression is a good method because
it matches the way the human ear works. It only loses information
that the ear would not hear anyway, and gives good quality results for
both speech and music. Although the compression ratio is not very
high, it requires very little processing power. It is the international
standard telephony encoding format, defined in ITU (formerly CCITT)
Recommendation G.711. The mu-law variant is commonly used in North America and Japan for ISDN
8 kHz sampled, voice grade, digital telephone service.
It packs each 16-bit sample into 8 bits by
using a logarithmic table to encode a 13-bit dynamic range, dropping the
least significant 3 bits of precision. The quantization levels are
spaced unevenly instead of linearly to mimic the roughly logarithmic way the human ear
perceives loudness. Unlike
linear quantization, the logarithmic step spacings represent low-amplitude
samples with greater accuracy than higher-amplitude samples. This
method is fast and compresses data to half the size of the original sample.
It is used quite widely because it has been adopted almost universally.
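The logarithmic mapping can be sketched with the continuous mu-law curve, using mu = 255. This is an illustration of the principle only: real telephony codecs use a segmented lookup table rather than this formula, and the sign-plus-7-bit layout and function names below are assumptions.

```python
import math

MU = 255

def mulaw_encode(sample):
    """Compress a signed 16-bit sample to an 8-bit code (sign bit + 7 bits)."""
    x = max(-1.0, min(1.0, sample / 32767.0))
    magnitude = math.log1p(MU * abs(x)) / math.log1p(MU)   # 0.0 .. 1.0
    code = int(round(magnitude * 127))
    return (code | 0x80) if x < 0 else code

def mulaw_decode(code):
    """Expand an 8-bit code back to an approximate 16-bit sample."""
    magnitude = (code & 0x7F) / 127
    x = math.expm1(magnitude * math.log1p(MU)) / MU
    value = int(round(x * 32767))
    return -value if code & 0x80 else value

quiet_step = mulaw_decode(1) - mulaw_decode(0)      # step near silence
loud_step = mulaw_decode(127) - mulaw_decode(126)   # step near full scale
```

The step between neighboring codes is only a few sample units near silence but over a thousand near full scale, which is the uneven spacing described above.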
Adaptive Differential Pulse Code Modulation (ADPCM)
The Interactive Multimedia Association (IMA) is
a consortium of computer hardware and software vendors cooperating to develop
a standard for multimedia data. Their goal was to select a public-domain
audio compression algorithm that is able to provide a good compression
ratio while maintaining good audio quality. In addition, the coding
had to be simple enough to enable software-only decoding of 44.1 kHz samples
on a 20 MHz, 386-class computer.
This process is a simple conversion based
on the assumption that the changes between samples will not be very large.
The first sample value is stored in its entirety, and each successive
value describes the amount, within +/- 8 levels, by which the wave changes, which
uses only 4 bits instead of 16. Therefore, a 4:1 compression ratio
is achieved, with less loss as the sampling frequency increases. At
44.1 kHz, the compressed signal is an accurate representation of the uncompressed
sample that is difficult to discern from the original. This method
is used widely today because of its simplicity, wide acceptance, and high
level of compression.
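The scheme described above can be sketched in Python as follows. This is a simplified illustration, not the IMA algorithm itself: the IMA standard adapts its step size with fixed lookup tables, whereas the doubling and halving rule in _update, the starting step of 16, and the function names here are assumptions. Each difference is stored as a 4-bit signed code in the range -8 to 7.

```python
def _update(predicted, step, code):
    """State update shared by encoder and decoder so both stay in sync."""
    predicted = max(-32768, min(32767, predicted + code * step))
    if abs(code) >= 6:          # large jump: widen the step
        step = min(step * 2, 4096)
    elif abs(code) <= 1:        # small jump: narrow the step
        step = max(step // 2, 1)
    return predicted, step

def adpcm_encode(samples, step=16):
    """Return the first sample in full plus one 4-bit code per later sample."""
    first = samples[0]
    predicted, codes = first, []
    for s in samples[1:]:
        code = max(-8, min(7, round((s - predicted) / step)))
        codes.append(code)
        predicted, step = _update(predicted, step, code)
    return first, codes

def adpcm_decode(first, codes, step=16):
    """Rebuild an approximation of the samples from the 4-bit codes."""
    out, predicted = [first], first
    for code in codes:
        predicted, step = _update(predicted, step, code)
        out.append(predicted)
    return out

first, codes = adpcm_encode([0, 16, 32, 48])
decoded = adpcm_decode(first, codes)
```

Because the encoder tracks the decoder's reconstructed value rather than the true previous sample, quantization errors do not accumulate from one sample to the next.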
MPEG
The Moving Picture Experts Group (MPEG) audio
compression algorithm is an International Organization for Standardization
(ISO) standard for high fidelity audio compression. It is one part of
a three-part compression standard, the other two being video and system.
MPEG compression is lossy, but it can nonetheless achieve transparent,
perceptually lossless compression.
MPEG compression is firmly founded in psychoacoustic
theory. The premise behind this technique is simple: if the sound
cannot be heard by the listener, then it does not need to be coded.
Human hearing is quite sensitive, but discerning differences in a collage
of sounds is quite difficult. Masking is the phenomenon where a strong
signal "covers" the sound of another signal such that the softer one cannot
be heard by the human ear. An extension of this is temporal masking,
which describes masking of a soft sound after a loud sound has stopped. The
time, measured under scientific conditions, that it takes to hear the softer
sound is about 5 ms. Because the sensitivity of the ear is not linear
but is instead dependent upon the frequency, masking effects differ depending
on the frequency of the sounds.
MPEG compression uses masking as the basis
for compressing the audio data. Those sounds that cannot be heard
by the human ear do not need to be encoded. The audio spectrum is
divided into 32 frequency bands because sound masking occurs over a range
of frequencies for each loud sound. Then the volume levels are measured
in each band to detect any masking. Masking effects are taken
into account, and the signal is then encoded.
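A toy Python sketch can illustrate the band-by-band masking test described above. It is deliberately crude: the rule that a loud band masks its neighbors with an effect that falls off by a fixed number of decibels per band, along with the parameter values and the function name, are assumptions for illustration and are not taken from the MPEG standard, whose psychoacoustic models are far more detailed.

```python
def audible_bands(levels_db, drop_per_band_db=12.0, offset_db=10.0):
    """Return indices of bands that no louder band masks.

    Band j is taken to mask band i when
    levels_db[j] - offset_db - drop_per_band_db * |i - j| > levels_db[i].
    Both parameters are illustrative assumptions.
    """
    audible = []
    for i, level in enumerate(levels_db):
        masked = any(
            other - offset_db - drop_per_band_db * abs(i - j) > level
            for j, other in enumerate(levels_db)
            if j != i
        )
        if not masked:
            audible.append(i)
    return audible

levels = [0.0] * 32     # one measured level per frequency band
levels[10] = 80.0       # a loud tone
levels[11] = 40.0       # a softer tone right next to it
levels[20] = 30.0       # a soft tone far away in frequency
keep = audible_bands(levels)
```

With these made-up levels, the soft tone next to the loud one is treated as inaudible and could be dropped, while the equally soft tone far away in frequency still has to be encoded.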
In addition to encoding a single signal, the
MPEG compression supports one or two audio channels in one of four modes:
1) Monophonic
2) Dual monophonic -- two independent channels
3) Stereo -- for stereo channels that share bits, but not using joint-stereo coding
4) Joint stereo -- takes advantage of the correlations between stereo channels
The MPEG method allows for a compression ratio
of up to 6:1. Under optimal listening conditions, expert listeners
could not distinguish the coded and original audio clips. Thus, although
this technique is lossy, it still produces accurate representations of
the original audio signal.
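To put the ratios quoted in this article side by side, the short Python sketch below applies each one to the CD data rate. This is purely an arithmetic comparison: applying all three ratios to CD audio is an assumption made for illustration, and the names are hypothetical.

```python
CD_BITS_PER_SECOND = 44_100 * 16 * 2   # uncompressed stereo CD audio

# Compression ratios as stated in the text (MPEG is "up to" 6:1).
RATIOS = {"mu-law": 2, "ADPCM": 4, "MPEG": 6}

def compressed_rate(source_bits_per_second, ratio):
    """Bit rate after compressing by the given ratio."""
    return source_bits_per_second / ratio

rates = {name: compressed_rate(CD_BITS_PER_SECOND, ratio)
         for name, ratio in RATIOS.items()}
```

At 6:1, the roughly 1.4 megabits per second of CD audio drops to about 235 kilobits per second.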