1. From pressure wave to discrete-time signal
Pipeline:
sound in air → microphone → analog voltage → anti-alias filter → sampler → ADC → samples
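- Microphone: produces a voltage roughly proportional to air pressure.
- Anti-alias filter: low-pass with cutoff just below $f_s/2$, where $f_s$ is the sample rate.
- Sampler: takes values at $t = nT_s$, $n \in \mathbb{Z}$, with sample period $T_s = 1/f_s$.
You get a discrete-time, continuous-valued signal, where $x_a(t)$ is the filtered analog voltage:
$$x[n] = x_a(nT_s)$$
At this point, $x[n]$ is still real-valued.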
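The sampling step can be sketched in a few lines of NumPy. This is only an illustration: the sample rate, tone frequency, and the `analog_signal` function are assumed stand-ins for the real filtered microphone voltage.

```python
import numpy as np

fs = 8000          # assumed sample rate in Hz (illustrative value)
f0 = 440.0         # assumed tone frequency in Hz, well below fs / 2
duration = 0.01    # seconds of signal to sample

def analog_signal(t):
    """Hypothetical stand-in for the band-limited analog voltage x_a(t)."""
    return 0.5 * np.sin(2 * np.pi * f0 * t)

n = np.arange(int(round(duration * fs)))  # sample indices 0, 1, 2, ...
t = n / fs                                # sampling instants t = n * T_s
x = analog_signal(t)                      # x[n] = x_a(n * T_s), still real-valued
```

The result `x` is a float64 array: discrete in time, but not yet quantized in amplitude.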
2. Quantization: continuous amplitude → discrete levels
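The ADC then maps each real $x[n]$ to one of $2^B$ levels, where $B$ = bit depth.
For uniform PCM quantization over a symmetric range $[-X_{\max}, X_{\max}]$:
- Step size: $\Delta = 2X_{\max} / 2^B$.
- Quantizer index (conceptually): $k[n] = \mathrm{round}(x[n] / \Delta)$.
- Stored integer: something like $k[n]$ clipped to $[-2^{B-1},\, 2^{B-1} - 1]$.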
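A minimal sketch of such a uniform quantizer follows, assuming a symmetric range of ±`x_max` and signed integer codes. The function names `quantize_uniform` and `dequantize` are made up for this example, and real ADCs and codecs differ in rounding and saturation details.

```python
import numpy as np

def quantize_uniform(x, bits, x_max=1.0):
    """Map real samples to signed integer codes with 2**bits levels."""
    step = 2.0 * x_max / (2 ** bits)           # step size over [-x_max, x_max]
    k = np.round(np.asarray(x) / step)         # index of the nearest level
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    k = np.clip(k, lo, hi).astype(np.int64)    # saturate at the ends of the range
    return k, step

def dequantize(codes, step):
    """Map integer codes back to approximate real amplitudes."""
    return codes.astype(np.float64) * step

x = np.array([-1.0, -0.5, 0.0, 0.3, 1.0])
codes, step = quantize_uniform(x, bits=8)
x_hat = dequantize(codes, step)
```

For samples that are not clipped, the reconstruction error `x_hat - x` stays within half a step. The value +1.0 saturates because the largest positive code is one step short of full scale.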
Classic rule of thumb for a full-scale sine in an ideal uniform quantizer:
$$\mathrm{SNR} \approx 6.02\,B + 1.76 \text{ dB}$$
For a full explanation and derivation, see Quantization, SNR, Bits, and Sample Rate
So:
- 16-bit → about 98 dB dynamic range.
- 8-bit → about 50 dB; this is where companding (like μ-law) starts to matter.
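These figures can be checked numerically against the commonly quoted 6.02·B + 1.76 dB rule. The sketch below quantizes a near-full-scale sine and measures the SNR directly. The test frequency and sample count are arbitrary choices, and `measured_snr_db` is a name invented for this example.

```python
import numpy as np

def measured_snr_db(bits, num_samples=100_000):
    """Quantize a near-full-scale sine and return the measured SNR in dB."""
    step = 2.0 / (2 ** bits)                  # step size for the range [-1, 1]
    hi = 2 ** (bits - 1) - 1                  # largest positive integer code
    n = np.arange(num_samples)
    # Amplitude hi * step is the largest value that does not clip
    x = hi * step * np.sin(2 * np.pi * 0.01234567 * n)
    codes = np.clip(np.round(x / step), -hi - 1, hi)
    err = codes * step - x                    # quantization error
    return 10 * np.log10(np.mean(x ** 2) / np.mean(err ** 2))

for bits in (8, 16):
    rule = 6.02 * bits + 1.76
    print(f"{bits}-bit: rule {rule:.1f} dB, measured {measured_snr_db(bits):.1f} dB")
```

The measured values should land close to the rule of thumb. Real signals rarely sit at full scale, so the usable SNR in practice is lower.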
In memory, this might be:
- int16
- int24, packed into 3 bytes
- float32, typically used in ML frameworks, with values normalized to $[-1, 1]$
Most "raw" audio APIs and WAV files are linear PCM integers; ML code often converts to float.
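The conversions between these formats might look like the following sketch. Dividing by 32768 is one common convention for int16 (some libraries use 32767), and the 24-bit decoder assumes little-endian packing as used in WAV. All function names are illustrative.

```python
import numpy as np

def int16_to_float32(pcm):
    """Scale int16 PCM to float32 in [-1, 1) by dividing by 32768."""
    return pcm.astype(np.float32) / 32768.0

def float32_to_int16(x):
    """Scale floats in [-1, 1] back to int16, saturating at the limits."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * 32768.0)
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def unpack_int24_le(raw):
    """Decode little-endian packed 24-bit samples (3 bytes each) to integers."""
    b = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 3).astype(np.int32)
    value = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    # Sign-extend: values with the top (24th) bit set are negative
    return np.where(value >= (1 << 23), value - (1 << 24), value)

pcm = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
as_float = int16_to_float32(pcm)
round_trip = float32_to_int16(as_float)

packed = bytes([0xFF, 0xFF, 0x7F, 0x00, 0x00, 0x80, 0xFF, 0xFF, 0xFF])
unpacked = unpack_int24_le(packed)
```

Here the int16 round trip is lossless because float32 has enough mantissa bits to hold every 16-bit value exactly.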
3. How it's actually laid out in memory / files
At the lowest level, audio is just contiguous samples, e.g.:
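- Mono, 16-bit PCM at 44.1 kHz → array of int16, one sample after another in time order: s[0], s[1], s[2], …
- Stereo → interleaved by default, alternating left and right: L[0], R[0], L[1], R[1], …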
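As a small illustration of the interleaved layout, the sketch below builds an interleaved stereo buffer from two made-up channel arrays and then splits it back into a channels-first view.

```python
import numpy as np

left = np.array([10, 11, 12, 13], dtype=np.int16)       # illustrative samples
right = np.array([-10, -11, -12, -13], dtype=np.int16)

# Interleave: stack as (frames, channels), then flatten row by row
interleaved = np.stack([left, right], axis=1).reshape(-1)

# De-interleave: view as (frames, channels), then transpose to (channels, frames)
channels_first = interleaved.reshape(-1, 2).T
```

`interleaved` holds L0, R0, L1, R1, and so on, which is the order the samples appear in a stereo PCM data chunk.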
A typical WAV file =
- Header (metadata)
  - sample rate (e.g. 44100)
  - bit depth (e.g. 16)
  - number of channels (1, 2, …)
  - encoding (e.g. PCM, μ-law, etc.)
- Data chunk = raw PCM samples in the chosen format.
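To see the header and data chunk concretely, the sketch below writes a tiny WAV file into memory with Python's standard `wave` module and reads the metadata back. The sample values are arbitrary, and the `wave` module handles linear PCM rather than every encoding a WAV file can carry.

```python
import io
import struct
import wave

# Write a tiny mono, 16-bit, 44.1 kHz WAV into an in-memory buffer
samples = [0, 1000, -1000, 32767, -32768]
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # bytes per sample: 2 bytes = 16 bits
    w.setframerate(44100)
    w.writeframes(struct.pack("<5h", *samples))  # little-endian int16

# Read it back: header fields first, then the raw data chunk
buffer.seek(0)
with wave.open(buffer, "rb") as r:
    header = {
        "sample_rate": r.getframerate(),
        "bit_depth": 8 * r.getsampwidth(),
        "channels": r.getnchannels(),
        "frames": r.getnframes(),
    }
    raw = r.readframes(r.getnframes())

decoded = list(struct.unpack(f"<{header['frames']}h", raw))
```

The raw bytes in `raw` only become meaningful once the header says how to interpret them, which is why the three header fields matter so much downstream.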
So everything you care about at the DSP/ML level is basically:
- sample rate
- bit depth / sample format (int or float, μ-law or linear)
- number of channels
4. Typical ML representation
When you load audio with something like torchaudio, librosa, etc., you usually get:
- A tensor of shape (channels, time), or (time,) for mono
- dtype: float32 in $[-1, 1]$ (most common), or int16 if you ask for raw PCM.
Then you might transform it further:
- STFT → complex spectrogram
- |·| → magnitude spectrogram
- mel filterbank → mel-spectrogram
- log → log-mel spectrogram
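The chain of transforms above can be sketched with plain NumPy. This is a simplified illustration, not what torchaudio or librosa do internally. The frame size, hop, number of mel bands, the HTK-style mel formula, and the `1e-6` floor inside the log are all assumptions made for the example.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT with a Hann window; returns shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, freq_bins)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (center - lo)
        falling = (hi - freqs) / (hi - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

sr = 16000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 1000.0 * t)     # one second of a 1 kHz test tone

spec = stft(x)                               # complex spectrogram
mag = np.abs(spec)                           # magnitude spectrogram
mel = mel_filterbank(sr, 512, 40) @ mag      # mel-spectrogram
log_mel = np.log(mel + 1e-6)                 # log-mel spectrogram
```

Each step keeps the time axis and changes only the frequency axis: 257 linear bins become 40 mel bands, and the log compresses the dynamic range.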