1. From pressure wave to discrete-time signal

Pipeline:

sound in air → microphone → analog voltage → anti-alias filter → sampler → ADC → samples

  • Microphone: produces a voltage proportional (roughly) to air pressure.

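  • Anti-alias filter: low-pass with cutoff just below the Nyquist frequency $f_s/2$, where $f_s$ is the sample rate.

  • Sampler: takes values at $t = nT_s$, with sampling period $T_s = 1/f_s$.

You get a discrete-time, continuous-valued signal:

$$x[n] = x_a(nT_s), \quad n = 0, 1, 2, \dots$$

where $x_a(t)$ is the filtered analog voltage. At this point, the samples are still real-valued.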
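A minimal sketch of the sampling step, assuming NumPy and an ideal sampler (no jitter, no analog filter modelled). The function name `sample_sine` and the 8 kHz rate are illustrative choices, not part of the pipeline above. It also shows why the anti-alias filter matters: a tone above half the sample rate produces the same samples as a lower-frequency tone.

```python
import numpy as np

def sample_sine(freq_hz, fs_hz, duration_s, amplitude=1.0):
    """Return samples of amplitude * sin(2*pi*freq_hz*t) taken at t = n / fs_hz."""
    n = np.arange(int(round(duration_s * fs_hz)))  # sample indices 0..N-1
    t = n / fs_hz                                  # sampling instants
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

fs = 8000.0
x = sample_sine(440.0, fs, 0.01)         # 80 real-valued (float64) samples

# 7560 Hz is above fs/2 = 4000 Hz. Sampled at 8 kHz it is indistinguishable
# from a sign-flipped 440 Hz sine, because 7560 = 8000 - 440.
x_alias = sample_sine(7560.0, fs, 0.01)
```

Once sampled, nothing downstream can tell `x_alias` apart from a phase-inverted 440 Hz tone, which is why the low-pass has to sit before the sampler.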


2. Quantization: continuous amplitude → discrete levels

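The ADC then maps each real-valued sample $x[n]$ to one of $2^B$ levels, where $B$ = bit depth.

For uniform PCM quantization over a symmetric range $[-A, A]$:

  • Step size: $\Delta = 2A / 2^B$.
  • Quantizer index (conceptually): $q[n] = \mathrm{round}(x[n] / \Delta)$.
  • Stored integer: something like $\mathrm{clip}(q[n],\, -2^{B-1},\, 2^{B-1} - 1)$, i.e. a signed $B$-bit value.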
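A rough sketch of a uniform quantizer in plain Python, assuming a symmetric input range [-full_scale, full_scale] and signed integer codes. The names `quantize`, `dequantize`, and `full_scale` are hypothetical, and real ADCs differ in rounding and clipping details.

```python
def quantize(x, bits, full_scale=1.0):
    """Map a real sample x to a signed integer code using `bits` bits."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels           # step size: range width / number of levels
    q = round(x / step)                        # nearest level (Python rounds exact halves to even)
    lo, hi = -(levels // 2), levels // 2 - 1   # e.g. -32768..32767 for 16 bits
    return max(lo, min(hi, q))                 # clip to the representable codes

def dequantize(code, bits, full_scale=1.0):
    """Map an integer code back to an approximate real value."""
    step = 2.0 * full_scale / 2 ** bits
    return code * step
```

For inputs that do not clip, the round trip `dequantize(quantize(x, bits), bits)` is off by at most half a step. Positive full scale itself clips to the top code, because a signed range has one more negative code than positive.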

Classic rule of thumb for a full-scale sine in an ideal uniform quantizer with $B$ bits:

$$\mathrm{SNR} \approx 6.02\,B + 1.76\ \text{dB}$$

For a full explanation and derivation, see Quantization, SNR, Bits, and Sample Rate.

So:

  • 16-bit ≈ 98 dB dynamic range.
  • 8-bit ≈ 50 dB → this is where companding (like μ-law) starts to matter.
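The numbers above follow from the usual 6.02·B + 1.76 dB rule for a full-scale sine. As a sanity check, here is a small sketch that compares the rule with a direct measurement: quantize a near-full-scale sine and divide signal power by error power. The function names, the test frequency, and the sample count are arbitrary choices for illustration.

```python
import math

def ideal_snr_db(bits):
    """Rule of thumb for a full-scale sine in an ideal uniform quantizer."""
    return 6.02 * bits + 1.76

def measured_snr_db(bits, n=50_000, cycles_per_sample=0.1234567):
    """Quantize a near-full-scale sine over [-1, 1] and measure SNR in dB."""
    step = 2.0 / 2 ** bits
    amp = 1.0 - step                 # stay one step below full scale so nothing clips
    sig_pow = 0.0
    err_pow = 0.0
    for i in range(n):
        x = amp * math.sin(2 * math.pi * cycles_per_sample * i)
        xq = round(x / step) * step  # quantize, then map back to a real value
        sig_pow += x * x
        err_pow += (x - xq) ** 2
    return 10 * math.log10(sig_pow / err_pow)
```

`ideal_snr_db(16)` gives about 98.08 and `ideal_snr_db(8)` about 49.92. The measured value for 8 bits should land close to the rule, though not exactly on it, since quantization error for a sine is only approximately uniform.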

In memory, this might be:

  • int16
  • int24 packed into 3 bytes
  • float32 typically used in ML frameworks → values normalized to [-1, 1]

Most "raw" audio APIs & WAV files are linear PCM integers; ML code often converts to float.
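A sketch of the integer/float conversions, assuming NumPy. Dividing by 32768 is one common convention (some libraries scale by 32767 instead), and the helper names here are hypothetical. The last helper shows 24-bit samples packed into 3 little-endian bytes each, as used in 24-bit WAV data.

```python
import numpy as np

def int16_to_float32(pcm):
    """Scale int16 PCM to float32 in [-1, 1)."""
    return pcm.astype(np.float32) / 32768.0

def float32_to_int16(x):
    """Scale floats in [-1, 1] to int16, clipping anything out of range."""
    scaled = np.round(x * 32768.0)
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def pack_int24_le(samples):
    """Pack signed 24-bit integers into 3 little-endian bytes per sample."""
    return b"".join(int(s).to_bytes(3, "little", signed=True) for s in samples)

pcm = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
as_float = int16_to_float32(pcm)       # float32 values: -1.0, 0.0, 0.5, just under 1.0
round_trip = float32_to_int16(as_float)
```

With this scaling the int16 → float32 → int16 round trip is exact, since every int16 value divided by 32768 is representable in float32.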


3. How it's actually laid out in memory / files

At the lowest level, audio is just contiguous samples, e.g.:

  • Mono, 16-bit PCM at 44.1 kHz → array of int16:

      [s0, s1, s2, s3, ...]

  • Stereo → interleaved by default:

      [L0, R0, L1, R1, L2, R2, ...]
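To make the interleaved layout concrete, here is a small NumPy sketch that goes between per-channel arrays and a single interleaved buffer (left sample, right sample, left sample, ...). The helper names are hypothetical.

```python
import numpy as np

def interleave(channels):
    """Turn a list of equal-length 1-D channel arrays into one interleaved buffer."""
    # Stack as (time, channels), then flatten row by row: L0 R0 L1 R1 ...
    return np.stack(channels, axis=1).reshape(-1)

def deinterleave(buf, num_channels):
    """Turn an interleaved buffer into an array of shape (channels, time)."""
    return buf.reshape(-1, num_channels).T

left = np.array([10, 11, 12], dtype=np.int16)
right = np.array([-10, -11, -12], dtype=np.int16)

buf = interleave([left, right])      # 10, -10, 11, -11, 12, -12
planar = deinterleave(buf, 2)        # row 0 is left, row 1 is right
```

The `(channels, time)` result of `deinterleave` is the planar layout most ML code expects, while the flat buffer matches what a PCM data chunk or audio callback typically hands you.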

A typical WAV file =

  1. Header (metadata)

    • sample rate (e.g. 44100)
    • bit depth (e.g. 16)
    • number of channels (1, 2, …)
    • encoding (e.g. PCM, µ-law, etc.)
  2. Data chunk = raw PCM samples in the chosen format.
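A sketch of the header-plus-data structure using Python's standard-library `wave` module, which handles uncompressed PCM. It writes a tiny stereo 16-bit file into an in-memory buffer and reads the header fields and raw samples back. The sample values and dictionary keys are made up for illustration.

```python
import io
import struct
import wave

# Three stereo frames, interleaved as L0 R0 L1 R1 L2 R2
samples = [0, 0, 1000, -1000, 2000, -2000]

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)        # stereo
    w.setsampwidth(2)        # 2 bytes per sample = 16-bit
    w.setframerate(44100)    # sample rate in Hz
    w.writeframes(struct.pack("<6h", *samples))  # little-endian int16 data

buf.seek(0)
with wave.open(buf, "rb") as r:
    header = {
        "channels": r.getnchannels(),
        "sample_width_bytes": r.getsampwidth(),
        "sample_rate": r.getframerate(),
        "frames": r.getnframes(),
    }
    raw = r.readframes(r.getnframes())

# Decode the data chunk back into Python integers
decoded = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
```

Note that a frame is one sample per channel, so six int16 values in stereo come back as three frames.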

So everything you care about at the DSP/ML level is basically:

  • sample rate
  • bit depth / sample format (int or float, μ-law or linear)
  • number of channels

4. Typical ML representation

When you load audio with something like torchaudio, librosa, etc., you usually get:

  • A tensor of shape
    • (channels, time) or (time,) for mono
  • dtype:
    • float32 in [-1, 1] (most common), or
    • int16 if you ask for raw PCM.

Then you might transform it further:

  • STFT → complex spectrogram
  • |·| → magnitude spectrogram
  • mel filterbank → mel-spectrogram
  • log(·) → log-mel spectrogram
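The chain above can be sketched in plain NumPy. This is a simplified illustration, not what torchaudio or librosa do internally: it assumes a symmetric Hann window, no padding, unnormalized triangular filters on the common 2595·log10(1 + f/700) mel scale, and a filterbank applied to magnitudes (many libraries apply it to power instead). All function names and default parameters are arbitrary choices.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def stft(x, n_fft=512, hop=128):
    """Complex spectrogram of shape (n_fft // 2 + 1, frames), no padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters of shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale from 0 to sr/2
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(x, sr, n_fft=512, hop=128, n_mels=40, eps=1e-10):
    spec = stft(x, n_fft, hop)                      # complex spectrogram
    mag = np.abs(spec)                              # magnitude spectrogram
    mel = mel_filterbank(sr, n_fft, n_mels) @ mag   # mel-spectrogram
    return np.log(mel + eps)                        # log-mel; eps avoids log(0)

sr = 16000
t = np.arange(sr) / sr
lm = log_mel(np.sin(2 * np.pi * 440.0 * t), sr)     # shape (n_mels, frames)
```

For a pure 440 Hz tone, the energy concentrates in the mel band whose triangle covers the FFT bins around 440 Hz, which is a quick way to check that the filterbank is wired up correctly.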