A100

  • For example, A100-80GB has:
    • 6912 CUDA Cores
    • 432 Tensor Cores (Gen 3)
    • 108 Streaming Multiprocessors (SM)
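
Dividing the totals above by the SM count gives the per-SM breakdown. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and do not come from any NVIDIA API:

```python
# Totals for the A100-80GB as listed above.
cuda_cores = 6912
tensor_cores = 432
sms = 108

# Integer division is exact here: both totals are multiples of the SM count.
cuda_cores_per_sm = cuda_cores // sms
tensor_cores_per_sm = tensor_cores // sms

print(f"CUDA cores per SM:   {cuda_cores_per_sm}")    # 64
print(f"Tensor Cores per SM: {tensor_cores_per_sm}")  # 4
```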

H100

  1. 132 streaming multiprocessors (SM) on the SXM5 variant
  2. Each SM has 128 FP32 CUDA cores, for a total of 16896 (132 * 128) CUDA cores
  3. Each SM provides up to 227 KB of shared memory to a thread block
  4. Summed over all SMs, this shared memory has a bandwidth of about 33 TB/s
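
The totals implied by the per-SM figures above can be checked with a few lines of Python. This is only a sketch: it assumes the 33 TB/s figure is the aggregate across all SMs and uses decimal units for bandwidth (1 TB/s = 1000 GB/s). The variable names are illustrative.

```python
sms = 132
fp32_cores_per_sm = 128
shared_mem_per_sm_kb = 227
shared_mem_bw_total_tbs = 33  # assumption: aggregate across all SMs

# Total FP32 CUDA cores on the chip.
total_cores = sms * fp32_cores_per_sm

# Shared memory summed over all SMs, converted from KB to MB.
total_shared_mem_mb = sms * shared_mem_per_sm_kb / 1024

# Share of the aggregate bandwidth that each SM contributes.
bw_per_sm_gbs = shared_mem_bw_total_tbs * 1000 / sms

print(f"Total FP32 CUDA cores: {total_cores}")
print(f"Total shared memory: {total_shared_mem_mb:.1f} MB")
print(f"Shared memory bandwidth per SM: {bw_per_sm_gbs:.0f} GB/s")
```

Under these assumptions, each SM contributes roughly 250 GB/s of shared memory bandwidth.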

• H100 - 2-3x faster than A100 at half precision and up to 6x faster with fp8; it is becoming available on all Tier-1 compute clouds.

• GH200 - 2 chips on one card - (1) H100 w/ 96GB HBM3 or 144GB HBM3e + (2) Grace CPU w/ up to 480GB LPDDR5X RAM (624GB of combined memory with the 144GB HBM3e variant) - the first units have been reported as available.

NVIDIA

  • Abbreviations:

    • CUDA: Compute Unified Device Architecture (proprietary to NVIDIA)
  • NVIDIA-specific key GPU characteristics:

    • CUDA Cores

      • similar to CPU cores, but whereas CPUs typically have 10-100 powerful cores, CUDA Cores are weaker, come in thousands, and make massively parallel general-purpose computation possible. Like CPU cores, CUDA Cores perform a single operation in each clock cycle.
    • Tensor Cores

      • special compute units designed specifically for fast multiply-and-add operations such as matrix multiplication. They perform multiple operations in each clock cycle and can execute extremely fast computations on low- or mixed-precision data types (fp16, bf16, tf32, fp8, etc.) at the cost of some precision. These cores are designed specifically for ML workloads.
    • Streaming Multiprocessors (SM)

      • clusters of CUDA Cores, Tensor Cores and other components.
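
The precision given up by low-precision data types can be seen without a GPU. The sketch below uses Python's standard `struct` module, whose `e` format code is IEEE 754 half precision (fp16), to round-trip a value through fp16. The helper name `to_fp16` is hypothetical. It shows only the rounding that happens when a value is stored in fp16, not how Tensor Cores accumulate results.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp16 value and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

x = 0.1
x16 = to_fp16(x)
rel_err = abs(x16 - x) / x

print(f"fp64 value:     {x!r}")
print(f"fp16 value:     {x16!r}")
print(f"relative error: {rel_err:.2e}")

# Values that fp16 can represent exactly survive the round trip unchanged.
print(to_fp16(0.5) == 0.5)
```

fp16 keeps 11 significant bits, so the relative rounding error for values in its normal range stays below about 5e-4.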