• ZeRO-3 is in use if hasattr(p, "ds_tensor") is True for a parameter p
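
The check above can be wrapped in a small helper. This is a minimal sketch that relies only on the ds_tensor attribute mentioned in the note. The helper names uses_zero3 and model_uses_zero3 are made up for illustration, and the SimpleNamespace objects are stand-ins so the snippet runs without torch or DeepSpeed installed.

```python
from types import SimpleNamespace


def uses_zero3(p) -> bool:
    """Return True if parameter p carries the ds_tensor attribute."""
    return hasattr(p, "ds_tensor")


def model_uses_zero3(params) -> bool:
    """Return True if any parameter in the iterable carries ds_tensor."""
    return any(uses_zero3(p) for p in params)


# Stand-in objects (assumption: real code would pass model.parameters()).
plain_param = SimpleNamespace(data=[1.0, 2.0])
zero3_param = SimpleNamespace(data=[], ds_tensor=[1.0])
```

With a real model, the call would presumably look like model_uses_zero3(model.parameters()).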

FP8

Sharding and gathering

DeepSpeed’s approach to gathering parameters during the forward pass is different from FSDP’s unit-based approach.

  • Parameter Tracking:
    • DeepSpeed uses a system of parameter tracking and just-in-time gathering. It doesn’t have predefined units like FSDP, but instead tracks each parameter’s usage.
    • When an operation needs a parameter, DeepSpeed gathers it on-demand from across the shards.
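
The on-demand pattern described above can be illustrated with a toy model in plain Python. This is not DeepSpeed code: ShardedParam and run_op are hypothetical names, the "ranks" are just lists in one process, and releasing the full copy right after the operation is an assumption of this sketch.

```python
class ShardedParam:
    """Toy parameter whose values are split across simulated ranks."""

    def __init__(self, name, shards):
        self.name = name
        self.shards = shards  # one list of values per simulated rank
        self.full = None      # populated only while gathered

    def gather(self):
        # Stand-in for an all-gather: concatenate every rank's shard.
        if self.full is None:
            self.full = [v for shard in self.shards for v in shard]
        return self.full

    def release(self):
        # Drop the full copy again; only the shards stay resident.
        self.full = None


def run_op(op, param):
    """Gather param right before op needs it, then free it afterwards."""
    values = param.gather()
    try:
        return op(values)
    finally:
        param.release()


w = ShardedParam("w", shards=[[1.0, 2.0], [3.0, 4.0]])
total = run_op(sum, w)
```

The point of the sketch is the lifetime: the full parameter exists only for the duration of the operation that uses it, so no predefined unit boundary is needed.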

Parameter partitioning

  • This is set up during model initialization with the deepspeed.zero.Init() context manager.

    • deepspeed/runtime/zero/partition_parameters.py
  • Most of the work happens in _convert_to_deepspeed_param() and partition()
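
To make the idea of partitioning concrete, here is a toy sketch of splitting one flattened parameter across ranks. It does not reproduce what _convert_to_deepspeed_param() or partition() do internally. The function name partition_flat, the equal-size shards, and the zero padding of the tail are all assumptions chosen for illustration.

```python
import math


def partition_flat(values, world_size, pad_value=0.0):
    """Split a flat list into world_size equal shards, padding the tail."""
    shard_size = math.ceil(len(values) / world_size)
    pad_count = shard_size * world_size - len(values)
    padded = values + [pad_value] * pad_count
    # Rank r owns the r-th contiguous slice of the padded buffer.
    return [
        padded[r * shard_size:(r + 1) * shard_size]
        for r in range(world_size)
    ]


flat_weight = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = partition_flat(flat_weight, world_size=2)
```

Here rank 0 would hold the first three values and rank 1 the remaining two plus one padding element, so every rank stores a shard of the same length.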

Parameter tracking

  • DeepSpeed tracks the status of parameters using the ZeroParamStatus enum, which can be NOT_AVAILABLE, AVAILABLE, or INFLIGHT
  • In deepspeed/runtime/zero/partition_parameters.py, the work is done through NoGatherCoalescedHandle and