- We’re using ZeRO-3 if `hasattr(p, "ds_tensor")` is `True` for a parameter `p`.
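
A minimal sketch of that check, assuming only what is stated above: a ZeRO-3 parameter carries a `ds_tensor` attribute. The helper names `is_zero3_param` and `uses_zero3` are hypothetical, and `SimpleNamespace` objects stand in for real parameters so the snippet runs without PyTorch or DeepSpeed installed.

```python
from types import SimpleNamespace


def is_zero3_param(p):
    """Return True if `p` carries the `ds_tensor` attribute."""
    return hasattr(p, "ds_tensor")


def uses_zero3(params):
    """Return True if any parameter in `params` looks ZeRO-3 partitioned."""
    return any(is_zero3_param(p) for p in params)


# Stand-ins for parameters (hypothetical); real code would
# iterate over model.parameters() instead.
plain = SimpleNamespace(data=[1.0, 2.0])
sharded = SimpleNamespace(data=[], ds_tensor=[1.0])

print(is_zero3_param(plain))    # False
print(is_zero3_param(sharded))  # True
```

In a real training script the same `hasattr` test would be applied to each parameter of the wrapped model.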
FP8
Sharding and gathering
DeepSpeed’s approach to gathering parameters during the forward pass is different from FSDP’s unit-based approach.
- Parameter Tracking:
  - DeepSpeed uses a system of parameter tracking and just-in-time gathering. It doesn’t have predefined units like FSDP, but instead tracks each parameter’s usage.
  - When an operation needs a parameter, DeepSpeed gathers it on demand from across the shards.
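
To make the just-in-time idea concrete, here is a toy simulation in plain Python. It is not DeepSpeed code: the class `ShardedParam`, its `gather`/`release` methods, and the `scale` operation are all hypothetical, and the "all-gather" is just list concatenation over simulated ranks. It only illustrates the pattern of gathering a parameter when an operation needs it and dropping the full copy afterwards.

```python
class ShardedParam:
    """Toy parameter whose values are split across simulated ranks."""

    def __init__(self, name, shards):
        self.name = name
        self.shards = shards      # one list of values per simulated rank
        self.full = None          # populated only while gathered
        self.gather_count = 0     # how many times a gather was needed

    def gather(self):
        """Simulated all-gather: concatenate every rank's shard."""
        if self.full is None:
            self.full = [v for shard in self.shards for v in shard]
            self.gather_count += 1
        return self.full

    def release(self):
        """Drop the full copy so only the shards stay resident."""
        self.full = None


def scale(param, x):
    """An 'operation' that needs the full parameter just in time."""
    weights = param.gather()
    try:
        return [w * x for w in weights]
    finally:
        param.release()


w = ShardedParam("w", shards=[[1.0, 2.0], [3.0, 4.0]])
out = scale(w, 10.0)
print(out)             # [10.0, 20.0, 30.0, 40.0]
print(w.full is None)  # True: the full copy is released after use
```

Each call to `scale` gathers again, which mirrors the trade-off of per-parameter gathering: lower resident memory in exchange for repeated communication.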
Parameter partitioning
- This is set up during model initialization with the `deepspeed.zero.Init()` context manager (`deepspeed/runtime/zero/partition_parameters.py`).
- The actual meat of the work is in `_convert_to_deepspeed_param()` and `partition()`.
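
The notes do not show what `partition()` does internally, so the following is only a rough illustration of the general idea of splitting one parameter across ranks. It assumes a common scheme: flatten the parameter, pad it so its length divides evenly, and give each rank one equal slice. The functions `partition_flat` and `unpartition` are hypothetical names, not DeepSpeed functions, and plain lists stand in for tensors.

```python
import math


def partition_flat(values, world_size, pad_value=0.0):
    """Split a flat parameter into `world_size` equal shards, padding the tail."""
    shard_size = math.ceil(len(values) / world_size)
    padding = [pad_value] * (shard_size * world_size - len(values))
    padded = values + padding
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


def unpartition(shards, numel):
    """Inverse of partition_flat: concatenate the shards and drop the padding."""
    flat = [v for shard in shards for v in shard]
    return flat[:numel]


weights = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = partition_flat(weights, world_size=2)
print(shards)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.0]]
print(unpartition(shards, len(weights)) == weights)  # True
```

In this picture, the slice a rank keeps plays the role of the `ds_tensor` attribute mentioned earlier, though the actual storage layout in DeepSpeed may differ.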
Parameter tracking
- DeepSpeed tracks the status of parameters using the `ZeroParamStatus` enum, which can be `NOT_AVAILABLE`, `AVAILABLE`, or `INFLIGHT`.
- In `deepspeed/runtime/zero/partition_parameters.py`, the work is done through `NoGatherCoalescedHandle`
and