🤖 Harold's Notes

Deepspeed Notes

Jul 29, 2024 · 1 min read

A model is running ZeRO-3 if hasattr(p, "ds_tensor") is True for a parameter p.
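A minimal sketch of that check, written so it runs without torch or DeepSpeed installed. The helper names `is_zero3_param` and `uses_zero3` are made up for illustration, and the `SimpleNamespace` objects are stand-ins for real parameters.

```python
from types import SimpleNamespace


def is_zero3_param(p) -> bool:
    """True if `p` carries the `ds_tensor` attribute described above."""
    return hasattr(p, "ds_tensor")


def uses_zero3(params) -> bool:
    """True if any parameter in the iterable looks ZeRO-3 partitioned."""
    return any(is_zero3_param(p) for p in params)


# Stand-in objects (hypothetical); a real check would iterate over model parameters.
plain = SimpleNamespace(data=[1.0, 2.0])
sharded = SimpleNamespace(data=[], ds_tensor=[1.0])

print(uses_zero3([plain]))
print(uses_zero3([plain, sharded]))
```

With a real model the call would presumably be `uses_zero3(model.parameters())`.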

FP8

Related PR: Add fp8-fused gemm kernel #5764

Sharding and gathering

DeepSpeed’s approach to gathering parameters during the forward pass is different from FSDP’s unit-based approach.

Parameter Tracking:
- DeepSpeed uses a system of parameter tracking and just-in-time gathering. It doesn’t have predefined units like FSDP, but instead tracks each parameter’s usage.
- When an operation needs a parameter, DeepSpeed gathers it on-demand from across the shards.
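The gather-on-demand idea can be sketched as a toy simulation in plain Python. This is not DeepSpeed's code: `ShardedParam` and `run_op` are hypothetical names, the "ranks" are just lists, and list concatenation stands in for the all-gather collective.

```python
class ShardedParam:
    """Toy parameter whose values are split across simulated ranks."""

    def __init__(self, shards):
        self.shards = shards  # one list of values per simulated rank
        self.full = None      # populated only while the parameter is gathered

    def gather(self):
        # Stand-in for an all-gather: concatenate every rank's shard.
        if self.full is None:
            self.full = [v for shard in self.shards for v in shard]
        return self.full

    def release(self):
        # Drop the full copy so only the shards stay resident.
        self.full = None


def run_op(op, param):
    """Gather `param` right before `op` needs it, then free it again."""
    values = param.gather()
    try:
        return op(values)
    finally:
        param.release()


w = ShardedParam([[1.0, 2.0], [3.0, 4.0]])
total = run_op(sum, w)
print(total, w.full)
```

The point of the sketch is the lifetime: the full parameter exists only for the duration of the operation that uses it, and there is no predefined grouping of parameters into units.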

Parameter partitioning

This is set up during model initialization with the deepspeed.zero.Init() context manager, implemented in deepspeed/runtime/zero/partition_parameters.py. Most of the actual work happens in _convert_to_deepspeed_param() and partition().
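To make the idea of partitioning concrete, here is a toy sketch that splits a flattened parameter into one shard per rank. It assumes equal-sized shards with a padded tail, which is an assumption about the layout and not a copy of what partition() does. `partition_flat` is a hypothetical helper.

```python
import math


def partition_flat(values, world_size, pad_value=0.0):
    """Split a flat list into `world_size` equal shards, padding the tail."""
    shard_size = math.ceil(len(values) / world_size)
    padding = [pad_value] * (shard_size * world_size - len(values))
    padded = list(values) + padding
    # Rank r keeps the r-th contiguous slice of the padded buffer.
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


shards = partition_flat([1.0, 2.0, 3.0, 4.0, 5.0], world_size=2)
print(shards)
```

Each rank would then hold only its own slice, which is the kind of per-rank tensor that an attribute like ds_tensor would point at.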

Parameter tracking

DeepSpeed tracks the status of each parameter using the ZeroParamStatus enum, which can be NOT_AVAILABLE, AVAILABLE, or INFLIGHT.
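The three states can be read as a small state machine. The sketch below is a toy model, not DeepSpeed's implementation: `ParamStatus` is a hypothetical mirror of the enum with arbitrary values, and the transition rules (gather starts, gather completes, parameter is released) are an assumption about how the states are used.

```python
from enum import Enum


class ParamStatus(Enum):
    # Hypothetical mirror of ZeroParamStatus; the values here are arbitrary.
    NOT_AVAILABLE = "not_available"  # only the local shard is resident
    INFLIGHT = "inflight"            # a gather has been launched
    AVAILABLE = "available"          # the full parameter is ready to use


class TrackedParam:
    """Toy parameter that only records its status transitions."""

    def __init__(self):
        self.status = ParamStatus.NOT_AVAILABLE

    def start_gather(self):
        # Launch a gather only if the parameter is not already present or in flight.
        if self.status is ParamStatus.NOT_AVAILABLE:
            self.status = ParamStatus.INFLIGHT

    def finish_gather(self):
        if self.status is not ParamStatus.INFLIGHT:
            raise RuntimeError("no gather in flight")
        self.status = ParamStatus.AVAILABLE

    def release(self):
        # Re-partition: drop the full copy and go back to the sharded state.
        self.status = ParamStatus.NOT_AVAILABLE


p = TrackedParam()
history = [p.status.name]
for step in (p.start_gather, p.finish_gather, p.release):
    step()
    history.append(p.status.name)
print(history)
```

Tracking INFLIGHT separately lets a gather be launched early and waited on only when the parameter is actually needed.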
In deepspeed/runtime/zero/partition_parameters.py, the work is done through NoGatherCoalescedHandle and
