- `nanotron.distributed` file
    - wrapper around `torch.distributed`
    - they use the `@cache` decorator for `get_rank`-type calls, e.g. `get_global_rank`
        - the cache gives a speedup of ~4 TFLOPS on a 7B model
    - `get_rank(group)` gives the "local" rank of a process within the given group
    - `get_global_rank(group, group_rank)` gives the global rank given the local rank within a given group
        - Is this correct? I thought rank was unstable, given nodes can fail? Depends on how failure is handled
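    - A minimal sketch of the caching idea (function names follow the notes above; the exact Nanotron signatures may differ):

```python
from functools import cache

import torch.distributed as dist


@cache
def get_rank(group: dist.ProcessGroup) -> int:
    """Local rank of this process within `group`. Cached: group membership
    does not change over the lifetime of a process group, so repeated
    lookups become free dictionary hits."""
    return dist.get_rank(group=group)


@cache
def get_global_rank(group: dist.ProcessGroup, group_rank: int) -> int:
    """Global (world) rank corresponding to `group_rank` inside `group`."""
    return dist.get_global_rank(group, group_rank)
```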
- `nanotron.trainer` file
    - Defines `DistributedTrainer`
    - can log the throughput by setting the env variable `NANOTRON_BENCHMARK=1`
    - `_init_model()`
        - inits RoPE
        - builds the model using `build_model()`
        - `make_ddp = DP > 1 and not (grad_accum_in_fp32 and zero_stage > 0)`
        - `model = DistributedDataParallel(model, process_group=parallel_context.dp_pg, broadcast_buffers=False, bucket_cap_mb=config.model.ddp_bucket_cap_mb)`
            - `bucket_cap_mb` – `DistributedDataParallel` buckets parameters into multiple buckets so that the gradient reduction of each bucket can potentially overlap with backward computation; `bucket_cap_mb` controls the bucket size in megabytes (MB). (default: 25)
            - `broadcast_buffers` (bool) – flag that enables syncing (broadcasting) buffers of the module at the beginning of the `forward` function. (default: `True`)
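    - A rough sketch of the DDP-wrapping decision described above (the helper name `maybe_wrap_ddp` and the loose argument list are mine, not Nanotron's):

```python
from torch.nn.parallel import DistributedDataParallel


def maybe_wrap_ddp(model, parallel_context, config,
                   grad_accum_in_fp32: bool, zero_stage: int):
    """Wrap the model in DDP only when plain data parallelism handles the
    gradient reduction; with fp32 grad accumulation + ZeRO, the reduction
    happens elsewhere, so DDP is skipped."""
    dp_size = parallel_context.dp_pg.size()
    make_ddp = dp_size > 1 and not (grad_accum_in_fp32 and zero_stage > 0)
    if make_ddp:
        model = DistributedDataParallel(
            model,
            process_group=parallel_context.dp_pg,         # reduce grads across DP ranks only
            broadcast_buffers=False,                      # buffers are kept in sync elsewhere
            bucket_cap_mb=config.model.ddp_bucket_cap_mb, # gradient bucket size in MB
        )
    return model
```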
- `nanotron.models` module
    - defines the `NanotronModel` class
        - contains its parallel context, `input_pp_rank` and `output_pp_rank`
    - `build_model()`
        - first gets `model = model_builder()`, e.g. the LLaMA definition; gets all model chunks and defines the pipeline
        - computes a compute cost to balance compute across PP blocks
        - assigns pipeline blocks to a given rank/process according to the computed assignment (see the sketch after this list)
        - sequential assignment ⇒ assumes G-Pipe or 1F1B, doesn't work with interleaved 1F1B
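    - A toy sketch of the contiguous (sequential) assignment idea, not Nanotron's actual code: split an ordered list of per-block costs into `pp_size` consecutive segments of roughly equal total cost.

```python
def assign_blocks_to_pp_ranks(block_costs: list[float], pp_size: int) -> list[int]:
    """Greedy contiguous partition: walk the blocks in order and start a new
    PP rank once the running cost reaches the per-rank target. Returns the
    PP rank assigned to each block."""
    target = sum(block_costs) / pp_size
    assignment, rank, running = [], 0, 0.0
    for cost in block_costs:
        if running >= target and rank < pp_size - 1:
            rank += 1
            running = 0.0
        assignment.append(rank)
        running += cost
    return assignment


# e.g. 8 equal-cost transformer blocks on 4 PP ranks -> [0, 0, 1, 1, 2, 2, 3, 3]
print(assign_blocks_to_pp_ranks([1.0] * 8, pp_size=4))
```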
- `nanotron.helpers` file
    - `_vocab_size_with_padding(orig_vocab_size: int, tp_pg_size: int, make_vocab_size_divisible_by: int)`
        - pads the vocab size so it is divisible by `make_vocab_size_divisible_by * tp_pg_size`
        - pretty important!
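    - A minimal sketch of the padding rule (assuming, as the notes suggest, that the multiple is `make_vocab_size_divisible_by * tp_pg_size`):

```python
def vocab_size_with_padding(orig_vocab_size: int,
                            tp_pg_size: int,
                            make_vocab_size_divisible_by: int) -> int:
    """Round the vocab size up to the next multiple of
    `make_vocab_size_divisible_by * tp_pg_size` so the embedding matrix
    shards evenly across the TP group."""
    multiple = make_vocab_size_divisible_by * tp_pg_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


# e.g. 50257 tokens, TP=4, divisible_by=128 -> 50688
print(vocab_size_with_padding(50257, tp_pg_size=4, make_vocab_size_divisible_by=128))
```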
- `nanotron.parallel` module
    - `context` file
        - Defines `ParallelContext`
            - holds the 3D parallelism process group definitions
            - Only the nccl backend is supported for now. :(
                - For TPUs, you actually need to use XLA (Accelerated Linear Algebra): `torch_xla.core.xla_model.mesh_reduce("loss", loss, np.mean)` instead of `torch.distributed.reduce(loss, op=torch.distributed.ReduceOp.SUM)`
                - The reason is that many backends (except nccl) don't support `reduce_scatter`; to emulate the behaviour, you need to use `AlltoAll` with a `sum()`, which is expensive. (https://github.com/pytorch/pytorch/blob/2b267fa7f28e18ca6ea1de4201d2541a40411457/torch/distributed/nn/functional.py#L317)
                - AMD has their equivalent of `nccl`, called `rccl`, which does support `reduce_scatter`!
        - has a cryptic piece of code to create the 3D parallelism process groups, `_init_parallel_groups()` (see the sketch after this list)
            - rewrote it to be clearer :)))
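    - A simplified sketch of how such 3D process groups are usually carved out of the world; the `(pp, dp, tp)` grid ordering here is an assumption, not necessarily the one Nanotron uses:

```python
import numpy as np
import torch.distributed as dist


def init_parallel_groups(tp_size: int, dp_size: int, pp_size: int):
    """Arrange world ranks into a (pp, dp, tp) grid and create one process
    group per row/column; every rank ends up in exactly one TP, DP and PP group."""
    world_size = dist.get_world_size()
    assert world_size == tp_size * dp_size * pp_size
    rank = dist.get_rank()

    # grid[pp, dp, tp] = global rank, with tp varying fastest
    grid = np.arange(world_size).reshape(pp_size, dp_size, tp_size)

    tp_group = dp_group = pp_group = None
    # note: dist.new_group must be called by *all* ranks for *every* group
    for pp in range(pp_size):
        for dp in range(dp_size):
            g = dist.new_group(ranks=grid[pp, dp, :].tolist())
            if rank in grid[pp, dp, :]:
                tp_group = g
    for pp in range(pp_size):
        for tp in range(tp_size):
            g = dist.new_group(ranks=grid[pp, :, tp].tolist())
            if rank in grid[pp, :, tp]:
                dp_group = g
    for dp in range(dp_size):
        for tp in range(tp_size):
            g = dist.new_group(ranks=grid[:, dp, tp].tolist())
            if rank in grid[:, dp, tp]:
                pp_group = g
    return tp_group, dp_group, pp_group
```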
    - `data_parallel.utils` module
        - e.g. `sync_gradients_across_dp`
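        - A bare-bones version of what syncing gradients across the DP group amounts to (Nanotron's real helper handles buckets, fp32 accumulation, etc.):

```python
import torch
import torch.distributed as dist


def sync_gradients_across_dp(model: torch.nn.Module, dp_pg: dist.ProcessGroup):
    """Average every parameter's gradient across the data-parallel group."""
    dp_size = dist.get_world_size(group=dp_pg)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_pg)
            param.grad /= dp_size
```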
    - `pipeline_parallel` module
        - `engine` file
            - contains the `PipelineEngine` (schedules sketched after this list)
                - we have `AllForwardAllBackwardPipelineEngine` (a.k.a. G-Pipe)
                - we have `OneForwardOneBackwardPipelineEngine` (a.k.a. 1F1B or PipeDream)
            - the `TensorPointer` dataclass
                - dataclass specifying which rank we need to query a tensor from in order to access the data
        - `utils` file
            - defines `get_input_output_pp_ranks(model)`
                - to know which ranks to feed the dataloader to
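        - To make the two schedules concrete, here is a toy generator of the per-stage operation order (a conceptual sketch, not the engines' real interface):

```python
def afab_schedule(num_microbatches: int) -> list[tuple[str, int]]:
    """G-Pipe / all-forward-all-backward: run every forward, then every backward.
    Simple, but activations for all microbatches stay alive at once."""
    fwd = [("fwd", mb) for mb in range(num_microbatches)]
    bwd = [("bwd", mb) for mb in range(num_microbatches)]
    return fwd + bwd


def one_f_one_b_schedule(num_microbatches: int, warmup: int) -> list[tuple[str, int]]:
    """1F1B / PipeDream-flush: after `warmup` forwards (how many depends on the
    stage's depth in the pipeline), alternate one forward with one backward,
    then drain the remaining backwards. Peak activation memory ~ warmup."""
    ops = [("fwd", mb) for mb in range(warmup)]
    for mb in range(warmup, num_microbatches):
        ops.append(("fwd", mb))
        ops.append(("bwd", mb - warmup))
    ops += [("bwd", mb) for mb in range(num_microbatches - warmup, num_microbatches)]
    return ops


print(afab_schedule(4))                   # F0 F1 F2 F3 B0 B1 B2 B3
print(one_f_one_b_schedule(4, warmup=2))  # F0 F1 F2 B0 F3 B1 B2 B3
```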
    - `tensor_parallel` module
        - not clear whether sequence parallelism is actually supported
        - sequence parallel == `TensorParallelLinearMode.REDUCE_SCATTER`?
            - (first sync) is an all-gather operation along the sequence dimension in the forward pass, and a reduce-scatter in the backward pass
            - (second sync) is a reduce-scatter in the forward pass, and an all-gather in the backward pass
        - classic TP is `TensorParallelLinearMode.ALL_REDUCE`
            - (first sync) is an identity (or splitting) in the forward, and an all-reduce in the backward
            - (second sync) is an all-reduce in the forward, where the matrices are aggregated by summing, and an identity (or splitting) in the backward
        - `functional.py` file
            - defines `column_linear`, `row_linear` and their async counterparts (see the sketch after this list)
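        - A compressed sketch of the communication in the classic (ALL_REDUCE) column/row parallel linears; autograd-aware versions would use custom `torch.autograd.Function`s, this only shows the forward pass:

```python
import torch
import torch.distributed as dist


def column_linear_fwd(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """Column parallelism: each TP rank holds a slice of the output features.
    The forward needs no communication; the sharded output is typically fed
    straight into a row-parallel layer."""
    return x @ weight_shard.t()                      # [*, out_features / tp]


def row_linear_fwd(x_shard: torch.Tensor, weight_shard: torch.Tensor,
                   tp_pg: dist.ProcessGroup) -> torch.Tensor:
    """Row parallelism: each TP rank holds a slice of the input features and
    produces a partial sum; the all-reduce is the 'second sync' described above."""
    partial = x_shard @ weight_shard.t()             # partial [*, out_features]
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=tp_pg)
    return partial
```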
    - The parameters files
        - `parameters` file
            - Defines `NanotronParameter`, the base class for all parameters in Nanotron models (inherits from `torch.nn.Parameter`); see the sketch after this list
                - each parameter has metadata (a dict)
                    - attribute_name
                    - tied_parameter info
                    - sharded_parameter info
        - `sharded_parameters` file
            - methods for sharding
                - given a `torch.nn.Parameter`, a process group, and a split config
                - returns a sharded `NanotronParameter`
        - `tied_parameters` file
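        - A hedged sketch of the pattern (a `torch.nn.Parameter` subclass carrying a metadata dict); the class name and attribute names below follow the notes, not the real implementation:

```python
import torch


class MyNanotronParameter(torch.nn.Parameter):
    """Illustrative Parameter subclass that carries Nanotron-style metadata
    (attribute name, tied-parameter info, sharded-parameter info)."""

    def __new__(cls, tensor: torch.Tensor, requires_grad: bool = True):
        param = super().__new__(cls, tensor, requires_grad)
        param._nanotron_metadata = {
            "attribute_name": None,
            "tied_parameter": None,    # e.g. which ranks this param is tied across
            "sharded_parameter": None, # e.g. (process group, slice of the full tensor)
        }
        return param


p = MyNanotronParameter(torch.zeros(4, 4))
p._nanotron_metadata["attribute_name"] = "model.lm_head.weight"
print(isinstance(p, torch.nn.Parameter), p._nanotron_metadata["attribute_name"])
```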
- `nanotron.utils` module
    - Includes the `main_rank_first` context manager
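    - A minimal sketch of a `main_rank_first`-style context manager built from barriers (an assumption about how it works, based only on the name):

```python
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def main_rank_first(group: dist.ProcessGroup):
    """Let rank 0 of `group` run the body first (e.g. to download and cache a
    dataset), then release the other ranks."""
    is_main = dist.get_rank(group) == 0
    if not is_main:
        dist.barrier(group=group)      # wait until the main rank has finished
    try:
        yield
    finally:
        if is_main:
            dist.barrier(group=group)  # unblock the waiting ranks
```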
- `nanotron.config` module
    - Includes all the definitions of args
        - for data, parallelism, model, …
    - the method `get_config_from_file`
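    - A hedged sketch of what a `get_config_from_file`-style helper usually does (load a YAML file and map it onto the arg dataclasses); the dataclass and its fields here are invented for illustration:

```python
from dataclasses import dataclass, fields

import yaml


@dataclass
class ParallelismArgs:  # illustrative only, not Nanotron's real dataclass
    dp: int = 1
    tp: int = 1
    pp: int = 1


def get_config_from_file(path: str) -> ParallelismArgs:
    """Read the YAML config and build the dataclass, ignoring unknown keys."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    known = {f.name for f in fields(ParallelismArgs)}
    section = raw.get("parallelism", {})
    return ParallelismArgs(**{k: v for k, v in section.items() if k in known})
```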
- `nanotron.dataloader` file
    - Includes `clm_process` (causal language modeling preprocessing), `get_datasets` (gets datasets from HF) and `get_train_dataloader()`
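    - A compact sketch of what causal-LM preprocessing typically looks like with HF `datasets` (tokenize, concatenate, chunk into fixed-length sequences); hedged, since the real `clm_process` may differ in details:

```python
from itertools import chain


def clm_process(raw_dataset, tokenizer, sequence_length: int):
    """Tokenize the text column, concatenate everything, and split into
    contiguous blocks of `sequence_length` token ids."""

    def tokenize(examples):
        return tokenizer(examples["text"])

    def group(examples):
        ids = list(chain.from_iterable(examples["input_ids"]))
        total = (len(ids) // sequence_length) * sequence_length
        return {"input_ids": [ids[i:i + sequence_length]
                              for i in range(0, total, sequence_length)]}

    tokenized = raw_dataset.map(tokenize, batched=True,
                                remove_columns=raw_dataset.column_names)
    return tokenized.map(group, batched=True)
```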