- `nanotron.distributed` file
    - wrapper around `torch.distributed`
    - they use the `@cache` decorator for `get_rank`-type calls, e.g. `get_global_rank`
        - the cache gives a speedup of ~4 TFLOPS on a 7B model
    - `get_rank(group)` gives the "local" rank of a process within the given group
    - `get_global_rank(group, group_rank)` gives the global rank given the local rank within a given group
        - Is this correct? I thought rank was unstable, given nodes can fail? Depends on how failure is handled
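    - A minimal sketch of the caching idea (function names follow the notes above; the exact Nanotron signatures may differ):

```python
from functools import cache

import torch.distributed as dist


@cache
def get_rank(group: dist.ProcessGroup) -> int:
    """Local rank of this process within `group`. Cached: group membership
    does not change over the lifetime of a process group, so repeated
    lookups become free dictionary hits."""
    return dist.get_rank(group=group)


@cache
def get_global_rank(group: dist.ProcessGroup, group_rank: int) -> int:
    """Global (world) rank corresponding to `group_rank` inside `group`."""
    return dist.get_global_rank(group, group_rank)
```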
- `nanotron.trainer` file
    - Defines `DistributedTrainer`
    - can log the throughput by setting the env variable `NANOTRON_BENCHMARK=1`
    - `_init_model()`
        - inits RoPE
        - builds the model using `build_model()`
        - `make_ddp = DP > 1 and not (grad_accum_in_fp32 and zero_stage > 0)`
        - `model = DistributedDataParallel(model, process_group=parallel_context.dp_pg, broadcast_buffers=False, bucket_cap_mb=config.model.ddp_bucket_cap_mb)`
            - `bucket_cap_mb` – `DistributedDataParallel` buckets parameters into multiple buckets so that the gradient reduction of each bucket can potentially overlap with backward computation; `bucket_cap_mb` controls the bucket size in megabytes (MB). (default: 25)
            - `broadcast_buffers` (bool) – flag that enables syncing (broadcasting) buffers of the module at the beginning of the `forward` function. (default: `True`)
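    - A rough sketch of the DDP-wrapping decision described above (the helper name `maybe_wrap_ddp` and the loose argument list are mine, not Nanotron's):

```python
from torch.nn.parallel import DistributedDataParallel


def maybe_wrap_ddp(model, parallel_context, config,
                   grad_accum_in_fp32: bool, zero_stage: int):
    """Wrap the model in DDP only when plain data parallelism handles the
    gradient reduction; with fp32 grad accumulation + ZeRO, the reduction
    happens elsewhere, so DDP is skipped."""
    dp_size = parallel_context.dp_pg.size()
    make_ddp = dp_size > 1 and not (grad_accum_in_fp32 and zero_stage > 0)
    if make_ddp:
        model = DistributedDataParallel(
            model,
            process_group=parallel_context.dp_pg,         # reduce grads across DP ranks only
            broadcast_buffers=False,                      # buffers are kept in sync elsewhere
            bucket_cap_mb=config.model.ddp_bucket_cap_mb, # gradient bucket size in MB
        )
    return model
```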
- `nanotron.models` module
    - defines the `NanotronModel` class
        - contains its parallel context, `input_pp_rank` and `output_pp_rank`
    - `build_model()`
        - first gets `model = model_builder()`, e.g. the LLaMA definition; gets all model chunks and defines the pipeline
        - computes a compute cost to balance compute across PP blocks
        - assigns pipeline blocks to a given rank/process according to the computed assignment (see the sketch after this list)
        - sequential assignment ⇒ assumes G-Pipe or 1F1B, doesn't work with interleaved 1F1B
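    - A toy sketch of the contiguous (sequential) assignment idea, not Nanotron's actual code: split an ordered list of per-block costs into `pp_size` consecutive segments of roughly equal total cost.

```python
def assign_blocks_to_pp_ranks(block_costs: list[float], pp_size: int) -> list[int]:
    """Greedy contiguous partition: walk the blocks in order and start a new
    PP rank once the running cost reaches the per-rank target. Returns the
    PP rank assigned to each block."""
    target = sum(block_costs) / pp_size
    assignment, rank, running = [], 0, 0.0
    for cost in block_costs:
        if running >= target and rank < pp_size - 1:
            rank += 1
            running = 0.0
        assignment.append(rank)
        running += cost
    return assignment


# e.g. 8 equal-cost transformer blocks on 4 PP ranks -> [0, 0, 1, 1, 2, 2, 3, 3]
print(assign_blocks_to_pp_ranks([1.0] * 8, pp_size=4))
```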
- `nanotron.helpers` file
    - `_vocab_size_with_padding(orig_vocab_size: int, tp_pg_size: int, make_vocab_size_divisible_by: int)`
        - pads the vocab size so it is divisible by `make_vocab_size_divisible_by * tp_pg_size`
        - pretty important!
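    - A minimal sketch of the padding rule (assuming, as the notes suggest, that the multiple is `make_vocab_size_divisible_by * tp_pg_size`):

```python
def vocab_size_with_padding(orig_vocab_size: int,
                            tp_pg_size: int,
                            make_vocab_size_divisible_by: int) -> int:
    """Round the vocab size up to the next multiple of
    `make_vocab_size_divisible_by * tp_pg_size` so the embedding matrix
    shards evenly across the TP group."""
    multiple = make_vocab_size_divisible_by * tp_pg_size
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


# e.g. 50257 tokens, TP=4, divisible_by=128 -> 50688
print(vocab_size_with_padding(50257, tp_pg_size=4, make_vocab_size_divisible_by=128))
```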
- `nanotron.parallel` module
    - `context` file
        - Defines `ParallelContext`
            - holds the 3D parallelism process group definitions
            - Only the nccl backend is supported for now. :(
                - For TPUs, you actually need to use XLA (Accelerated Linear Algebra): `torch_xla.core.xla_model.mesh_reduce("loss", loss, np.mean)` instead of `torch.distributed.reduce(loss, op=torch.distributed.ReduceOp.SUM)`
                - The reason is that many backends (except nccl) don't support `reduce_scatter`; to emulate the behaviour, you need to use `AlltoAll` with a `sum()`, which is expensive. (https://github.com/pytorch/pytorch/blob/2b267fa7f28e18ca6ea1de4201d2541a40411457/torch/distributed/nn/functional.py#L317)
                - AMD has their equivalent of `nccl`, called `rccl`, which does support `reduce_scatter`!
        - has a cryptic piece of code to create the 3D parallelism process groups, `_init_parallel_groups()` (see the sketch after this list)
            - rewrote it to be clearer :)))
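    - A simplified sketch of how such 3D process groups are usually carved out of the world; the `(pp, dp, tp)` grid ordering here is an assumption, not necessarily the one Nanotron uses:

```python
import numpy as np
import torch.distributed as dist


def init_parallel_groups(tp_size: int, dp_size: int, pp_size: int):
    """Arrange world ranks into a (pp, dp, tp) grid and create one process
    group per row/column; every rank ends up in exactly one TP, DP and PP group."""
    world_size = dist.get_world_size()
    assert world_size == tp_size * dp_size * pp_size
    rank = dist.get_rank()

    # grid[pp, dp, tp] = global rank, with tp varying fastest
    grid = np.arange(world_size).reshape(pp_size, dp_size, tp_size)

    tp_group = dp_group = pp_group = None
    # note: dist.new_group must be called by *all* ranks for *every* group
    for pp in range(pp_size):
        for dp in range(dp_size):
            g = dist.new_group(ranks=grid[pp, dp, :].tolist())
            if rank in grid[pp, dp, :]:
                tp_group = g
    for pp in range(pp_size):
        for tp in range(tp_size):
            g = dist.new_group(ranks=grid[pp, :, tp].tolist())
            if rank in grid[pp, :, tp]:
                dp_group = g
    for dp in range(dp_size):
        for tp in range(tp_size):
            g = dist.new_group(ranks=grid[:, dp, tp].tolist())
            if rank in grid[:, dp, tp]:
                pp_group = g
    return tp_group, dp_group, pp_group
```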
    - `data_parallel.utils` module
        - e.g. `sync_gradients_across_dp`
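        - A bare-bones version of what syncing gradients across the DP group amounts to (Nanotron's real helper handles buckets, fp32 accumulation, etc.):

```python
import torch
import torch.distributed as dist


def sync_gradients_across_dp(model: torch.nn.Module, dp_pg: dist.ProcessGroup):
    """Average every parameter's gradient across the data-parallel group."""
    dp_size = dist.get_world_size(group=dp_pg)
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_pg)
            param.grad /= dp_size
```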
    - `pipeline_parallel` module
        - `engine` file
            - contains the `PipelineEngine` (schedules sketched after this list)
                - we have `AllForwardAllBackwardPipelineEngine` (a.k.a. G-Pipe)
                - we have `OneForwardOneBackwardPipelineEngine` (a.k.a. 1F1B or PipeDream)
            - the `TensorPointer` dataclass
                - dataclass specifying which rank we need to query a tensor from in order to access the data
        - `utils` file
            - defines `get_input_output_pp_ranks(model)`
                - to know which ranks to feed the dataloader to
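        - To make the two schedules concrete, here is a toy generator of the per-stage operation order (a conceptual sketch, not the engines' real interface):

```python
def afab_schedule(num_microbatches: int) -> list[tuple[str, int]]:
    """G-Pipe / all-forward-all-backward: run every forward, then every backward.
    Simple, but activations for all microbatches stay alive at once."""
    fwd = [("fwd", mb) for mb in range(num_microbatches)]
    bwd = [("bwd", mb) for mb in range(num_microbatches)]
    return fwd + bwd


def one_f_one_b_schedule(num_microbatches: int, warmup: int) -> list[tuple[str, int]]:
    """1F1B / PipeDream-flush: after `warmup` forwards (how many depends on the
    stage's depth in the pipeline), alternate one forward with one backward,
    then drain the remaining backwards. Peak activation memory ~ warmup."""
    ops = [("fwd", mb) for mb in range(warmup)]
    for mb in range(warmup, num_microbatches):
        ops.append(("fwd", mb))
        ops.append(("bwd", mb - warmup))
    ops += [("bwd", mb) for mb in range(num_microbatches - warmup, num_microbatches)]
    return ops


print(afab_schedule(4))                   # F0 F1 F2 F3 B0 B1 B2 B3
print(one_f_one_b_schedule(4, warmup=2))  # F0 F1 F2 B0 F3 B1 B2 B3
```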
    - `tensor_parallel` module
        - not clear whether sequence parallelism is actually supported
        - sequence parallel == `TensorParallelLinearMode.REDUCE_SCATTER`?
            - (first sync) is an all-gather operation along the sequence dimension in the forward pass, and a reduce-scatter in the backward pass
            - (second sync) is a reduce-scatter in the forward pass, and an all-gather in the backward pass
        - classic TP is `TensorParallelLinearMode.ALL_REDUCE`
            - (first sync) is an identity (or splitting) in the forward, and an all-reduce in the backward
            - (second sync) is an all-reduce in the forward, where the matrices are aggregated by summing, and an identity (or splitting) in the backward
        - `functional.py` file
            - defines `column_linear`, `row_linear` and their async counterparts (see the sketch after this list)
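        - A compressed sketch of the communication in the classic (ALL_REDUCE) column/row parallel linears; autograd-aware versions would use custom `torch.autograd.Function`s, this only shows the forward pass:

```python
import torch
import torch.distributed as dist


def column_linear_fwd(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """Column parallelism: each TP rank holds a slice of the output features.
    The forward needs no communication; the sharded output is typically fed
    straight into a row-parallel layer."""
    return x @ weight_shard.t()                      # [*, out_features / tp]


def row_linear_fwd(x_shard: torch.Tensor, weight_shard: torch.Tensor,
                   tp_pg: dist.ProcessGroup) -> torch.Tensor:
    """Row parallelism: each TP rank holds a slice of the input features and
    produces a partial sum; the all-reduce is the 'second sync' described above."""
    partial = x_shard @ weight_shard.t()             # partial [*, out_features]
    dist.all_reduce(partial, op=dist.ReduceOp.SUM, group=tp_pg)
    return partial
```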
    - The parameters files
        - `parameters` file
            - Defines `NanotronParameter`, the base class for all parameters in Nanotron models (inherits from `torch.nn.Parameter`); see the sketch after this list
                - each parameter has metadata (a dict)
                    - attribute_name
                    - tied_parameter info
                    - sharded_parameter info
        - `sharded_parameters` file
            - methods for sharding
                - given a `torch.nn.Parameter`, a process group, and a split config
                - returns a sharded `NanotronParameter`
        - `tied_parameters` file
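        - A hedged sketch of the pattern (a `torch.nn.Parameter` subclass carrying a metadata dict); the class name and attribute names below follow the notes, not the real implementation:

```python
import torch


class MyNanotronParameter(torch.nn.Parameter):
    """Illustrative Parameter subclass that carries Nanotron-style metadata
    (attribute name, tied-parameter info, sharded-parameter info)."""

    def __new__(cls, tensor: torch.Tensor, requires_grad: bool = True):
        param = super().__new__(cls, tensor, requires_grad)
        param._nanotron_metadata = {
            "attribute_name": None,
            "tied_parameter": None,    # e.g. which ranks this param is tied across
            "sharded_parameter": None, # e.g. (process group, slice of the full tensor)
        }
        return param


p = MyNanotronParameter(torch.zeros(4, 4))
p._nanotron_metadata["attribute_name"] = "model.lm_head.weight"
print(isinstance(p, torch.nn.Parameter), p._nanotron_metadata["attribute_name"])
```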
- `nanotron.utils` module
    - Includes the `main_rank_first` context manager
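    - A minimal sketch of a `main_rank_first`-style context manager built from barriers (an assumption about how it works, based only on the name):

```python
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def main_rank_first(group: dist.ProcessGroup):
    """Let rank 0 of `group` run the body first (e.g. to download and cache a
    dataset), then release the other ranks."""
    is_main = dist.get_rank(group) == 0
    if not is_main:
        dist.barrier(group=group)      # wait until the main rank has finished
    try:
        yield
    finally:
        if is_main:
            dist.barrier(group=group)  # unblock the waiting ranks
```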
- `nanotron.config` module
    - Includes all the definitions of args
        - for data, parallelism, model, …
    - the method `get_config_from_file`
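    - A hedged sketch of what a `get_config_from_file`-style helper usually does (load a YAML file and map it onto the arg dataclasses); the dataclass and its fields here are invented for illustration:

```python
from dataclasses import dataclass, fields

import yaml


@dataclass
class ParallelismArgs:  # illustrative only, not Nanotron's real dataclass
    dp: int = 1
    tp: int = 1
    pp: int = 1


def get_config_from_file(path: str) -> ParallelismArgs:
    """Read the YAML config and build the dataclass, ignoring unknown keys."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    known = {f.name for f in fields(ParallelismArgs)}
    section = raw.get("parallelism", {})
    return ParallelismArgs(**{k: v for k, v in section.items() if k in known})
```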
- `nanotron.dataloader` file
    - Includes `clm_process` (causal language modeling preprocessing), `get_datasets` (gets datasets from HF) and `get_train_dataloader()`
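    - A compact sketch of what causal-LM preprocessing typically looks like with HF `datasets` (tokenize, concatenate, chunk into fixed-length sequences); hedged, since the real `clm_process` may differ in details:

```python
from itertools import chain


def clm_process(raw_dataset, tokenizer, sequence_length: int):
    """Tokenize the text column, concatenate everything, and split into
    contiguous blocks of `sequence_length` token ids."""

    def tokenize(examples):
        return tokenizer(examples["text"])

    def group(examples):
        ids = list(chain.from_iterable(examples["input_ids"]))
        total = (len(ids) // sequence_length) * sequence_length
        return {"input_ids": [ids[i:i + sequence_length]
                              for i in range(0, total, sequence_length)]}

    tokenized = raw_dataset.map(tokenize, batched=True,
                                remove_columns=raw_dataset.column_names)
    return tokenized.map(group, batched=True)
```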