  • nanotron.distributed file

    • wrapper around torch.distributed
    • they use the @cache decorator for get_rank-type calls (see the sketch below)
      • the get_global_rank cache gives a 4 TFLOPS throughput gain on a 7B model
    • get_rank(group) gives the “local” rank of a process within the given group
    • get_global_rank(group, group_rank) gives the global rank given the local rank within a given group
    • Is this correct? I thought rank was unstable, given nodes can fail? Depends on how failure is handled
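    • a minimal sketch of the cached rank helpers (bodies are illustrative, not nanotron's exact code; note the cache is only safe if group membership stays fixed for the whole job, which is exactly the failure question above):
      ```python
      # Sketch of cached wrappers around torch.distributed rank lookups
      # (requires a recent PyTorch where dist.get_global_rank is available).
      from functools import cache

      import torch.distributed as dist


      @cache
      def get_rank(group: dist.ProcessGroup) -> int:
          """Rank of the current process *within* `group` (its "local" rank)."""
          return dist.get_rank(group=group)


      @cache
      def get_global_rank(group: dist.ProcessGroup, group_rank: int) -> int:
          """Translate a rank local to `group` back into the global (world) rank."""
          return dist.get_global_rank(group, group_rank)


      # e.g. with world_size=8 and a group made of global ranks [4, 5, 6, 7]:
      #   get_rank(group) on global rank 6      -> 2
      #   get_global_rank(group, group_rank=2)  -> 6
      ```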

  • nanotron.trainer file

    • Defines DistributedTrainer
      • can log throughput by setting the env variable NANOTRON_BENCHMARK=1
    • _init_model()
      • init RoPE
      • builds the model using build_model()
      • make_ddp = DP > 1 and not (grad_accum_in_fp32 and zero_stage > 0); sketched below
        • model = DistributedDataParallel(model, process_group=parallel_context.dp_pg, broadcast_buffers=False, bucket_cap_mb=config.model.ddp_bucket_cap_mb)
          • bucket_cap_mb – DistributedDataParallel will bucket parameters into multiple buckets so that gradient reduction of each bucket can potentially overlap with backward computation. bucket_cap_mb controls the bucket size in MegaBytes (MB). (default: 25)
          • broadcast_buffers (bool) – Flag that enables syncing (broadcasting) buffers of the module at beginning of the forward function. (default: True)
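      • roughly, the wrapping logic looks like this (a sketch; the helper name is made up, attribute names follow the bullets above and may differ from the actual code):
        ```python
        # Sketch of the conditional DDP wrap described above.
        from torch.nn.parallel import DistributedDataParallel


        def maybe_wrap_ddp(model, parallel_context, config, grad_accum_in_fp32: bool, zero_stage: int):
            dp_size = parallel_context.dp_pg.size()
            # Skip DDP when fp32 grad accumulation + ZeRO already take care of
            # syncing gradients across DP (per the condition in the notes).
            make_ddp = dp_size > 1 and not (grad_accum_in_fp32 and zero_stage > 0)
            if make_ddp:
                model = DistributedDataParallel(
                    model,
                    process_group=parallel_context.dp_pg,         # reduce grads over the DP group only
                    broadcast_buffers=False,                      # don't re-sync buffers every forward
                    bucket_cap_mb=config.model.ddp_bucket_cap_mb, # gradient bucket size in MB (default 25)
                )
            return model
        ```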
  • nanotron.models module

    • defines the NanotronModel class
      • contains its parallel context
      • input_pp_rank and output_pp_rank
    • build_model()
      • first get model = model_builder(), e.g. the Llama definition
      • gets all model chunks and defines the pipeline
        • estimates a compute cost per block to balance work across the PP ranks
        • assigns pipeline blocks to a given rank/process according to computed assignment
        • sequential assignment ⇒ assumes G-Pipe or 1F1B, doesn’t work with interleaved 1F1B
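      • a minimal sketch of that kind of contiguous, compute-balanced assignment (cost values and the helper name are made up for illustration):
        ```python
        # Sketch: assign pipeline blocks to PP ranks contiguously, balancing a
        # per-block compute cost. Contiguity is what rules out interleaved 1F1B.
        from typing import List


        def assign_blocks_to_pp_ranks(block_costs: List[float], pp_size: int) -> List[int]:
            """Return, for each block, the PP rank it is assigned to (greedy contiguous split)."""
            target_per_rank = sum(block_costs) / pp_size
            assignment, rank, acc = [], 0, 0.0
            for cost in block_costs:
                # Move to the next rank once the current one holds roughly its share of compute.
                if acc >= target_per_rank and rank < pp_size - 1:
                    rank, acc = rank + 1, 0.0
                assignment.append(rank)
                acc += cost
            return assignment


        # e.g. 8 equally expensive decoder layers on 4 PP ranks -> [0, 0, 1, 1, 2, 2, 3, 3]
        print(assign_blocks_to_pp_ranks([1.0] * 8, pp_size=4))
        ```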
  • nanotron.helpers file

  • nanotron.parallel module

    • context file

      • Defines ParallelContext
        • holds the 3D parallelism process groups definitions
      • Only nccl backend is supported for now. :(
      • AMD has its equivalent of nccl, called rccl, which does support reduce_scatter!
      • has a cryptic piece of code to create the 3D parallelism process groups
        • _init_parallel_groups()
        • rewrote it to be clearer :))) (the general idea is sketched below)
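      • the general idea, sketched (the (pp, dp, tp) axis ordering below is an assumption for illustration, not necessarily nanotron's actual rank layout):
        ```python
        # Sketch of building TP/DP/PP process groups from a 3D grid of ranks.
        import numpy as np
        import torch.distributed as dist


        def init_parallel_groups(tp: int, dp: int, pp: int):
            world_size = dist.get_world_size()
            assert world_size == tp * dp * pp
            rank = dist.get_rank()
            grid = np.arange(world_size).reshape(pp, dp, tp)

            tp_pg = dp_pg = pp_pg = None
            # Every process must call new_group() for every group, even ones it isn't in.
            for p in range(pp):
                for d in range(dp):
                    g = dist.new_group(ranks=grid[p, d, :].tolist())  # ranks varying along the TP axis
                    if rank in grid[p, d, :]:
                        tp_pg = g
            for p in range(pp):
                for t in range(tp):
                    g = dist.new_group(ranks=grid[p, :, t].tolist())  # ranks varying along the DP axis
                    if rank in grid[p, :, t]:
                        dp_pg = g
            for d in range(dp):
                for t in range(tp):
                    g = dist.new_group(ranks=grid[:, d, t].tolist())  # ranks varying along the PP axis
                    if rank in grid[:, d, t]:
                        pp_pg = g
            return tp_pg, dp_pg, pp_pg
        ```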
    • data_parallel.utils module

      • e.g. sync_gradients_across_dp
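      • roughly what that helper does (a simplified sketch; the real version likely buckets/overlaps the communication):
        ```python
        # Sketch: average every parameter's gradient over the data-parallel group.
        import torch
        import torch.distributed as dist


        def sync_gradients_across_dp(model: torch.nn.Module, dp_pg: dist.ProcessGroup):
            dp_size = dp_pg.size()
            for param in model.parameters():
                if param.grad is not None:
                    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_pg)
                    param.grad /= dp_size
        ```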
    • pipeline_parallel module

      • engine file
        • contains the PipelineEngine
          • we have AllForwardAllBackwardPipelineEngine (a.k.a. G-Pipe)
          • we have OneForwardOneBackwardPipelineEngine (a.k.a. 1F1B or PipeDream)
        • the TensorPointer dataclass
          • Dataclass specifying which rank we need to query a tensor from in order to access the data
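          • conceptually it is tiny; a sketch (the field name is an assumption):
            ```python
            # Sketch of the TensorPointer idea: instead of a real tensor, a rank holds a
            # "pointer" saying which PP rank owns the tensor, so it can be recv'd from
            # there when needed (e.g. non-input ranks get pointers to input_pp_rank).
            from dataclasses import dataclass


            @dataclass
            class TensorPointer:
                group_rank: int  # rank (within the PP group) that holds the actual tensor
            ```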
      • utils file
        • defines get_input_output_pp_ranks(model)
          • to know which ranks the dataloader needs to feed
    • tensor_parallel module

      • not clear whether sequence parallelism is actually supported
      • sequence parallel == TensorParallelLinearMode.REDUCE_SCATTER ?
        • (first sync) is an all-gather operation along the sequence dimension in the forward pass, and reduce-scatter in the backward pass
        • (second sync) is a reduce-scatter in the forward pass, and all-gather in the backward pass
      • Classic TP is TensorParallelLinearMode.ALL_REDUCE
        • (first sync) is an identity (or splitting) in the forward, and an all-reduce in the backward
        • (second sync) is an all-reduce in the forward, where the partial matrices are summed, and an identity (or splitting) in the backward (see the sketch below)
      • functional.py file
        • defines column_linear, row_linear and their async counterparts
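      • a sketch of the two "first sync" variants above as autograd functions (Megatron-style illustration, assuming the sequence dimension is dim 0; not nanotron's exact functional.py):
        ```python
        # The "second sync" ops are the same collectives with forward/backward swapped.
        import torch
        import torch.distributed as dist


        class _CopyToTPRegion(torch.autograd.Function):
            """Classic TP (ALL_REDUCE mode): identity in forward, all-reduce in backward."""

            @staticmethod
            def forward(ctx, x, group):
                ctx.group = group
                return x

            @staticmethod
            def backward(ctx, grad_output):
                dist.all_reduce(grad_output, group=ctx.group)
                return grad_output, None


        class _GatherAlongSequence(torch.autograd.Function):
            """Sequence parallel (REDUCE_SCATTER mode): all-gather over the sequence
            dimension in forward, reduce-scatter back to the local shard in backward."""

            @staticmethod
            def forward(ctx, x, group):
                ctx.group = group
                world_size = dist.get_world_size(group)
                out = torch.empty((x.shape[0] * world_size, *x.shape[1:]), dtype=x.dtype, device=x.device)
                dist.all_gather_into_tensor(out, x.contiguous(), group=group)
                return out

            @staticmethod
            def backward(ctx, grad_output):
                world_size = dist.get_world_size(ctx.group)
                grad_input = torch.empty(
                    (grad_output.shape[0] // world_size, *grad_output.shape[1:]),
                    dtype=grad_output.dtype,
                    device=grad_output.device,
                )
                dist.reduce_scatter_tensor(grad_input, grad_output.contiguous(), group=ctx.group)
                return grad_input, None
        ```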
    • The parameters files

      • parameters file
        • Defines NanotronParameter, the base class for all parameters in Nanotron models (inherits from torch.nn.Parameter; sketched below)
          • each parameter has metadata (a dict)
            • attribute_name
            • tied_parameter info
            • sharded_parameter info
      • sharded_parameters file
        • methods for sharding
          • given a torch.nn.Parameter, a process group, and a split config
            • returns a sharded NanotronParameter
      • tied_parameters file
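      • a sketch of the parameter-with-metadata pattern (class name, metadata keys and subclassing details are simplified, not nanotron's exact implementation):
        ```python
        # A torch.nn.Parameter subclass carrying a metadata dict, in the spirit of
        # NanotronParameter. All key names below are illustrative.
        import torch


        class ParameterWithMetadata(torch.nn.Parameter):
            def __new__(cls, data: torch.Tensor, requires_grad: bool = True):
                param = super().__new__(cls, data=data, requires_grad=requires_grad)
                param._metadata = {}  # e.g. {"tied": {...}, "sharded": {...}}
                return param

            def mark_as_sharded(self, info: dict):
                self._metadata["sharded"] = info

            @property
            def is_sharded(self) -> bool:
                return "sharded" in self._metadata


        # Usage: wrap a tensor and tag it with (illustrative) sharding info.
        p = ParameterWithMetadata(torch.empty(1024, 256))
        p.mark_as_sharded({"pg": None, "global_shape": (4096, 256), "dim": 0})
        print(p.is_sharded)  # True
        ```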
  • nanotron.utils module

    • Includes the main_rank_first context manager
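      • the typical pattern behind such a context manager (a sketch; the real signature/behaviour may differ), useful e.g. so only the main rank downloads or preprocesses a dataset while the others wait:
        ```python
        # Sketch: the main rank of `group` runs the body first, everyone else waits
        # on a barrier, then the remaining ranks run the body.
        from contextlib import contextmanager

        import torch.distributed as dist


        @contextmanager
        def main_rank_first(group: dist.ProcessGroup):
            is_main = dist.get_rank(group) == 0
            if not is_main:
                dist.barrier(group=group)  # wait until the main rank has finished the body
            try:
                yield
            finally:
                if is_main:
                    dist.barrier(group=group)  # release the waiting ranks
        ```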
  • nanotron.config module

    • Includes all definitions of args
      • for data, parallelism, model, etc.
    • the method get_config_from_file
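      • the YAML-to-dataclass idea, sketched (assumes the config is a YAML file; field names are illustrative, not nanotron's full schema):
        ```python
        # Sketch of loading a YAML file into typed config dataclasses.
        from dataclasses import dataclass

        import yaml


        @dataclass
        class ParallelismArgs:
            dp: int = 1
            tp: int = 1
            pp: int = 1


        @dataclass
        class Config:
            parallelism: ParallelismArgs


        def get_config_from_file(path: str) -> Config:
            with open(path) as f:
                raw = yaml.safe_load(f)
            return Config(parallelism=ParallelismArgs(**raw.get("parallelism", {})))
        ```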
  • nanotron.dataloader file

    • Includes
      • clm_process (causal language modeling preprocessing; sketched below)
      • get_datasets (get datasets from Hugging Face)
      • get_train_dataloader()
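      • a sketch of what clm_process-style preprocessing does: tokenize, concatenate, and chunk into fixed-length sequences (uses the HF datasets map API; the exact implementation differs):
        ```python
        # Sketch of causal-LM preprocessing: tokenize a text column, concatenate the
        # token ids, then split them into chunks of `sequence_length`.
        from itertools import chain


        def clm_process_sketch(dataset, tokenizer, text_column: str, sequence_length: int):
            def tokenize_and_group(examples):
                tokenized = tokenizer(examples[text_column])["input_ids"]
                concatenated = list(chain.from_iterable(tokenized))
                total = (len(concatenated) // sequence_length) * sequence_length
                return {
                    "input_ids": [
                        concatenated[i : i + sequence_length]
                        for i in range(0, total, sequence_length)
                    ]
                }

            return dataset.map(
                tokenize_and_group,
                batched=True,
                remove_columns=dataset.column_names,
            )
        ```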