The LLM engine

Setting up the engine

  • We create (sketched below):
    • the model config
    • the tokenizer
    • the scheduler
    • all the ModelRunners (which run the model's forward pass)
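
A minimal sketch of that setup; class and method names other than ModelRunner (Scheduler, hf_config, _start_model_runners) are assumptions for illustration, not a confirmed API:

from transformers import AutoConfig, AutoTokenizer

class LLMEngine:
    def __init__(self, model_path, config):
        # Model config: architecture hyperparameters read from the checkpoint
        config.hf_config = AutoConfig.from_pretrained(model_path)
        # Tokenizer: converts between text and token ids
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        # Scheduler: decides which sequences to batch at each step
        # (assumed name; scheduling is covered elsewhere)
        self.scheduler = Scheduler(config)
        # ModelRunners: one per tensor-parallel rank (assumed helper;
        # see the next section)
        self._start_model_runners(config)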

Splitting the model over multiple ranks

  • Spawning one process for each TP rank
import torch.multiprocessing as mp

self.ps = []      # worker processes, one per extra tensor-parallel rank
self.events = []  # per-worker events used to synchronize the ranks
ctx = mp.get_context("spawn")  # "spawn" is required when children use CUDA
# Ranks 1..tp_size-1 each get their own process; rank 0 stays in the
# main process (see below).
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    # The child process's entry point is the ModelRunner constructor
    # itself, called with that rank's index and event.
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
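
Since the loop starts at 1, rank 0's ModelRunner can live in the main process itself, holding the events so it can signal the workers; a sketch assuming the same constructor signature:

# Rank 0 runs in the main process and keeps the worker events.
self.model_runner = ModelRunner(config, 0, self.events)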
  • Within each process, the ModelRunner sets up its GPU rank (process group + device), as sketched below
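
A sketch of that setup inside ModelRunner.__init__ using standard torch.distributed calls; the backend, address, and port are assumptions:

import torch
import torch.distributed as dist

class ModelRunner:
    def __init__(self, config, rank, event):
        # Join the tensor-parallel process group; any torch.distributed
        # init method works, the TCP address here is just an example.
        dist.init_process_group(
            backend="nccl",
            init_method="tcp://localhost:2333",
            world_size=config.tensor_parallel_size,
            rank=rank,
        )
        # Pin this process to its own GPU: tensor-parallel rank i
        # computes on device cuda:i.
        torch.cuda.set_device(rank)
        self.rank = rank
        self.event = event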

ModelRunner

It contains:

  • the actual model
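
Continuing the ModelRunner.__init__ sketch above, the model can be built directly on the runner's GPU; the torch_dtype field and the load_model helper are assumptions:

# Continuing ModelRunner.__init__ from the sketch above:
torch.set_default_dtype(config.hf_config.torch_dtype)  # e.g. bfloat16
torch.set_default_device("cuda")  # parameters land on this rank's GPU
self.model = Qwen3ForCausalLM(config.hf_config)
load_model(self.model, config.model)  # hypothetical weight-loading helper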

Model implementation

  • Currently defines Qwen3ForCausalLM
    • The linear layers are tailored for tensor parallelism, i.e. ColumnParallelLinear or RowParallelLinear (illustrated below)
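
As an illustration of how those layers slot into the model, the attention projections might look like this; a sketch only, the attribute names are assumptions and the parallel linear classes are defined in the next section:

import torch.nn as nn

class Qwen3Attention(nn.Module):
    def __init__(self, hidden_size, num_heads, num_kv_heads, head_dim):
        super().__init__()
        # Column-parallel: the output features (i.e. the attention heads)
        # are split across ranks, so each rank projects to its own heads.
        self.qkv_proj = ColumnParallelLinear(
            hidden_size, (num_heads + 2 * num_kv_heads) * head_dim)
        # Row-parallel: the input features are split across ranks and the
        # partial outputs are summed with an all-reduce.
        self.o_proj = RowParallelLinear(num_heads * head_dim, hidden_size)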

Tensor-parallel linears