The LLM engine
Setting up the engine
- We create:
  - the model config
  - the tokenizer
  - the scheduler
  - all the `ModelRunner`s (which will take care of the model forward pass)
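A minimal sketch of this setup, with illustrative stand-in classes (none of these are nano-vllm's actual API):

```python
# Toy stand-ins for the pieces the engine creates (illustrative only).

class Config:
    def __init__(self, model="Qwen3", tensor_parallel_size=1):
        self.model = model
        self.tensor_parallel_size = tensor_parallel_size

class Tokenizer:
    def encode(self, text):
        return [ord(c) for c in text]   # toy tokenization

class Scheduler:
    def __init__(self):
        self.waiting = []               # queued requests
    def add(self, request):
        self.waiting.append(request)

class ModelRunner:
    def __init__(self, config, rank):
        self.config, self.rank = config, rank

class LLMEngine:
    def __init__(self, config):
        self.config = config
        self.tokenizer = Tokenizer()
        self.scheduler = Scheduler()
        # rank 0 runs in this process; extra TP ranks are spawned separately
        self.model_runner = ModelRunner(config, rank=0)

engine = LLMEngine(Config())
```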
Splitting the model over multiple ranks
- Spawning one process for each TP rank
```python
import torch.multiprocessing as mp

self.ps = []      # processes that will run the ModelRunner
self.events = []  # events for synchronization between processes
ctx = mp.get_context("spawn")
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
```
- Within each process, `ModelRunner` will set its GPU rank
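Note that the process target is the `ModelRunner` class itself, so "running" the worker means constructing the runner, which then signals its event. A self-contained sketch of that handshake, using threads and `threading.Event` in place of spawned processes so it runs anywhere (names are illustrative):

```python
import threading

class DummyRunner:
    """Stand-in for ModelRunner: the target is the class itself, so
    the worker's whole job here is constructing the runner."""
    def __init__(self, config, rank, event):
        self.rank = rank      # in nano-vllm this would pin a GPU rank
        event.set()           # signal the engine that this runner is up

config = {"tensor_parallel_size": 4}
threads, events = [], []
for rank in range(1, config["tensor_parallel_size"]):  # rank 0 stays local
    event = threading.Event()
    t = threading.Thread(target=DummyRunner, args=(config, rank, event))
    t.start()
    threads.append(t)
    events.append(event)

for t in threads:
    t.join()

print(all(e.is_set() for e in events))  # → True
```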
ModelRunner
It contains:
- the actual model
Model implementation
- Currently defines `Qwen3ForCausalLM`
- The linear layers are tailored to tensor parallelism, i.e. `ColumnParallelLinear` or `RowParallelLinear`
Tensor-parallel linears
- https://github.com/GeeeekExplorer/nano-vllm/blob/2f214426530e2841e7d24c73ee0dfa914d62df56/nanovllm/layers/linear.py#L12
- they define a special function `weight_loader` that allows for seamless weight loading across multiple ranks for TP.
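As a rough illustration of what such a loader must do (a pure-Python sketch with a list-of-lists "matrix"; the function names are illustrative, not nano-vllm's implementation): a `ColumnParallelLinear` keeps only its rank's slice of the output dimension, while a `RowParallelLinear` keeps its slice of the input dimension.

```python
def shard_column_parallel(weight, rank, world_size):
    """Column-parallel: each rank keeps a slice of the output dimension.
    weight is [out_features][in_features] as a list of rows."""
    out_features = len(weight)
    assert out_features % world_size == 0
    shard = out_features // world_size
    return weight[rank * shard:(rank + 1) * shard]

def shard_row_parallel(weight, rank, world_size):
    """Row-parallel: each rank keeps a slice of the input dimension."""
    in_features = len(weight[0])
    assert in_features % world_size == 0
    shard = in_features // world_size
    return [row[rank * shard:(rank + 1) * shard] for row in weight]

# A 4x4 toy weight, sharded over 2 ranks:
W = [[r * 4 + c for c in range(4)] for r in range(4)]
col_shard = shard_column_parallel(W, rank=1, world_size=2)  # rows 2..3
row_shard = shard_row_parallel(W, rank=0, world_size=2)     # cols 0..1
```

The real `weight_loader` does the same slicing on the full checkpoint tensor before copying it into each rank's local parameter.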