The LLM engine
Setting up the engine
- We create:
  - the model config
  - the tokenizer
  - the scheduler
  - all the `ModelRunner`s (which will take care of the model forward pass)
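A minimal sketch of this setup, with illustrative stand-in classes (none of these are nano-vllm's actual API):

```python
# Toy stand-ins for the pieces the engine creates (illustrative only).

class Config:
    def __init__(self, model="Qwen3", tensor_parallel_size=1):
        self.model = model
        self.tensor_parallel_size = tensor_parallel_size

class Tokenizer:
    def encode(self, text):
        return [ord(c) for c in text]   # toy tokenization

class Scheduler:
    def __init__(self):
        self.waiting = []               # queued requests
    def add(self, request):
        self.waiting.append(request)

class ModelRunner:
    def __init__(self, config, rank):
        self.config, self.rank = config, rank

class LLMEngine:
    def __init__(self, config):
        self.config = config
        self.tokenizer = Tokenizer()
        self.scheduler = Scheduler()
        # rank 0 runs in this process; extra TP ranks are spawned separately
        self.model_runner = ModelRunner(config, rank=0)

engine = LLMEngine(Config())
```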
Splitting the model over multiple ranks
- Spawning one process for each TP rank
```python
import torch.multiprocessing as mp

self.ps = []      # processes that will run the ModelRunner
self.events = []  # events for synchronization between processes
ctx = mp.get_context("spawn")
for i in range(1, config.tensor_parallel_size):
    event = ctx.Event()
    process = ctx.Process(target=ModelRunner, args=(config, i, event))
    process.start()
    self.ps.append(process)
    self.events.append(event)
```
- Within each process, `ModelRunner` will set its GPU rank
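Note that the process target is the `ModelRunner` class itself, so "running" the worker means constructing the runner, which then signals its event. A self-contained sketch of that handshake, using threads and `threading.Event` in place of spawned processes so it runs anywhere (names are illustrative):

```python
import threading

class DummyRunner:
    """Stand-in for ModelRunner: the target is the class itself, so
    the worker's whole job here is constructing the runner."""
    def __init__(self, config, rank, event):
        self.rank = rank      # in nano-vllm this would pin a GPU rank
        event.set()           # signal the engine that this runner is up

config = {"tensor_parallel_size": 4}
threads, events = [], []
for rank in range(1, config["tensor_parallel_size"]):  # rank 0 stays local
    event = threading.Event()
    t = threading.Thread(target=DummyRunner, args=(config, rank, event))
    t.start()
    threads.append(t)
    events.append(event)

for t in threads:
    t.join()

print(all(e.is_set() for e in events))  # → True
```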
ModelRunner
It contains:
- the actual model
Model implementation
- Currently defines `Qwen3ForCausalLM`
- The linear layers are tailored to tensor parallelism, i.e. `ColumnParallelLinear` or `RowParallelLinear`
Tensor-parallel linears
- https://github.com/GeeeekExplorer/nano-vllm/blob/2f214426530e2841e7d24c73ee0dfa914d62df56/nanovllm/layers/linear.py#L12
- they define a special function `weight_loader` that allows for seamless weight loading across multiple ranks for TP.
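As a rough illustration of what such a loader must do (a pure-Python sketch with a list-of-lists "matrix"; the function names are illustrative, not nano-vllm's implementation): a `ColumnParallelLinear` keeps only its rank's slice of the output dimension, while a `RowParallelLinear` keeps its slice of the input dimension.

```python
def shard_column_parallel(weight, rank, world_size):
    """Column-parallel: each rank keeps a slice of the output dimension.
    weight is [out_features][in_features] as a list of rows."""
    out_features = len(weight)
    assert out_features % world_size == 0
    shard = out_features // world_size
    return weight[rank * shard:(rank + 1) * shard]

def shard_row_parallel(weight, rank, world_size):
    """Row-parallel: each rank keeps a slice of the input dimension."""
    in_features = len(weight[0])
    assert in_features % world_size == 0
    shard = in_features // world_size
    return [row[rank * shard:(rank + 1) * shard] for row in weight]

# A 4x4 toy weight, sharded over 2 ranks:
W = [[r * 4 + c for c in range(4)] for r in range(4)]
col_shard = shard_column_parallel(W, rank=1, world_size=2)  # rows 2..3
row_shard = shard_row_parallel(W, rank=0, world_size=2)     # cols 0..1
```

The real `weight_loader` does the same slicing on the full checkpoint tensor before copying it into each rank's local parameter.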