- https://pytorch.org/executorch/main/llm/getting-started.html
- https://github.com/pytorch/torchchat/blob/main/runner/run.cpp
- (Prerequisites) Export the model to .pte following torch.export() ⇒ Edge Compilation.
- executorch/extension/llm/runner hosts the library components used in a C++ LLM runner:
- stats.h: collects runtime statistics such as token counts and latency.
- TextPrefiller
- TextDecoderRunner
- With the components above, an actual runner can be built for a model or a series of models; a minimal sketch follows this list.
- An example is in /executorch/examples/models/llama/runner, where a C++ runner is built to run Llama 2, 3, 3.1, and other models that share the same architecture.
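To make the split concrete, here is a minimal sketch of how a prefill/decode loop composes these pieces. The interfaces below are simplified stand-ins, not the real TextPrefiller / TextDecoderRunner signatures (see executorch/extension/llm/runner for those):

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for stats.h: token counts (latency omitted here).
struct Stats {
  int64_t num_prompt_tokens = 0;
  int64_t num_generated_tokens = 0;
};

// Hypothetical functions playing the roles of TextPrefiller and
// TextDecoderRunner; the real classes wrap a Module and differ in detail.
uint64_t prefill(const std::vector<uint64_t>& prompt_tokens) {
  // ... run all prompt tokens through the model in one forward pass ...
  return 0;  // placeholder: first generated token
}

uint64_t decode_step(uint64_t token, int64_t position) {
  // ... run a single token through the model at the given position ...
  return 0;  // placeholder: next generated token
}

void generate(const std::vector<uint64_t>& prompt,
              int64_t max_new_tokens,
              Stats& stats) {
  stats.num_prompt_tokens = static_cast<int64_t>(prompt.size());
  // Prefill: process the whole prompt at once; its output is the first
  // generated token.
  uint64_t token = prefill(prompt);
  int64_t position = static_cast<int64_t>(prompt.size());
  // Decode: feed one token at a time, advancing the position (earlier
  // context is kept in the model's KV cache).
  for (int64_t i = 0; i < max_new_tokens; ++i) {
    ++stats.num_generated_tokens;
    token = decode_step(token, position++);
  }
}
```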
Building the runner
- Create a file called main.cpp with the following contents:
- The Module class handles loading the .pte file and preparing it for execution.
- It has the forward() signature and expects EValue-wrapped tensors as inputs.
```cpp
// Load the exported nanoGPT program, which was generated via the previous steps.
Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);
```
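Loading is deferred until the program is first used; to fail fast on a bad path or corrupt file, it can be loaded eagerly. A minimal sketch continuing main.cpp, assuming Module::load() from executorch/extension/module returns a runtime::Error:

```cpp
// Eagerly load the program so a missing or corrupt .pte fails here rather
// than on the first forward() call.
const auto status = model.load();
if (status != executorch::runtime::Error::Ok) {
  std::cerr << "Failed to load nanogpt.pte" << std::endl;
  return 1;
}
```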
- The ExecuTorch EValue class provides a wrapper around tensors and other ExecuTorch data types.
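A minimal sketch of feeding the model: token ids are packed into a tensor, which forward() wraps as EValues. It assumes the from_blob helper from executorch/extension/tensor, hypothetical token id values, and that the model's first output is the logits tensor:

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <cstdint>
#include <vector>

using executorch::extension::Module;
using executorch::extension::from_blob;

int main() {
  Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);

  // Prompt token ids (hypothetical values); shape [1, seq_len].
  std::vector<int64_t> tokens = {15496, 11, 995};
  auto input = from_blob(
      tokens.data(),
      {1, static_cast<int>(tokens.size())},
      executorch::aten::ScalarType::Long);

  // forward() wraps the tensor in EValues and returns
  // Result<std::vector<EValue>>.
  const auto result = model.forward(input);
  if (result.ok()) {
    const auto logits = result->at(0).toTensor();
    // ... argmax or sample over the last position to pick the next token ...
  }
  return 0;
}
```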