- https://pytorch.org/executorch/main/llm/getting-started.html
- https://github.com/pytorch/torchchat/blob/main/runner/run.cpp
- (Prerequisites) Export the model to .pte following torch.export() ⇒ Edge Compilation.
- executorch/extension/llm/runner hosts the library components used in a C++ LLM runner:
- stats.h: collects runtime statistics such as token counts and latency.
- TextPrefiller
- TextDecoderRunner
- With the components above, an actual runner can be built for a model or a series of models; a minimal sketch follows this list.
- An example is in /executorch/examples/models/llama/runner, where a C++ runner is built to run Llama 2, 3, 3.1, and other models that share the same architecture.
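To make the split concrete, here is a minimal sketch of how a prefill/decode loop composes these pieces. The interfaces below are simplified stand-ins, not the real TextPrefiller / TextDecoderRunner signatures (see executorch/extension/llm/runner for those):

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for stats.h: token counts (latency omitted here).
struct Stats {
  int64_t num_prompt_tokens = 0;
  int64_t num_generated_tokens = 0;
};

// Hypothetical functions playing the roles of TextPrefiller and
// TextDecoderRunner; the real classes wrap a Module and differ in detail.
uint64_t prefill(const std::vector<uint64_t>& prompt_tokens) {
  // ... run all prompt tokens through the model in one forward pass ...
  return 0;  // placeholder: first generated token
}

uint64_t decode_step(uint64_t token, int64_t position) {
  // ... run a single token through the model at the given position ...
  return 0;  // placeholder: next generated token
}

void generate(const std::vector<uint64_t>& prompt,
              int64_t max_new_tokens,
              Stats& stats) {
  stats.num_prompt_tokens = static_cast<int64_t>(prompt.size());
  // Prefill: process the whole prompt at once; its output is the first
  // generated token.
  uint64_t token = prefill(prompt);
  int64_t position = static_cast<int64_t>(prompt.size());
  // Decode: feed one token at a time, advancing the position (earlier
  // context is kept in the model's KV cache).
  for (int64_t i = 0; i < max_new_tokens; ++i) {
    ++stats.num_generated_tokens;
    token = decode_step(token, position++);
  }
}
```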
Building the runner
- Create a file called main.cpp with the following contents:
- The Module class handles loading the .pte file and preparing it for execution.
- It has the forward() signature and expects EValue-wrapped tensors as inputs.
```cpp
// Load the exported nanoGPT program, which was generated via the previous steps.
Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);
```
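Loading is deferred until the program is first used; to fail fast on a bad path or corrupt file, it can be loaded eagerly. A minimal sketch continuing main.cpp, assuming Module::load() from executorch/extension/module returns a runtime::Error:

```cpp
// Eagerly load the program so a missing or corrupt .pte fails here rather
// than on the first forward() call.
const auto status = model.load();
if (status != executorch::runtime::Error::Ok) {
  std::cerr << "Failed to load nanogpt.pte" << std::endl;
  return 1;
}
```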
- The ExecuTorch EValue class provides a wrapper around tensors and other ExecuTorch data types.
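A minimal sketch of feeding the model: token ids are packed into a tensor, which forward() wraps as EValues. It assumes the from_blob helper from executorch/extension/tensor, hypothetical token id values, and that the model's first output is the logits tensor:

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

#include <cstdint>
#include <vector>

using executorch::extension::Module;
using executorch::extension::from_blob;

int main() {
  Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors);

  // Prompt token ids (hypothetical values); shape [1, seq_len].
  std::vector<int64_t> tokens = {15496, 11, 995};
  auto input = from_blob(
      tokens.data(),
      {1, static_cast<int>(tokens.size())},
      executorch::aten::ScalarType::Long);

  // forward() wraps the tensor in EValues and returns
  // Result<std::vector<EValue>>.
  const auto result = model.forward(input);
  if (result.ok()) {
    const auto logits = result->at(0).toTensor();
    // ... argmax or sample over the last position to pick the next token ...
  }
  return 0;
}
```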