Overview of getting the .pte file

  1. Load the model from the checkpoint and params, and set up a LLMEdgeManager with the initial source transforms and dtype conversion. This returns a LLMEdgeManager prior to calling export_to_edge with quantizers.
    1. `builder_exported = _prepare_for_llama_export(args).export()`
  2. Quantize the model via the pt2e flow, export to the Edge dialect, and retrieve a LLMEdgeManager.
    1. `builder_exported_to_edge = builder_exported.pt2e_quantize(quantizers).export_to_edge()`
  3. Partition the model and lower to different backends.
    1. `builder = builder_exported_to_edge.to_backend(partitioners)`
  4. Lower the model to ExecuTorch and get an ExecutorchProgram.
    1. Final passes may be applied here. For example, if there are Linear operations left in the graph, they can be executed with the optimized op_linear rather than materializing a transpose followed by a regular op_mm. This is done using `from executorch.backends.xnnpack._passes.convert_to_linear import ConvertToLinearPass`.
    2. `builder = builder.to_executorch()`
  5. Save the model to a .pte file
    1. `builder.save_to_pte(output_file)`
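
The five steps above can be strung together as one sketch. This is illustrative, not runnable as-is: it assumes the ExecuTorch Llama example environment, and `args`, `quantizers`, `partitioners`, and `output_file` are placeholders that the real export script builds from CLI flags.

```python
# Sketch of the end-to-end .pte export flow described above (assumed environment:
# executorch installed, with the llama example's helpers importable).
from executorch.examples.models.llama.export_llama_lib import (
    _prepare_for_llama_export,
)

# 1. Load checkpoint/params, apply source transforms and dtype conversion,
#    then export; returns a LLMEdgeManager.
builder_exported = _prepare_for_llama_export(args).export()

# 2. pt2e quantization, then lowering to the Edge dialect.
builder_exported_to_edge = builder_exported.pt2e_quantize(
    quantizers
).export_to_edge()

# 3. Partition the graph and delegate subgraphs to backends (order matters).
builder = builder_exported_to_edge.to_backend(partitioners)

# 4. Lower to ExecuTorch; final passes such as ConvertToLinearPass run here.
builder = builder.to_executorch()

# 5. Serialize the program to a .pte file.
builder.save_to_pte(output_file)
```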

Source transformations

  • In examples/models/llama/source_transformation

  • Depending on the backend, there are transformations or passes applied on the IR

  • Transformations related to SDPA, RoPE, RMSNorm, and the KVCache

  • Examples:

    • If you use the QNN backend, there are options to

      • change multi-head attention to multiple single-head attentions, i.e. for loops
      • convert linear layers to conv2d
    • If you use the Core ML backend

      • there is a special SDPA implementation, torch.ops.coreml.sdpa
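
Source transformations of this kind typically follow a module-swap pattern: walk the eager model and replace a reference module with a backend-friendly one before export. The following stdlib-only toy illustrates the pattern; all class and function names here are hypothetical stand-ins for `torch.nn.Module` subclasses and the real `replace_sdpa_with_*` transforms.

```python
# Toy module-swap sketch (hypothetical names, not the ExecuTorch API).

class SDPA:
    """Stands in for the reference SDPA module."""
    def __call__(self, q, k, v):
        return "reference-sdpa"

class BackendSDPA:
    """Stands in for a backend-specific SDPA, e.g. torch.ops.coreml.sdpa."""
    def __call__(self, q, k, v):
        return "backend-sdpa"

class Attention:
    def __init__(self):
        self.sdpa = SDPA()

class Model:
    def __init__(self):
        self.layers = [Attention(), Attention()]

def replace_sdpa(model):
    # Walk the model and swap each reference SDPA for the backend variant,
    # mirroring how source transformations rewrite the eager model in place.
    for layer in model.layers:
        if isinstance(layer.sdpa, SDPA):
            layer.sdpa = BackendSDPA()
    return model

model = replace_sdpa(Model())
print(model.layers[0].sdpa(None, None, None))  # backend-sdpa
```

Because the swap happens on the eager model, the exported graph already contains the backend-friendly op, so no IR-level rewrite is needed for it later.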

Lowering to backend

from executorch.extension.llm.export.partitioner_lib import (
get_coreml_partitioner,
get_mps_partitioner,
get_qnn_partitioner,
get_vulkan_partitioner,
get_xnnpack_partitioner,
)
  • Can have a list of partitioners and combine them
    • `builder = builder_exported_to_edge.to_backend(partitioners)`
      • this uses the LLMEdgeManager
    • ordering matters
    • e.g. if we use Vulkan, we first apply the vulkan_partitioner and then the xnnpack_partitioner so that undelegated ops can be accelerated by XNNPACK
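
The ordering rule can be illustrated with a toy model of delegation (this is not the ExecuTorch API): each partitioner claims the ops it supports, and a later partitioner only sees what is still unclaimed, so the preferred backend must come first.

```python
# Toy illustration of partitioner ordering (hypothetical names and op sets).

def to_backend(ops, partitioners):
    # First partitioner to support an op claims it; later ones only get
    # the leftovers, which is why partitioner order matters.
    assignment = {}
    for name, supported in partitioners:
        for op in ops:
            if op not in assignment and op in supported:
                assignment[op] = name
    return assignment

graph_ops = ["sdpa", "linear", "rms_norm", "softmax"]
vulkan = ("vulkan", {"sdpa", "linear"})
xnnpack = ("xnnpack", {"linear", "rms_norm", "softmax"})

# Vulkan first: it claims sdpa and linear; XNNPACK accelerates the rest.
print(to_backend(graph_ops, [vulkan, xnnpack]))
# {'sdpa': 'vulkan', 'linear': 'vulkan', 'rms_norm': 'xnnpack', 'softmax': 'xnnpack'}

# Reversed order: XNNPACK claims linear before Vulkan ever sees it.
print(to_backend(graph_ops, [xnnpack, vulkan]))
# {'linear': 'xnnpack', 'rms_norm': 'xnnpack', 'softmax': 'xnnpack', 'sdpa': 'vulkan'}
```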