Overview of getting the .pte file
- Load the model from the checkpoint and params, and set up an LLMEdgeManager with initial source transforms and dtype conversion. This returns an LLMEdgeManager prior to calling export_to_edge with quantizers.
builder_exported = _prepare_for_llama_export(args).export()
- Quantize the model via the pt2e flow, export it to the Edge dialect, and retrieve an LLMEdgeManager.
builder_exported_to_edge = builder_exported.pt2e_quantize(quantizers).export_to_edge()
- Partition the model and lower to different backends.
builder = builder_exported_to_edge.to_backend(partitioners)
- Lower the model to ExecuTorch and get an ExecutorchProgram.
- Final passes may be applied at this step, e.g. if there are Linear operations left in the graph, you can execute them with the optimized op_linear rather than materializing a transpose followed by a regular op_mm (a sketch follows this list). This is done using
from executorch.backends.xnnpack._passes.convert_to_linear import ConvertToLinearPass
builder = builder.to_executorch()
- Save the model to a .pte file
builder.save_to_pte(output_file)
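Putting the steps above together, a minimal end-to-end sketch. It assumes the helper names from the Llama export script, with `args`, `quantizers`, `partitioners`, and `output_file` prepared by that script, and that each LLMEdgeManager method returns the manager so the calls chain; exact names may differ across versions.

```python
# Minimal sketch of the full export flow; helper names are assumed from the
# Llama export example and may differ across ExecuTorch versions.
builder = (
    _prepare_for_llama_export(args)  # load checkpoint/params, source transforms
    .export()                        # export the model
    .pt2e_quantize(quantizers)       # pt2e quantization (list of quantizers)
    .export_to_edge()                # lower to the Edge dialect
    .to_backend(partitioners)        # delegate partitions to backends
    .to_executorch()                 # final passes, ExecutorchProgram
)
builder.save_to_pte(output_file)     # serialize to a .pte file
```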
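For the final-passes step mentioned in the list, here is a sketch of wiring ConvertToLinearPass up when lowering with the plain EdgeProgramManager API; the LLMEdgeManager does the equivalent internally, and `edge_program_manager` is assumed to hold the already-lowered Edge program.

```python
# Sketch: run ConvertToLinearPass as a final pass during to_executorch().
from executorch.exir import ExecutorchBackendConfig
from executorch.backends.xnnpack._passes.convert_to_linear import ConvertToLinearPass

executorch_program = edge_program_manager.to_executorch(
    ExecutorchBackendConfig(passes=[ConvertToLinearPass()])
)
```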
Source transformations
- Depending on the backend, there are transformations or passes applied on the IR; these are specific to the transformer architecture.
- More general ones are described in Passes or transformation.
- Transformations related to SDPA, RoPE, RMSNorm, and KVCache.
- Examples:
- If you use the QNN backend, there are options to (see the sketch after this list):
  - change multi-head attention into multiple single-head attentions, i.e. for loops over heads
  - convert linear layers to conv2d
- If you use Core ML, there is a special SDPA implementation:
torch.ops.coreml.sdpa
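To make the linear-to-conv2d idea concrete, here is an illustrative sketch, not the actual QNN pass; `linear_to_conv2d` is a hypothetical helper that swaps an nn.Linear for an equivalent 1x1 nn.Conv2d.

```python
# Illustrative sketch (not the actual QNN pass): an nn.Linear is equivalent
# to a 1x1 nn.Conv2d applied to the input reshaped to (B, C, 1, 1).
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    conv = nn.Conv2d(
        linear.in_features, linear.out_features,
        kernel_size=1, bias=linear.bias is not None,
    )
    # (out, in) linear weight becomes an (out, in, 1, 1) conv kernel.
    conv.weight.data = linear.weight.data.view(
        linear.out_features, linear.in_features, 1, 1
    )
    if linear.bias is not None:
        conv.bias.data = linear.bias.data
    return conv

# Sanity check: both modules compute the same function.
x = torch.randn(2, 16)
lin = nn.Linear(16, 32)
conv = linear_to_conv2d(lin)
assert torch.allclose(lin(x), conv(x.view(2, 16, 1, 1)).view(2, 32), atol=1e-6)
```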
Lowering to backend
- You can pass a list of partitioners and combine them:
builder = builder_exported_to_edge.to_backend(partitioners)
- This uses the LLMEdgeManager.
- Ordering matters
- e.g. if we use Vulkan, we first apply the vulkan_partitioner and then the xnnpack_partitioner, so that undelegated ops can be accelerated by XNNPACK (as sketched below).
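A sketch of that ordering, assuming the partitioner classes from the executorch Vulkan and XNNPACK backends; the import paths are as I recall them and may differ across versions.

```python
# Sketch: combine partitioners; order matters, since ops the Vulkan
# partitioner does not claim fall through to the XNNPACK partitioner.
# Import paths are assumptions and may differ across ExecuTorch versions.
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

partitioners = [VulkanPartitioner(), XnnpackPartitioner()]
builder = builder_exported_to_edge.to_backend(partitioners)
```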