Overview of getting the .pte file

  1. Load the model from the checkpoint and params, and set up a LLMEdgeManager with the initial source transforms and dtype conversion. This returns a LLMEdgeManager prior to calling export_to_edge with quantizers.
    1. `builder_exported = _prepare_for_llama_export(args).export()`
  2. Quantize the model via the pt2e flow, export to the Edge dialect, and retrieve a LLMEdgeManager.
    1. `builder_exported_to_edge = builder_exported.pt2e_quantize(quantizers).export_to_edge()`
  3. Partition the model and lower to different backends.
    1. `builder = builder_exported_to_edge.to_backend(partitioners)`
  4. Lower the model to ExecuTorch and get an ExecutorchProgram.
    1. Final passes may be applied here. For example, if there are Linear operations left in the graph, they can be executed with the optimized op_linear rather than materializing a transpose followed by a regular op_mm. This is done using `from executorch.backends.xnnpack._passes.convert_to_linear import ConvertToLinearPass`.
    2. `builder = builder.to_executorch()`
  5. Save the model to a .pte file
    1. `builder.save_to_pte(output_file)`
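
The five steps above can be strung together as one sketch. This is illustrative, not runnable as-is: it assumes the ExecuTorch Llama example environment, and `args`, `quantizers`, `partitioners`, and `output_file` are placeholders that the real export script builds from CLI flags.

```python
# Sketch of the end-to-end .pte export flow described above (assumed environment:
# executorch installed, with the llama example's helpers importable).
from executorch.examples.models.llama.export_llama_lib import (
    _prepare_for_llama_export,
)

# 1. Load checkpoint/params, apply source transforms and dtype conversion,
#    then export; returns a LLMEdgeManager.
builder_exported = _prepare_for_llama_export(args).export()

# 2. pt2e quantization, then lowering to the Edge dialect.
builder_exported_to_edge = builder_exported.pt2e_quantize(
    quantizers
).export_to_edge()

# 3. Partition the graph and delegate subgraphs to backends (order matters).
builder = builder_exported_to_edge.to_backend(partitioners)

# 4. Lower to ExecuTorch; final passes such as ConvertToLinearPass run here.
builder = builder.to_executorch()

# 5. Serialize the program to a .pte file.
builder.save_to_pte(output_file)
```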

Source transformations

  • In examples/models/llama/source_transformation

  • Depending on the backend, there are transformations or passes applied on the IR

  • Transformations related to SDPA, RoPE, RMSNorm, and the KVCache

  • Examples:

    • If you use the QNN backend, there are options to

      • change multi-head attention to multiple single-head attentions, i.e. for loops
      • convert linear layers to conv2d
    • If you use the Core ML backend

      • there is a special SDPA implementation, torch.ops.coreml.sdpa
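
Source transformations of this kind typically follow a module-swap pattern: walk the eager model and replace a reference module with a backend-friendly one before export. The following stdlib-only toy illustrates the pattern; all class and function names here are hypothetical stand-ins for `torch.nn.Module` subclasses and the real `replace_sdpa_with_*` transforms.

```python
# Toy module-swap sketch (hypothetical names, not the ExecuTorch API).

class SDPA:
    """Stands in for the reference SDPA module."""
    def __call__(self, q, k, v):
        return "reference-sdpa"

class BackendSDPA:
    """Stands in for a backend-specific SDPA, e.g. torch.ops.coreml.sdpa."""
    def __call__(self, q, k, v):
        return "backend-sdpa"

class Attention:
    def __init__(self):
        self.sdpa = SDPA()

class Model:
    def __init__(self):
        self.layers = [Attention(), Attention()]

def replace_sdpa(model):
    # Walk the model and swap each reference SDPA for the backend variant,
    # mirroring how source transformations rewrite the eager model in place.
    for layer in model.layers:
        if isinstance(layer.sdpa, SDPA):
            layer.sdpa = BackendSDPA()
    return model

model = replace_sdpa(Model())
print(model.layers[0].sdpa(None, None, None))  # backend-sdpa
```

Because the swap happens on the eager model, the exported graph already contains the backend-friendly op, so no IR-level rewrite is needed for it later.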

Lowering to backend

from executorch.extension.llm.export.partitioner_lib import (
get_coreml_partitioner,
get_mps_partitioner,
get_qnn_partitioner,
get_vulkan_partitioner,
get_xnnpack_partitioner,
)
  • Can have a list of partitioners and combine them
    • `builder = builder_exported_to_edge.to_backend(partitioners)`
      • this uses the LLMEdgeManager
    • ordering matters
    • e.g. if we use Vulkan, we first apply the vulkan_partitioner and then the xnnpack_partitioner so that undelegated ops can be accelerated by XNNPACK
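
The ordering rule can be illustrated with a toy model of delegation (this is not the ExecuTorch API): each partitioner claims the ops it supports, and a later partitioner only sees what is still unclaimed, so the preferred backend must come first.

```python
# Toy illustration of partitioner ordering (hypothetical names and op sets).

def to_backend(ops, partitioners):
    # First partitioner to support an op claims it; later ones only get
    # the leftovers, which is why partitioner order matters.
    assignment = {}
    for name, supported in partitioners:
        for op in ops:
            if op not in assignment and op in supported:
                assignment[op] = name
    return assignment

graph_ops = ["sdpa", "linear", "rms_norm", "softmax"]
vulkan = ("vulkan", {"sdpa", "linear"})
xnnpack = ("xnnpack", {"linear", "rms_norm", "softmax"})

# Vulkan first: it claims sdpa and linear; XNNPACK accelerates the rest.
print(to_backend(graph_ops, [vulkan, xnnpack]))
# {'sdpa': 'vulkan', 'linear': 'vulkan', 'rms_norm': 'xnnpack', 'softmax': 'xnnpack'}

# Reversed order: XNNPACK claims linear before Vulkan ever sees it.
print(to_backend(graph_ops, [xnnpack, vulkan]))
# {'linear': 'xnnpack', 'rms_norm': 'xnnpack', 'softmax': 'xnnpack', 'sdpa': 'vulkan'}
```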