How kernels are handled

  • An ExecuTorch program encodes instructions that describe the computation that should be performed by the program.

  • Many of these instructions will correspond to calling a specific ATen operator, for example aten.convolution.

    • One of the core design principles of ExecuTorch is that the signature of an operator is kept separate from its implementation.
    • The runtime therefore does not ship with standard implementations of ATen operators.
    • Users must link against kernel libraries that provide implementations of the operators required by their ExecuTorch program.
  • A kernel library is simply a collection of ATen operator implementations that follow a common theme or design principle.

First-party kernel libraries

Portable Kernel Library

  • https://github.com/pytorch/executorch/tree/main/kernels/portable
  • Optimized for
    • Correctness (straightforward implementations of ATen operators that are strictly consistent with the original implementation of the operator in PyTorch’s ATen library)
    • Readability
    • Portability (just as portable as the ExecuTorch runtime, no external deps or unsanctioned features of C++)
    • Operator coverage (implementation for every operator listed as a Core ATen operator)
  • Example: add.out (see the simplified sketch below for the general style)
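
To give a flavor of this style, here is an illustrative sketch of what a correctness-first, dependency-free add kernel boils down to. This is plain C++ on raw float buffers, not the actual Portable Kernel Library code, which operates on ExecuTorch Tensors and additionally handles dtypes, broadcasting, and error checking.

// Illustrative sketch only (not the actual portable add.out implementation):
// a straightforward, portable C++ loop with no external dependencies.
#include <cstddef>

void add_with_alpha(const float* a, const float* b, float alpha, float* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = a[i] + alpha * b[i];  // mirrors the semantics of aten.add(a, b, alpha=alpha)
  }
}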

Optimized Kernel Library

  • https://github.com/pytorch/executorch/tree/main/kernels/optimized

  • Many operator implementations in the Optimized Kernel Library are inspired by or based on the corresponding implementations in PyTorch’s ATen library, so in many cases one can expect a comparable degree of performance.

  • Generally speaking, operators in the Optimized Kernel Library are optimized in one of two ways:

    1. Using CPU vector intrinsics
    2. Using optimized math libraries, such as SLEEF and OpenBLAS
  • Example: add.out (see the simplified intrinsics sketch below)
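
As a rough illustration of approach 1, here is a sketch of the same element-wise add written with CPU vector intrinsics, assuming an x86 CPU with AVX. This is not the actual library code, which goes through shared vectorization utilities and handles dtypes, broadcasting, and dispatch.

// Illustrative sketch only: element-wise float add using AVX intrinsics,
// processing 8 floats per 256-bit register, with a scalar tail loop.
#include <immintrin.h>
#include <cstddef>

void add_f32_avx(const float* a, const float* b, float* out, size_t n) {
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);  // load 8 unaligned floats
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
  }
  for (; i < n; ++i) {  // scalar tail for the remaining elements
    out[i] = a[i] + b[i];
  }
}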

Quantized Kernel Library

Writing your custom kernel

Constraints

Out-variants only

ExecuTorch only supports out-style operators (see the signature sketch at the end of this list), where:

  • The caller provides the output Tensor or Tensor list in the final position with the name out.

  • The C++ function modifies and returns the same out argument.

    • If the return type in the YAML file is () (which maps to void), the C++ function should still modify out but does not need to return anything.
  • Conventionally, these out operators are named using the pattern <name>.out or <name>.<overload>_out.

  • ExecuTorch only supports operators that return a single Tensor, or the unit type () (which maps to void). It does not support returning any other types, including lists, optionals, tuples, or scalars like bool.

    • Since all output values are returned via an out parameter, ExecuTorch ignores the actual C++ function return value. However, for consistency, functions should always return out when the return type is non-void.
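
For illustration, here is a minimal sketch of the out-variant convention, following the same style as the custom_linear example later in this section. The operator name my_mul_out is hypothetical, and Tensor here is the document's shorthand for the ExecuTorch tensor type (as in the custom_linear example below).

// Hypothetical out-variant kernel: the caller-provided `out` tensor is the
// last argument, is modified in place, and is returned by reference.
#include <executorch/runtime/kernel/kernel_includes.h>

Tensor& my_mul_out(const Tensor& a, const Tensor& b, Tensor& out) {
  // ... compute the element-wise product of a and b into out ...
  return out;  // return `out` itself whenever the declared return type is non-void
}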

Supported argument types

1. Follow the ExecuTorch internal README

2. C++ API for Custom Ops (unclear if it works yet)

Prepare custom kernel implementation

  • Define your custom operator schema for both the functional variant (used in AOT compilation) and the out variant (used by the ExecuTorch runtime). The schema needs to follow the PyTorch ATen convention (see native_functions.yaml). For example:
custom_linear(Tensor weight, Tensor input, Tensor? bias) -> Tensor
custom_linear.out(Tensor weight, Tensor input, Tensor? bias, *, Tensor(a!) out) -> Tensor(a!)
  • Then write your custom kernel according to the schema using ExecuTorch types, and register it with the ExecuTorch runtime via the EXECUTORCH_LIBRARY macro:
// custom_linear.h/custom_linear.cpp
#include <executorch/runtime/kernel/kernel_includes.h>
Tensor& custom_linear_out(const Tensor& weight, const Tensor& input, optional<Tensor> bias, Tensor& out) {
   // calculation
   return out;
}
 
// opset namespace myop
EXECUTORCH_LIBRARY(myop, "custom_linear.out", custom_linear_out);
  • Write a wrapper for the op to show up in PyTorch (separate .cpp file)
// custom_linear_pytorch.cpp
#include "custom_linear.h"
#include <torch/library.h>
 
at::Tensor custom_linear(const at::Tensor& weight, const at::Tensor& input, std::optional<at::Tensor> bias) {
    // allocate the output tensor (the shape here is illustrative; use whatever shape your op produces)
    at::Tensor out = at::empty({weight.size(1), input.size(1)});
    // wrap kernel in custom_linear.cpp into ATen kernel
    WRAP_TO_ATEN(custom_linear_out, 3)(weight, input, bias, out);
    return out;
}
// standard API to register ops into PyTorch
TORCH_LIBRARY(myop, m) {
    m.def("custom_linear(Tensor weight, Tensor input, Tensor(?) bias) -> Tensor", custom_linear);
    m.def("custom_linear.out(Tensor weight, Tensor input, Tensor(?) bias, *, Tensor(a!) out) -> Tensor(a!)", WRAP_TO_ATEN(custom_linear_out, 3));
}

Compiling and linking the custom kernel

  • In the CMakeLists.txt that builds the binary/application, we need to add custom_linear.h/custom_linear.cpp to the binary target.
    • Alternatively, we can build it as a dynamically loaded library (.so or .dylib) and link against that.
#CMakeLists.txt
 
# For target_link_options_shared_lib
include(${EXECUTORCH_ROOT}/build/Utils.cmake)
 
# Add a custom op library
add_library(custom_op_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/custom_linear.cpp)
 
# Include the header
target_include_directories(custom_op_lib PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)
 
# Link ExecuTorch library
target_link_libraries(custom_op_lib PUBLIC executorch)
 
# Define a binary target
add_executable(custom_op_runner main.cpp)
 
# Link this library with --whole-archive. !! IMPORTANT !! This prevents the operator registrations from being stripped by the linker.
target_link_options_shared_lib(custom_op_lib)
 
# Link custom op lib
target_link_libraries(custom_op_runner PUBLIC custom_op_lib)

Using it in PyTorch

  • Link it into the PyTorch runtime:
    • We need to package custom_linear.h, custom_linear.cpp, and custom_linear_pytorch.cpp into a dynamically loaded library (.so or .dylib)
    • Load it into our Python environment:
import torch
torch.ops.load_library("libcustom_linear.so")  # or libcustom_linear.dylib on macOS

# Now we have access to the custom op, backed by the kernel implemented in custom_linear.cpp.
op = torch.ops.myop.custom_linear.default