Introduction: Bridging PyTorch to Mobile Hardware
ExecuTorch 1.0 is not an incremental update—it is a structural replacement for PyTorch Mobile, purpose-built for the compute and memory realities of edge silicon. As the official documentation states, "ExecuTorch is the next-generation machine learning framework for edge devices, designed to bring the PyTorch ecosystem to mobile, embedded, and edge platforms." The engineering trade-offs it encodes reflect hard lessons from production deployments that PyTorch Mobile was never designed to handle.
Two numbers frame the decision to migrate: ExecuTorch 1.0 reduces deployment binary size by up to 40% compared to legacy PyTorch Mobile, and the QNN HTP backend delivers a 3x–5x power efficiency improvement over CPU-based inference. The binary reduction comes from a strict operator kernel selection model at compile time—only the kernels actually consumed by a specific model graph are linked, eliminating the catch-all operator library that bloated PyTorch Mobile packages. The power efficiency gain is structural: Hexagon HTP executes fused integer kernels on a fixed-function datapath that consumes a fraction of the DRAM bandwidth a CPU pipeline requires for equivalent throughput.
The baseline requirement is PyTorch 2.5+. Below that version, the torch.export API lacks the stability guarantees the ExecuTorch AOT pipeline depends on.
| Feature | ExecuTorch 1.0 (QNN Backend) | PyTorch Mobile | ONNX Runtime Mobile |
|---|---|---|---|
| Binary Size (MobileNet baseline) | ~1.2 MB delegate | ~3.5 MB | ~2.8 MB |
| NPU Delegation | Native (HTP via QNN) | None | EP-based (limited) |
| Quantization Support | INT8/INT16/FP16 (HW-gated) | Dynamic only | INT8 |
| AOT Graph Compilation | Yes (torch.export) | No (JIT) | Partial |
| Runtime Init Latency | ~70% lower vs. JIT | Baseline | Moderate |
| Power Efficiency vs CPU | 3x–5x | 1x | 1.2x–1.5x |
Configuring the QNN 2.37.0 Environment
Environment misconfiguration is the most common source of silent build failures in the ExecuTorch QNN pipeline. The toolchain spans host-side compilation (Ubuntu 22.04 LTS, G++ 13+), cross-compilation (Android NDK 26c), and device-side runtime (QNN SDK 2.37.0 Hexagon libraries). Each layer has its own library search paths, and linker conflicts between host and device artifacts will not always produce obvious errors.
QNN SDK 2.37.0 requires Ubuntu 22.04 LTS or newer. Android NDK 26c is the verified minimum version for proper cross-compilation of QNN libraries. Attempting to use NDK 25b or earlier produces __builtin_available resolution failures during the HTP stub compilation step that are difficult to trace back to the NDK version.
Technical Warning: As the QNN SDK documentation specifies, "Ensure QNN_SDK_ROOT is explicitly exported in your bashrc to prevent library linker conflicts between host-side and device-side artifacts." Sourcing it only in a terminal session is insufficient for build systems that spawn subprocesses.
The following script validates the full environment before invoking any ExecuTorch build target:
#!/usr/bin/env bash
# validate_qnn_env.sh
# Purpose: Assert all required env vars and binary versions before AOT compilation.
# Run this before any cmake or buck2 invocation targeting the QNN backend.
set -euo pipefail
# --- QNN SDK Validation ---
: "${QNN_SDK_ROOT:?ERROR: QNN_SDK_ROOT is not set. Export it in ~/.bashrc}"
QNN_LIB="${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so"
if [ ! -f "${QNN_LIB}" ]; then
echo "ERROR: HTP device library not found at ${QNN_LIB}"
echo "Verify QNN SDK 2.37.0 package integrity."
exit 1
fi
echo "QNN HTP library: OK (${QNN_LIB})"
# --- NDK Version Check ---
: "${ANDROID_NDK_ROOT:?ERROR: ANDROID_NDK_ROOT is not set.}"
NDK_SOURCE_PROPS="${ANDROID_NDK_ROOT}/source.properties"
NDK_REV=$(grep "Pkg.Revision" "${NDK_SOURCE_PROPS}" | cut -d'=' -f2 | tr -d ' ')
echo "Android NDK revision: ${NDK_REV}"
# NDK 26c maps to revision 26.3.x
MAJOR=$(echo "${NDK_REV}" | cut -d'.' -f1)
if [ "${MAJOR}" -lt 26 ]; then
echo "ERROR: NDK major version ${MAJOR} is below minimum required (26)."
exit 1
fi
echo "Android NDK version: OK"
# --- G++ Version Check (host-side compilation) ---
GPP_VERSION=$(g++ --version | head -n1 | grep -oP '\d+\.\d+\.\d+' | head -n1)
GPP_MAJOR=$(echo "${GPP_VERSION}" | cut -d'.' -f1)
if [ "${GPP_MAJOR}" -lt 13 ]; then
echo "ERROR: g++ version ${GPP_VERSION} is below minimum required (13.x)."
exit 1
fi
echo "g++ version: OK (${GPP_VERSION})"
# --- Python / PyTorch Baseline ---
python3 -c "
import torch
major, minor = map(int, torch.__version__.split('.')[:2])
assert (major, minor) >= (2, 5), f'PyTorch {torch.__version__} < 2.5 required'
print(f'PyTorch version: OK ({torch.__version__})')
"
echo "--- Environment validation passed ---"
Ensure LD_LIBRARY_PATH on the target Android device also includes the QNN HTP stub path. Missing device-side path configuration produces dlopen failures at runtime that surface as a generic QNNBACKEND_INIT_FAILED error code, not a path error.
Mastering the AOT Compilation Pipeline
The AOT flow converts torch.export graphs directly to QNN context binaries (.serialized_qnn / QNN-binary format), reducing runtime initialization latency by approximately 70% versus JIT-based approaches. That latency reduction is not a free optimization—it is the direct consequence of relocating all graph compilation and kernel selection work from the device's first inference call to the developer's workstation at build time.
The pipeline has three discrete stages: graph capture via torch.export.export(), partitioning via the QnnPartitioner, and lowering via to_edge() + to_executorch(). Each stage can fail independently, and the failure modes differ.
Pro-Tip: Graph capture requires torch.export.export() for full compatibility with the ExecuTorch partitioner. Using torch.jit.trace produces a TorchScript graph, not an ATen-level FX graph, and the QNN partitioner will reject it silently by treating the entire graph as unsupported.
# aot_compile_qnn.py
# Production AOT compilation pipeline following the deeplab_v3 pattern.
# Targets: Snapdragon 8 Gen 2/3/Elite, QNN SDK 2.37.0, ExecuTorch 1.0
import torch
from torch.export import export, ExportedProgram
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.utils.utils import (
generate_qnn_executorch_compiler_spec,
QcomChipset,
)
from executorch.exir import to_edge, EdgeCompileConfig
from executorch.exir.backend.backend_api import to_backend
import executorch.exir as exir
# --- 1. Define model and representative inputs ---
# Using a quantized transformer model; FP32 paths WILL fall back to CPU.
# Ensure model is pre-quantized with torch.ao.quantization before this step.
# weights_only=False is required to unpickle a full nn.Module (PyTorch 2.6+ defaults to weights_only=True).
model = torch.load("llm_quantized_int8.pt", map_location="cpu", weights_only=False).eval()
# Representative inputs must match production shapes exactly.
# Dynamic shapes require explicit Dim specification (see below).
example_inputs = (
torch.zeros(1, 512, dtype=torch.int32), # input_ids
torch.zeros(1, 512, dtype=torch.int32), # attention_mask
)
# Optional: declare dynamic sequence length to avoid shape-specialized recompilation.
from torch.export import Dim
seq_dim = Dim("seq_len", min=1, max=512)
dynamic_shapes = (
{1: seq_dim}, # input_ids: dynamic on dim 1
{1: seq_dim}, # attention_mask: dynamic on dim 1
)
# --- 2. Graph Capture (ATen-level FX graph) ---
exported_program: ExportedProgram = export(
model,
example_inputs,
dynamic_shapes=dynamic_shapes,
)
# --- 3. Lower to Edge IR ---
edge_program = to_edge(
exported_program,
compile_config=EdgeCompileConfig(_check_ir_validity=True),
)
# --- 4. Partition and delegate to QNN HTP ---
# QcomChipset.SNAPDRAGON_8_GEN3 targets the SM8650 HTP architecture.
# Replace with SNAPDRAGON_8_ELITE for SM8750 deployments.
compiler_specs = generate_qnn_executorch_compiler_spec(
soc_model=QcomChipset.SNAPDRAGON_8_GEN3,
backend_options={
"online_prepare": False, # Enforce full AOT compilation; reject runtime fallback.
"dump_intermediate_outputs": False,
},
)
partitioned_program = edge_program.to_backend(
QnnPartitioner(compiler_specs)
)
# --- 5. Lower to ExecuTorch runtime format ---
et_program = partitioned_program.to_executorch(
exir.ExecutorchBackendConfig(
extract_delegate_segments=True, # Separate QNN binary segment for mmap loading.
)
)
# --- 6. Serialize to .pte file ---
output_path = "llm_qnn_snapdragon8gen3.pte"
with open(output_path, "wb") as f:
f.write(et_program.buffer)
print(f"Compiled artifact written to: {output_path}")
Technical Warning: Setting online_prepare=True allows the QNN runtime to finalize graph preparation on-device. This defeats the 70% latency reduction from AOT compilation and reintroduces the possibility of device-specific partitioning failures that are difficult to reproduce on host. Never enable it in production artifacts.
Model Partitioning and Operator Coverage
The partitioner operates on a sub-graph basis. As the ExecuTorch documentation states: "If a sub-graph exceeds HTP memory allocation during partitioning, the ExecuTorch delegate will automatically fall back to CPU execution." This fallback is silent by default—your .pte file will build without error, but execution will be dramatically slower than expected.
The get_unsupported_operators utility returns a list of PyTorch ops currently lacking QNN 2.37.0 parity, typically covering advanced dynamic-shape ops. Run this audit before invoking the full compilation pipeline:
# audit_operator_coverage.py
# Purpose: Identify ops that will force CPU fallback before committing to full compilation.
import torch
from torch.export import export
from executorch.backends.qualcomm.utils.utils import get_unsupported_operators
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.utils.utils import (
generate_qnn_executorch_compiler_spec,
QcomChipset,
)
from executorch.exir import to_edge, EdgeCompileConfig
model = torch.load("llm_quantized_int8.pt", map_location="cpu").eval()
example_inputs = (
torch.zeros(1, 512, dtype=torch.int32),
torch.zeros(1, 512, dtype=torch.int32),
)
exported_program = export(model, example_inputs)
edge_program = to_edge(
exported_program,
compile_config=EdgeCompileConfig(_check_ir_validity=True),
)
compiler_specs = generate_qnn_executorch_compiler_spec(
soc_model=QcomChipset.SNAPDRAGON_8_GEN3,
)
# Returns a dict: {op_name: reason_string}
# A non-empty result means partial CPU fallback WILL occur.
unsupported = get_unsupported_operators(
edge_program.exported_program(),
QnnPartitioner(compiler_specs),
)
if unsupported:
print(f"WARNING: {len(unsupported)} ops will fall back to CPU:")
for op_name, reason in unsupported.items():
print(f" [{op_name}] Reason: {reason}")
else:
print("All operators are HTP-delegatable. Full NPU execution confirmed.")
When this audit returns non-empty results, the remediation path is one of three options: replace the unsupported op with a QNN-compatible equivalent, fuse it into a surrounding supported sub-graph using a custom composite op, or accept the CPU fallback with a quantified performance penalty. Accepting an unknown fallback in production is not an option—measure the latency delta explicitly.
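For fallback ops that also exist on the host CPU, the latency delta can be quantified before any device deployment. The following is a minimal host-side timing sketch; the helper name, layer sizes, and iteration counts are illustrative, not part of any ExecuTorch tooling:

```python
import time
import torch
import torch.nn as nn

def median_latency_ms(model: nn.Module, inputs: tuple, iters: int = 50, warmup: int = 5) -> float:
    """Median wall-clock latency of model(*inputs) in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(*inputs)
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            model(*inputs)
            samples.append((time.perf_counter() - t0) * 1000.0)
    return sorted(samples)[len(samples) // 2]

# Example: compare two variants of the same FFN block (exact vs tanh GELU).
block_exact = nn.Sequential(nn.Linear(256, 1024), nn.GELU(approximate="none"), nn.Linear(1024, 256))
block_tanh = nn.Sequential(nn.Linear(256, 1024), nn.GELU(approximate="tanh"), nn.Linear(1024, 256))
x = (torch.randn(1, 256),)
print(f"exact: {median_latency_ms(block_exact, x):.3f} ms, tanh: {median_latency_ms(block_tanh, x):.3f} ms")
```

Host timings are not a substitute for on-device numbers, but they establish whether a fallback is worth remediating at all before committing to a full compile-deploy-profile cycle.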
Handling QNN Backend Constraints
The HTP tensor processor natively supports 8-bit and 16-bit integer precision. FP16 support exists but is gated to specific Snapdragon 8 series SKUs and is not universally available across Gen 2, Gen 3, and Elite variants. This hardware variance is the primary source of operator support inconsistencies across device families.
Transformer attention mechanisms must be fused into QNN-compatible kernels to avoid CPU fallback. An unfused multi-head attention implementation exposes scaled_dot_product_attention as a composite operation whose internal ops—particularly the dynamic softmax normalization path—may not map to HTP primitives in QNN 2.37.0.
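The distinction is easy to demonstrate on the host: the fused op computes the same result as the decomposed matmul-softmax-matmul path while presenting a single node to the partitioner instead of several. A small sketch (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    # Unfused path: exposes softmax and two matmuls as separate graph nodes,
    # each of which must map to HTP individually.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 32)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 32)
v = torch.randn(1, 8, 16, 32)

# Fused path: a single composite op the partitioner can delegate as one unit.
fused = F.scaled_dot_product_attention(q, k, v)
assert torch.allclose(manual_attention(q, k, v), fused, atol=1e-5)
print("fused and manual attention agree")
```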
| Transformer Operation | HTP Support (QNN 2.37.0) | Notes |
|---|---|---|
| nn.Linear (INT8) | ✅ Full | Primary LLM workload; native HTP kernel |
| nn.LayerNorm | ✅ Full | INT8/INT16 paths supported |
| nn.Embedding (INT8) | ✅ Full | Lookup table on HTP |
| F.scaled_dot_product_attention (fused) | ✅ Partial | Requires explicit fusion; FP32 fallbacks |
| Rotary Position Embedding (RoPE) | ⚠️ Partial | Static shapes only; dynamic shapes → CPU |
| nn.GELU (approximate) | ✅ Full | Polynomial approximation supported |
| nn.GELU (exact / erf path) | ❌ Fallback | erf not in HTP kernel set for QNN 2.37.0 |
| Dynamic torch.where / conditional | ❌ Fallback | Data-dependent control flow unsupported |
| FP32 Matrix Multiply | ⚠️ Limited | Elite SKU FP16 only; Gen 2/3 → CPU |
| torch.topk (dynamic k) | ❌ Fallback | Static k values only on HTP |
Pro-Tip: Replace exact GELU (F.gelu(x, approximate='none')) with the tanh approximation (F.gelu(x, approximate='tanh')) in your model definition before export. This single substitution eliminates the erf operator and keeps the entire FFN sub-graph on HTP.
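That substitution can be automated across a model before export. The sketch below walks the module tree and swaps every exact-erf nn.GELU for the tanh variant; swap_exact_gelu is an illustrative helper, not an ExecuTorch API:

```python
import torch.nn as nn

def swap_exact_gelu(module: nn.Module) -> int:
    """Recursively replace exact-erf GELU modules with the tanh approximation.

    Returns the number of modules swapped, so the caller can verify the
    rewrite actually touched the model before exporting it.
    """
    swapped = 0
    for name, child in module.named_children():
        if isinstance(child, nn.GELU) and child.approximate == "none":
            setattr(module, name, nn.GELU(approximate="tanh"))
            swapped += 1
        else:
            swapped += swap_exact_gelu(child)
    return swapped

model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
n = swap_exact_gelu(model)
print(f"Swapped {n} GELU module(s)")  # Swapped 1 GELU module(s)
```

Note this only rewrites module-form GELU; functional calls to F.gelu inside a custom forward() still need to be edited in the model source.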
Advanced Optimization for Snapdragon 8 Series
The Snapdragon 8 Elite NPU (SM8750) provides double the performance per watt compared to the Snapdragon 8 Gen 3 (SM8650) architecture for LLM token generation. That doubling is not purely from clock frequency—the Elite's HTP architecture increases the Tensor Accelerator Array width and adds dedicated sparsity exploitation hardware that activates automatically when weight tensors meet the structured pruning threshold.
To target the Elite, change QcomChipset.SNAPDRAGON_8_GEN3 to QcomChipset.SNAPDRAGON_8_ELITE in the generate_qnn_executorch_compiler_spec call. The compiler will then emit HTP context binaries that use SM8750-specific kernel dispatch paths. Binaries compiled for the Elite will fail initialization on Gen 3 hardware—maintain separate build targets per chipset family.
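One low-cost way to keep per-chipset artifacts from being confused is to encode the SoC family in the output filename and drive builds from a single table. The mapping below is an illustrative naming convention, not an ExecuTorch API:

```python
# Hypothetical build-matrix helper: the table and function names are
# illustrative conventions, not part of the ExecuTorch toolchain.
SOC_TARGETS = {
    "SM8550": "SNAPDRAGON_8_GEN2",
    "SM8650": "SNAPDRAGON_8_GEN3",
    "SM8750": "SNAPDRAGON_8_ELITE",
}

def artifact_name(model_tag: str, soc: str) -> str:
    """Per-chipset .pte filename, so an Elite binary can never be pushed to Gen 3 hardware by accident."""
    chipset = SOC_TARGETS[soc]
    return f"{model_tag}_qnn_{chipset.lower()}.pte"

print(artifact_name("llm", "SM8650"))  # llm_qnn_snapdragon_8_gen3.pte
```

In CI, iterate over SOC_TARGETS, pass the corresponding QcomChipset value to generate_qnn_executorch_compiler_spec, and write each artifact under its own name.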
The following diagram shows the complete data path from PyTorch graph to Hexagon NPU memory:
flowchart TD
A["PyTorch Model (.pt / .pth)"] --> B["torch.export.export()\nATen FX Graph"]
B --> C["to_edge()\nEdge IR"]
C --> D["QnnPartitioner\nSub-graph Analysis"]
D --> E{"Operator\nSupported?"}
E -- Yes --> F["QNN Delegate Sub-graph\n(HTP-bound ops)"]
E -- No --> G["XNNPACK / CPU Sub-graph\n(fallback ops)"]
F --> H["generate_qnn_executorch_compiler_spec\nQNN Context Binary (.serialized_qnn)"]
H --> I["to_executorch()\n.pte Artifact\nextract_delegate_segments=True"]
G --> I
I --> J["Android Device ADB Push"]
J --> K["ExecuTorch Runtime\nExecutor::load_method()"]
K --> L["QNN HTP Delegate\ndeserialization + HTP init"]
L --> M["Hexagon HTP\nTensor Accelerator Array"]
M --> N["HTP VTCM\n(Vector TCM — L2 Cache)"]
N --> O["DDR LPDDR5X\nmmap-loaded weights"]
K --> P["XNNPACK Runtime\nCPU fallback ops"]
P --> Q["ARM Cortex-X4\nDDR Access"]
Unaligned buffers in HTP memory space cause bus access penalties. Explicitly align all weight buffers to 8-byte boundaries during model preparation. The alignment requirement is structural: Hexagon's memory bus issues 8-byte aligned loads, and unaligned access traps to a slow-path microcode handler that can degrade throughput by 20–40% on memory-bound layers.
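For host-side buffer preparation, the standard over-allocate-and-slice trick yields aligned views without platform-specific allocators. A minimal Python sketch (aligned_buffer is an illustrative helper, not part of any SDK):

```python
import ctypes

def aligned_buffer(size: int, alignment: int = 64) -> memoryview:
    """Over-allocate, then slice from the first aligned offset.

    The returned memoryview keeps the underlying ctypes buffer alive
    via the buffer protocol, so no extra reference management is needed.
    """
    raw = ctypes.create_string_buffer(size + alignment)
    addr = ctypes.addressof(raw)
    offset = (-addr) % alignment
    return memoryview(raw)[offset:offset + size]

buf = aligned_buffer(1 << 20, alignment=64)  # 1 MiB, 64-byte aligned
print(len(buf))  # 1048576
```

The same over-allocate-and-offset discipline applies to any custom C or C++ allocator handed to the runtime.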
Memory Management for Large Language Models
Memory-mapped file loading reduces peak resident memory usage during model loading by 30–50% on Android devices. This range reflects the difference between fully loading a .pte file into RAM versus mapping the delegate segment into the virtual address space and letting the OS page in only the weights accessed during the current inference pass.
ExecuTorch's extract_delegate_segments=True flag (set in the compilation step above) is the prerequisite—it packages the QNN context binary as a separate, page-aligned segment within the .pte file, making it mmap-compatible. Without this flag, the binary is embedded in the flatbuffer header and cannot be memory-mapped independently.
Memory Constraint: Buffer allocation must respect the 8-byte alignment requirement of the Hexagon HTP architecture. Any custom allocator passed to the ExecuTorch runtime must guarantee 8-byte alignment at minimum; 64-byte alignment is preferred for VTCM-resident activations to match cache line granularity.
// memory_config_qnn.cpp
// Configure ExecuTorch runtime with mmap-compatible memory allocation for LLM weights.
// Target: Android, ExecuTorch 1.0, QNN HTP delegate
#include <executorch/runtime/executor/method.h>
#include <executorch/runtime/core/memory_allocator.h>
#include <executorch/extension/memory_allocator/mmap_memory_allocator.h>
#include <executorch/extension/data_loader/mmap_data_loader.h>
using namespace torch::executor;
using namespace torch::executor::util;
// MmapDataLoader maps the .pte file into virtual address space.
// The OS pages in weights on demand, capping resident memory usage.
Result<MmapDataLoader> loader = MmapDataLoader::from(
"/data/local/tmp/llm_qnn_snapdragon8gen3.pte",
MmapDataLoader::MlockConfig::NoMlock // Do NOT mlock: let OS manage page eviction.
);
// HierarchicalAllocator with 8-byte aligned buffers for HTP compatibility.
// Planned memory layout: planned_buffers must be pre-sized from model metadata.
constexpr size_t kMethodAllocatorPoolSize = 32 * 1024U * 1024U; // 32 MB method pool
// Use __attribute__((aligned(64))) for VTCM-destined activation buffers.
alignas(64) static uint8_t method_allocator_pool[kMethodAllocatorPoolSize];
MemoryAllocator method_allocator(
sizeof(method_allocator_pool),
method_allocator_pool
);
On devices where LPDDR5X bandwidth is the bottleneck (common during prefill for long-context LLMs), combine mmap loading with prefetching the KV-cache weight pages using madvise(MADV_SEQUENTIAL) before the first inference call.
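A sketch of that pattern using Python's mmap module (Linux/Android semantics; map_weights_sequential is an illustrative helper, not part of the ExecuTorch API):

```python
import mmap
import os

def map_weights_sequential(path: str) -> mmap.mmap:
    """mmap a weights file read-only and hint sequential access for prefill."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # The mapping remains valid after the fd is closed.
    # MADV_SEQUENTIAL tells the kernel to read ahead aggressively and drop
    # already-consumed pages early, which suits a one-pass prefill scan.
    mm.madvise(mmap.MADV_SEQUENTIAL)
    return mm
```

For weights revisited on every decode step (as opposed to one-pass prefill), MADV_WILLNEED is usually the better hint, since MADV_SEQUENTIAL invites the kernel to evict pages you will need again.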
Debugging and Profiling Inference Performance
Enabling QNN_LOG_LEVEL=DEBUG generates trace logs that identify specific op-latency spikes at the HTP boundary. This environment variable controls the QNN runtime's internal logging verbosity, not ExecuTorch's logging layer—both must be configured independently.
Technical Warning: Verbose logging must only be enabled in development builds. The I/O overhead from writing detailed HTP execution traces to Android's /data/local/tmp on eMMC or UFS storage can inflate measured inference latency by 3–8x, making profiling data non-representative of production performance.
Use these flags together to get actionable profiling data:
# Enable QNN backend verbose logging and ExecuTorch profiling simultaneously.
# Run via adb shell on the target device.
# Set QNN log level before invoking the inference binary.
export QNN_LOG_LEVEL=DEBUG # HTP-level op trace
export QNN_LOG_MASK=0xFFFF # All subsystems: HTP, memory, scheduling
export ET_LOG_LEVEL=DEBUG # ExecuTorch runtime logging
# Invoke inference with profiling output to file.
/data/local/tmp/executorch_runner \
--model_path /data/local/tmp/llm_qnn_snapdragon8gen3.pte \
--input_path /data/local/tmp/input_ids.bin \
--output_path /data/local/tmp/output.bin \
--profile_output /data/local/tmp/profile_trace.json \
--num_inference_iterations 50 \
--warmup_iterations 5
A representative profiling trace output for a transformer decoder layer looks like:
[QNN][DEBUG][HTP] Op: qnn_matmul_int8 | Layer: decoder.0.attn.q_proj | Lat: 0.42ms | Mem: 1.2MB
[QNN][DEBUG][HTP] Op: qnn_elementwise_add | Layer: decoder.0.attn.residual | Lat: 0.08ms | Mem: 0.3MB
[QNN][DEBUG][HTP] Op: qnn_layernorm_int8 | Layer: decoder.0.ln_1 | Lat: 0.11ms | Mem: 0.4MB
[QNN][WARN][HTP] Op: aten.gelu.default | Layer: decoder.0.mlp.act | FALLBACK: CPU | Lat: 3.87ms
[QNN][DEBUG][HTP] Op: qnn_matmul_int8 | Layer: decoder.0.mlp.fc1 | Lat: 0.39ms | Mem: 2.1MB
The FALLBACK: CPU line in the trace above is the diagnostic signal. The aten.gelu.default (exact path) op took 3.87ms on CPU versus sub-millisecond HTP execution times for equivalent-size matmul operations. This directly confirms the erf-path GELU constraint from the operator coverage table. Switching to the tanh approximation would relocate this layer back to HTP and recover approximately 3.5ms per decoder layer—at 32 layers, that is 112ms per forward pass.
To extract only the CPU fallback ops from a long trace programmatically:
# Parse the profiling trace for CPU fallback events.
# Field positions for the trace format above: $3 = op name, $6 = layer name, $NF = latency.
grep "FALLBACK: CPU" /data/local/tmp/profile_trace.json \
| awk '{print $6, $3, $NF}' \
| sort -k3 -rn \
| head -20
# Output: layer name, op name, latency (descending) — prioritize the highest-latency fallbacks first.
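For large traces, a structured parser is less brittle than positional awk fields. This sketch matches the trace format shown above; cpu_fallbacks is an illustrative helper:

```python
import re

# Matches the fallback lines in the trace format shown above.
FALLBACK_RE = re.compile(
    r"Op:\s*(?P<op>\S+)\s*\|\s*Layer:\s*(?P<layer>\S+)\s*\|\s*FALLBACK: CPU\s*\|\s*Lat:\s*(?P<lat>[0-9.]+)ms"
)

def cpu_fallbacks(trace_text: str) -> list:
    """Return (layer, op, latency_ms) tuples for every CPU fallback, slowest first."""
    hits = [(m["layer"], m["op"], float(m["lat"])) for m in FALLBACK_RE.finditer(trace_text)]
    return sorted(hits, key=lambda t: t[2], reverse=True)

sample = "[QNN][WARN][HTP] Op: aten.gelu.default | Layer: decoder.0.mlp.act | FALLBACK: CPU | Lat: 3.87ms"
print(cpu_fallbacks(sample))  # [('decoder.0.mlp.act', 'aten.gelu.default', 3.87)]
```

Pull the trace off-device with adb pull and run the parser on the host, where the I/O cost of analysis cannot contaminate the measurements.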
Conclusion: The Future of High-Performance Edge AI
Moving inference from CPU to Hexagon NPU produces a measured 10x throughput gain for transformer-based token generation on flagship Snapdragon chips. That figure is not theoretical—it reflects the compound effect of eliminated DRAM round-trips for intermediate activations, native INT8 vector execution without scalar conversion overhead, and sustained HTP utilization when operator coverage is complete.
The critical constraint for production deployment is strict QNN SDK versioning. Mismatches between the SDK version used during AOT compilation and the runtime library deployed on-device result in silent kernel panics or unrecoverable runtime failures—not graceful degradation. Pin QNN_SDK_ROOT to 2.37.0 in your CI environment and validate it with the environment script above on every build.
ExecuTorch and the Qualcomm Hexagon HTP backend represent the current performance ceiling for PyTorch-derived models on Android silicon. The gap between a fully HTP-delegated model and one with even a handful of CPU fallbacks is not a percentage—it is an order of magnitude. Every fallback op is a latency tax. Audit your operator coverage before deployment, eliminate GELU exact paths and dynamic-shape ops, pin your SDK versions, and align your buffers. The hardware is capable; the question is whether your compilation pipeline is precise enough to use it fully.
Keywords: ExecuTorch, Qualcomm Hexagon NPU, QNN SDK, AOT Compilation, Model Partitioning, Transformer Optimization, Snapdragon 8 Elite, HTP Backend, Operator Fusion, PyTorch Edge