At a glance: what you need before deploying
At a Glance: Time: 3–6 hours end-to-end · Prereqs: Python 3.10+, PyTorch 2.4.4, coremltools, Hugging Face transformers, or pip install executorch · Hardware: Apple silicon Mac (M1 Max or later for the reference path; A14 or later for iOS ANE targets) · Cost: No cloud compute required — all conversion and inference run locally
Core ML and ExecuTorch are the two practical on-device deployment paths for quantized LLMs on Apple silicon in 2026. Neither is universally correct; the choice turns on your export format, target OS, and operator coverage requirements.
Apple's reference walkthrough demonstrates Llama-3.1-8B-Instruct running locally on a Mac with M1 Max at ~33 tokens/s decoding speed after Core ML conversion and optimization. As Apple's engineering team states directly: "We use a Mac with M1 Max and specifically target the GPU, as the models like the Llama-3.1-8B-Instruct are usually constrained by memory bandwidth, and the GPU offers the best combination of compute FLOPS and memory bandwidth on the device of our interest."
This means the practical bottleneck for on-device LLM deployment is not simply quantization — it is shaping the export path to fit Apple silicon's memory-bandwidth constraints, static-shape requirements, and backend-specific operator coverage. The Apple Neural Engine adds throughput headroom when operators qualify for ANE execution, but the A11's first-generation ANE peaked at 0.6 TFLOPS FP16; Apple's own ANE transformer guidance targets A14-and-later or M1-and-later chips for meaningful transformer acceleration.
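A back-of-envelope calculation makes the bandwidth argument concrete. The sketch below assumes M1 Max's roughly 400 GB/s unified-memory bandwidth and that decoding reads every weight once per token — both simplifications, not figures from Apple's walkthrough:

# Rough decode-throughput ceiling for an 8B-parameter, bandwidth-bound model
PARAMS = 8e9
BANDWIDTH_GB_S = 400  # approximate M1 Max unified-memory bandwidth (assumption)

for label, bytes_per_weight in [("FP16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    weights_gb = PARAMS * bytes_per_weight / 1e9
    ceiling_tok_s = BANDWIDTH_GB_S / weights_gb  # weights read once per decoded token
    print(f"{label}: ~{weights_gb:.0f} GB weights -> ceiling ~{ceiling_tok_s:.0f} tokens/s")

These idealized ceilings come out around 25, 50, and 100 tokens/s for FP16, int8, and int4 weights. The reported ~33 tokens/s sits between the FP16 and int8 ceilings, which is what a workload limited by weight traffic rather than compute looks like.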
The minimum deployment target for the Core ML export path is macOS 13, paired with PyTorch 2.4.4 and coremltools. For iOS deployment with ExecuTorch, iOS 18 is the relevant target for current backend features.
Why Core ML and ExecuTorch are the two paths worth comparing
Core ML is Apple's model format and inference runtime. You convert a PyTorch model once, and the .mlpackage or .mlmodelc artifact ships directly into an Apple app via the CoreML framework — no PyTorch dependency at runtime. ExecuTorch is PyTorch's own edge-deployment pipeline: you export with torch.export, lower to a backend with to_edge_transform_and_lower, and produce a .pte artifact that the ExecuTorch runtime executes.
These are not equivalent paths with interchangeable outcomes. ExecuTorch on Apple platforms actually delegates execution to Core ML or MPS backends, as the official docs confirm: "ExecuTorch also provides backends to leverage Core ML and Metal Performance Shaders (MPS) for hardware-accelerated execution on Apple platforms." The ExecuTorch README also notes it supports Apple, Qualcomm, ARM, MediaTek, Vulkan, and more via 12+ hardware backends — which is its strength for cross-platform teams, and its risk for teams that treat it as plug-and-play without checking backend coverage first. The same README also describes the base runtime footprint as about 50KB before backend libraries and kernel registrations are added.
| Dimension | Core ML | ExecuTorch |
|---|---|---|
| Export path | PyTorch → coremltools conversion → .mlpackage | torch.export → to_edge_transform_and_lower → .pte |
| ANE/GPU access | Direct, via Core ML compiler | Indirect, via Core ML or MPS backend delegate |
| Deployment target | macOS 13+, iOS 16+, native Apple app | iOS 18+, macOS, embedded — via xcframework |
| Runtime dependency | Zero (system framework) | ExecuTorch runtime (~50KB base) + backend |
| Primary failure mode | Static-shape violations, unsupported ops at compile time | Backend lowering failure, operator fallback, quantization mismatch |
The competitive gap in most documentation is that Core ML guides and ExecuTorch guides treat these as independent stories. They are not: ExecuTorch's Apple path converges on Core ML or MPS at the backend level, so understanding the Core ML operator and shape constraints is prerequisite knowledge for either path.
When Core ML is the safer choice for Apple Neural Engine
Core ML gives you direct control over ANE placement without intermediary backend delegation. When you use coremltools to compile and deploy, the Core ML compiler decides — at model compilation time — whether subgraphs route to the ANE, GPU, or CPU. Apple's research confirms that ANE-oriented implementations can deliver "significantly increased throughput and reduced memory consumption" compared with hybrid CPU/GPU execution, as described in the Apple Neural Engine transformers guidance.
The catch is operator eligibility. ANE execution requires that your exported graph uses only ANE-eligible ops, and that input shapes are static and match what the compiler observed at trace time. Apple's LLM deployment example deliberately uses fixed input shapes and a no-KV-cache baseline to keep the exported graph stable and compiler-friendly.
Pro Tip: Before finalizing your Core ML export, audit your model's operator set: run coremltools.convert() with compute_units=ct.ComputeUnit.ALL, then inspect the generated .mlpackage in Xcode's Core ML inspector. Operations that fail ANE placement appear annotated with their fallback device. If key attention or linear layers fall back to CPU, you have a shape or op-eligibility issue — not a quantization issue.
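For an empirical cross-check outside Xcode, you can load the same package with restricted compute units and compare prediction latency. ct.models.MLModel and its compute_units argument are standard coremltools APIs; the package name and 1×512 input shape below are placeholders matching this guide's export.

import time
import numpy as np
import coremltools as ct

def mean_predict_latency(compute_units, runs=5):
    model = ct.models.MLModel(
        "llama-3.1-8b-instruct-baseline.mlpackage",  # placeholder path from this guide
        compute_units=compute_units,
    )
    inputs = {
        "input_ids": np.zeros((1, 512), dtype=np.int32),
        "attention_mask": np.ones((1, 512), dtype=np.int32),
    }
    model.predict(inputs)  # warm-up; the first call includes on-device compilation
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(inputs)
    return (time.perf_counter() - start) / runs

print("CPU_ONLY:", mean_predict_latency(ct.ComputeUnit.CPU_ONLY))
print("ALL     :", mean_predict_latency(ct.ComputeUnit.ALL))

If ALL is not meaningfully faster than CPU_ONLY, the compiler is probably not placing the heavy layers on GPU or ANE, regardless of what the quantization scheme promises.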
Core ML is the default-correct choice when: you target Apple platforms exclusively, you want zero runtime dependency on PyTorch, and you can accept the static-shape constraint at the model level.
When ExecuTorch is the better fit for a PyTorch-first stack
ExecuTorch is fully supported on Apple silicon. The runtime ships as prebuilt .xcframework binary targets for ARM64 devices and simulators, covering both iOS and macOS, with Objective-C, Swift, and C++ integration APIs. The ExecuTorch docs confirm: "ExecuTorch supports both iOS and macOS via Objective-C, Swift, and C++."
The advantage over a pure Core ML path is the export pipeline itself: torch.export plus backend lowering is a single Python workflow that can target Apple, Qualcomm, ARM, MediaTek, or Vulkan backends without rewriting the model twice. For teams maintaining models that ship to Android and iOS, ExecuTorch provides a unified export story.
Watch Out: Backend support is not automatic. ExecuTorch's Apple path may route to Core ML or MPS depending on the operator set your model exposes. If a layer uses a non-delegatable op, the lowering step can silently fall back to the portable CPU kernel rather than failing loudly. Before choosing a quantization scheme, verify that the quantized operator set you intend to use is fully lowerable to your target Apple backend — check the ExecuTorch Core ML backend docs against your model's operator signature. Picking int4 group-wise quantization and discovering that the delegate cannot lower it costs you a full re-export cycle.
Prepare the model for export without breaking static-shape constraints
Both Core ML and ExecuTorch export paths fail the same way when you hand them an LLM with dynamic shapes: the export trace encounters a variable-length sequence and either throws an error or produces a graph that the downstream compiler cannot optimize for fixed-accelerator placement.
Apple's walkthrough addresses this directly: "To make the model exportable we define a thin wrapper on top of the LlamaForCausalLM class." and "This wrapped model uses fixed input shapes and no key-value caching (we go over that in subsequent sections)."
The wrapper exists to make three things explicit at trace time: the batch size, the maximum sequence length, and the attention-mask shape. Without this, PyTorch's tracing infrastructure encounters dynamic conditionals inside the HuggingFace Llama implementation that Core ML cannot represent.
Production Note: The baseline export stage intentionally omits KV cache. This means decoding throughput at this stage is not representative of a production autoregressive decode loop — each forward pass re-encodes the full context rather than appending a single new token. The Apple walkthrough treats the no-cache baseline as a correctness checkpoint before adding cache logic in subsequent optimization stages. Do not benchmark tokens/s from this baseline and compare it against KV-cache-optimized endpoints.
Your export toolchain must match: PyTorch 2.4.4, coremltools, and macOS 13 as the minimum deployment target. Version drift between PyTorch and coremltools is the most common silent-failure mode — coremltools pins its supported torch.jit.trace / torch.export interface to specific PyTorch minor versions.
Choose a quantization scheme that the silicon actually accelerates
Quantization improves LLM inference speed on Apple silicon, but only when the quantized operator maps cleanly to an accelerated path on the target backend. Apple's own foundation models use low-bit palettization for on-device inference — "a critical optimization technique that achieves the necessary memory, power, and performance requirements" — and the 2025 Apple foundation models tech report describes 2-bit quantization-aware training for on-device efficiency. These are not general-purpose recipes: Apple's internal models are purpose-built with quantization baked into training, not applied post-hoc.
For third-party models like Llama-3.1-8B-Instruct, post-training quantization with coremltools or ExecuTorch's quantizer is the practical path. The table below covers the trade-offs on Apple silicon:
| Scheme | ANE Compatibility | GPU Compatibility | Memory Reduction vs FP16 | Throughput Impact | Risk |
|---|---|---|---|---|---|
| FP16 | Partial (op-dependent) | High | 0% | Baseline | Low — most ops supported |
| int8 weight-only | Partial | High | ~50% | Moderate gain | Low — widely supported in coremltools |
| int4 weight-only | Limited | High on M-series | ~75% (weight memory) | Hardware- and backend-dependent | Medium — verify op coverage |
| Mixed precision (int4 weights + FP16 activations) | Varies by backend | Varies by backend | ~75% for weights; activations unchanged | Backend-dependent | Medium-high — verify op coverage |
The benchmarked Apple Core ML path targets the GPU on M1 Max because memory-bandwidth limits dominate decode latency for 8B-parameter models, as shown in the Apple Core ML on-device Llama walkthrough. Choosing int4 because it sounds fast does not guarantee faster inference if the backend routes those ops to the CPU portable kernel — bandwidth savings only materialize when the accelerated path actually handles the quantized matmul.
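As one concrete example of the coremltools post-training path, the sketch below applies int8 weight-only quantization to an already-converted package using coremltools.optimize.coreml. Confirm the exact config classes against your installed coremltools version before relying on them.

import numpy as np
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("llama-3.1-8b-instruct-baseline.mlpackage")

# int8 weight-only: ~50% weight-memory reduction vs FP16, broadly supported
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", dtype=np.int8)
config = cto.OptimizationConfig(global_config=op_config)

quantized = cto.linear_quantize_weights(mlmodel, config=config)
quantized.save("llama-3.1-8b-instruct-w8.mlpackage")

Re-run the placement and throughput checks after quantizing — the point of the table above is that a scheme only pays off when the accelerated path keeps handling the quantized ops.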
Set fixed input shapes and handle attention masks explicitly
The thin wrapper for Core ML export must make input shapes concrete at the Python level before you call torch.jit.trace or torch.export.export. Here is a production-representative wrapper for Llama-3.1-8B-Instruct:
import torch
from transformers import LlamaForCausalLM

class ExportableLlamaWrapper(torch.nn.Module):
    """Wraps LlamaForCausalLM with fixed shapes for coremltools tracing.

    batch_size and seq_len must be concrete integers — no symbolic dims.
    Attention mask is passed explicitly so the Core ML compiler can
    represent it as a fixed-rank input rather than inferring it dynamically.
    """

    def __init__(self, model: LlamaForCausalLM):
        super().__init__()
        self.model = model

    def forward(
        self,
        input_ids: torch.Tensor,       # shape: (batch_size, seq_len)
        attention_mask: torch.Tensor,  # shape: (batch_size, seq_len)
    ) -> torch.Tensor:
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            use_cache=False,  # disabled for baseline export stage
        )
        return outputs.logits  # (batch_size, seq_len, vocab_size)

# Instantiate with concrete shapes
BATCH_SIZE = 1
SEQ_LEN = 512  # fix to your maximum context window for this export

base_model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
)
wrapper = ExportableLlamaWrapper(base_model).eval()

# Dummy inputs with the exact fixed shapes used at trace time
dummy_input_ids = torch.zeros((BATCH_SIZE, SEQ_LEN), dtype=torch.int32)
dummy_attention_mask = torch.ones((BATCH_SIZE, SEQ_LEN), dtype=torch.int32)
The use_cache=False flag is load-bearing: KV cache introduces conditional branching that the baseline static-shape export cannot represent. The attention mask is passed as an explicit input rather than computed inside forward because Core ML needs it as a named, fixed-rank input tensor in the compiled model's interface.
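Before tracing, a quick sanity check that the wrapper produces the expected logits shape — with the exact dummy tensors you will trace with — catches shape mistakes cheaply:

# Shape sanity check using the same fixed-shape dummy inputs as the trace
with torch.no_grad():
    logits = wrapper(dummy_input_ids, dummy_attention_mask)

expected = (BATCH_SIZE, SEQ_LEN, base_model.config.vocab_size)
assert logits.shape == expected, f"got {tuple(logits.shape)}, expected {expected}"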
Export the model with Core ML Tools and verify ANE placement
Step 1: Install dependencies
$ pip install torch==2.4.4 torchvision torchaudio
$ pip install coremltools
$ pip install transformers accelerate
Verify coremltools can find your PyTorch installation before proceeding:
$ python -c "import coremltools as ct; import torch; print(ct.__version__, torch.__version__)"
If the versions are mismatched, coremltools will fail silently during ct.convert() or produce a model that does not compile for your deployment target.
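One way to make that failure loud instead of silent is a guard at the top of the conversion script. The pinned version string is the one this guide assumes; adjust it to whatever your toolchain actually requires.

import coremltools as ct
import torch

EXPECTED_TORCH = "2.4.4"  # pin assumed by this guide — change to match your setup
if not torch.__version__.startswith(EXPECTED_TORCH):
    raise RuntimeError(
        f"torch {torch.__version__} != pinned {EXPECTED_TORCH}; "
        f"coremltools {ct.__version__} may trace or convert incorrectly"
    )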
Step 2: Trace and convert to Core ML
import coremltools as ct
import numpy as np
import torch

# wrapper and dummy inputs defined in the previous section
traced_model = torch.jit.trace(
    wrapper,
    (dummy_input_ids, dummy_attention_mask),
)

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(BATCH_SIZE, SEQ_LEN), dtype=np.int32),
        ct.TensorType(name="attention_mask", shape=(BATCH_SIZE, SEQ_LEN), dtype=np.int32),
    ],
    outputs=[
        ct.TensorType(name="logits", dtype=np.float32),
    ],
    minimum_deployment_target=ct.target.macOS13,
    # ALL lets the Core ML compiler decide GPU vs ANE vs CPU per subgraph
    compute_units=ct.ComputeUnit.ALL,
    convert_to="mlprogram",  # ML Program / .mlpackage format used throughout this guide
)

mlmodel.save("llama-3.1-8b-instruct-baseline.mlpackage")
After saving, open the .mlpackage in Xcode's Core ML inspector to confirm that weight-carrying layers (linear projections, attention) land on GPU or ANE rather than CPU. If they appear on CPU, the operator set or shape violated a constraint — revisit the wrapper and trace inputs.
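Placement is one half of verification; numerical parity against the PyTorch wrapper is the other. This sketch assumes the wrapper and dummy inputs from the export section are still in scope and that the package saved above is on disk:

import numpy as np
import coremltools as ct
import torch

coreml_model = ct.models.MLModel("llama-3.1-8b-instruct-baseline.mlpackage")
coreml_logits = coreml_model.predict({
    "input_ids": dummy_input_ids.numpy().astype(np.int32),
    "attention_mask": dummy_attention_mask.numpy().astype(np.int32),
})["logits"]

with torch.no_grad():
    torch_logits = wrapper(dummy_input_ids, dummy_attention_mask).float().numpy()

# FP16 execution and op fusion mean small differences are expected; large ones are not
print("max abs diff:", np.abs(coreml_logits - torch_logits).max())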
The Apple Core ML on-device Llama walkthrough describes this conversion as the starting point of an iterative optimization process — the baseline export establishes correctness before quantization and cache-optimization stages.
Run the export on Apple silicon and inspect the compiled model
The Apple reference benchmark is specific to hardware and export configuration. "[T]his model can be run locally on a Mac with M1 Max with about ~33 tokens/s" after Core ML conversion and Apple's full optimization stack — not from the baseline no-cache export alone.
| Model | Device | Backend | Quantization | Decoding Speed | Constraint |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | M1 Max Mac | Core ML / GPU | Apple-optimized | ~33 tokens/s | Memory bandwidth |
| Llama-3.1-8B-Instruct | M1 Max Mac | Core ML baseline | FP16, no KV cache | Lower (not reported) | Memory bandwidth + re-encode per step |
The ~33 tokens/s figure is the outcome of Core ML conversion plus Apple's additional optimization stages described in the walkthrough (KV cache integration, quantization, compute-unit tuning). Treat it as a validated upper reference for this hardware class, not a number you hit immediately from the baseline export.
Build the ExecuTorch deployment path from torch.export to .pte
Step 1: Install ExecuTorch
$ pip install executorch
ExecuTorch's modular design means the base install provides the export pipeline but not backend-specific lowering. For Apple Core ML backend support, follow the backend-specific setup in the ExecuTorch iOS/macOS docs.
Step 2: Export, lower, and serialize to .pte
import torch
import executorch.exir as exir

# wrapper defined in the static-shape section — same fixed-shape constraint applies
example_args = (dummy_input_ids, dummy_attention_mask)

# Step 1: Export to ATen dialect
exported_program = torch.export.export(wrapper, example_args)

# Step 2: Lower to edge dialect
edge_program = exir.to_edge_transform_and_lower(exported_program)

# Step 3: Convert to an ExecuTorch program and serialize its buffer
executorch_program = edge_program.to_executorch()
with open("llama-3.1-8b-instruct.pte", "wb") as f:
    f.write(executorch_program.buffer)
The official ExecuTorch export docs describe this sequence precisely: "Export the model using torch.export.export. Optimize the model for the target backend using to_edge_transform_and_lower. Create the .pte file by calling to_executorch and serializing the output."
The .pte artifact is paired with the ExecuTorch runtime at app integration time — the runtime itself is distributed as prebuilt .xcframework binary targets for ARM64 devices and simulators.
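To delegate subgraphs to the Core ML backend rather than leaving everything on the portable CPU kernels, pass a partitioner into the lowering step. The import path below follows the ExecuTorch Core ML backend docs; treat it as a sketch and verify it against your installed ExecuTorch release.

import torch
import executorch.exir as exir
from executorch.backends.apple.coreml.partition import CoreMLPartitioner

exported_program = torch.export.export(wrapper, example_args)

# Delegate everything the Core ML backend can lower; the remainder stays on CPU
edge_program = exir.to_edge_transform_and_lower(
    exported_program,
    partitioner=[CoreMLPartitioner()],
)

with open("llama-3.1-8b-instruct-coreml.pte", "wb") as f:
    f.write(edge_program.to_executorch().buffer)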
Check backend support before you pick ExecuTorch for iOS or macOS
ExecuTorch is supported on Apple silicon, but the quality of that support depends on whether your model's operator set lowers cleanly to the Core ML or MPS backend. The ExecuTorch desktop backends doc lists Core ML as "recommended for Apple Silicon" for macOS targets. The iOS path uses the same backends.
Watch Out: ExecuTorch advertises 12+ hardware backends covering Apple, Qualcomm, ARM, MediaTek, Vulkan, and more. A given quantized model may not lower cleanly to every backend. Unsupported operators during the to_edge_transform_and_lower step can route subgraphs to the portable CPU kernel — this does not raise an error by default, but it does silently remove your hardware acceleration. Always inspect the lowering output to confirm what percentage of ops are delegated versus falling back. For Apple targets, simulator-based testing will not catch this — backend placement only resolves correctly on real Apple silicon hardware.
Simulator builds distribute via the same .xcframework targets as device builds, but runtime behavior can differ because the simulator runs under Rosetta or x86 translation on Intel hosts and lacks the physical ANE and GPU that device builds target.
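One low-tech way to inspect the lowering output, as the Watch Out above suggests, is to count delegate calls in the lowered graph. Delegated subgraphs show up as executorch_call_delegate nodes; treat this as a rough diagnostic rather than an official API.

# Count delegated vs. total call_function nodes after to_edge_transform_and_lower
graph = edge_program.exported_program().graph_module.graph

delegate_calls = sum(
    1 for node in graph.nodes
    if node.op == "call_function" and "executorch_call_delegate" in str(node.target)
)
total_calls = sum(1 for node in graph.nodes if node.op == "call_function")

print(f"{delegate_calls} delegate call(s) out of {total_calls} call_function nodes")
# Zero delegate calls means everything fell back to the portable CPU kernels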
Map runtime footprint and integration boundaries for app teams
Production Note: ExecuTorch's advertised base runtime footprint is 50KB — but this is the base footprint before you link backend libraries, kernels, and operator registrations. For an Apple-platform deployment using the Core ML backend, the practical binary size contribution is higher than 50KB once you account for the Core ML delegate library and any custom kernel bundles. App teams should treat 50KB as the floor for runtime negotiation with their app size budget, not the ceiling. Backend, kernels, and runtime are linked separately, so align package size, backend selection, and operator coverage before finalizing the integration plan.
Verify throughput, latency, and model fit on real devices
Apple's Core ML walkthrough reports ~33 tokens/s on M1 Max as the validated throughput for Llama-3.1-8B-Instruct after full optimization — and explicitly attributes the performance ceiling to memory bandwidth rather than compute: "the models like the Llama-3.1-8B-Instruct are usually constrained by memory bandwidth."
| Metric | Core ML (M1 Max, GPU) | ExecuTorch + Core ML backend | ANE-targeted (A14+/M1+) |
|---|---|---|---|
| Decode throughput | ~33 tokens/s (Apple reference) | Backend-dependent; not independently reported | Operator-dependent — higher with ANE-eligible ops |
| Memory behavior | Memory-bandwidth bottleneck at 8B scale | Same constraint if Core ML backend is used | Reduced memory consumption vs hybrid execution |
| Baseline export | No KV cache, fixed shapes | No KV cache, fixed shapes | Same static-shape constraints apply |
For ExecuTorch deployments using the Core ML backend, throughput should match Core ML figures when operator coverage is complete. When operators fall back to CPU kernels, throughput degrades materially — the gap is not incremental.
Measure the bottlenecks that simulators hide
Watch Out: Three failure modes are specific to Apple silicon deployments and invisible in simulators. First, unusual ops (custom attention variants, non-standard activations, ops from recent PyTorch releases not yet in coremltools) may fall back to CPU silently — benchmark on device and inspect the compiled artifact. Second, simulator runs under different compute constraints than device hardware; memory bandwidth behavior, in particular, does not replicate in the simulator. Third, ExecuTorch's
.xcframework supports both device and simulator targets, but backend selection can differ: the simulator may execute via a different code path than the device, masking lowering failures that only surface on ARM64 hardware. Always profile with Instruments on a real device before treating throughput numbers as validated.
Choose the deployment path that matches your product constraints
- Choose Core ML when you target Apple platforms exclusively, want zero PyTorch runtime dependency in your app, need the most direct path to ANE or GPU acceleration, and can work within static-shape constraints. The Core ML compiler handles backend placement automatically, and the .mlpackage format integrates natively with Xcode and the system framework — no additional runtime packaging required.
- Choose ExecuTorch when your team maintains a single PyTorch-first export pipeline that targets Apple and non-Apple platforms (Android, embedded, Qualcomm), needs C++ or Swift runtime integration with explicit backend control, or wants to align on torch.export as the export primitive for forward compatibility with the PyTorch edge ecosystem. Account for the fact that ExecuTorch's Apple path uses Core ML or MPS backends under the hood — you get the same acceleration, with more pipeline complexity.
- Choose neither and re-evaluate your model shape when your operator set fails ANE placement on both paths. ANE acceleration is not universal — it requires static shapes, ANE-eligible operator coverage, and a chip generation that supports the relevant compute primitives (A14+ or M1+ for transformer-class models).
| Decision Factor | Core ML | ExecuTorch |
|---|---|---|
| Platform scope | Apple only | Apple + Android + embedded |
| Runtime dependency | None (system framework) | ExecuTorch runtime + backends |
| Export pipeline | coremltools conversion | torch.export → to_edge_transform_and_lower |
| ANE/GPU routing | Direct compiler control | Via Core ML or MPS backend delegate |
| Quantization integration | coremltools quantization API | ExecuTorch quantizer + backend validation |
| Operator fallback visibility | Xcode inspector | Manual lowering output inspection |
FAQ for on-device LLM deployment on Apple silicon
Can Core ML use the Apple Neural Engine?
Yes. Core ML's compiler routes eligible subgraphs to the ANE automatically when you specify compute_units=ct.ComputeUnit.ALL. Eligibility depends on operator type, input shape, and chip generation (A14+ or M1+ for transformer-scale workloads). Not all LLM operators route to the ANE — large linear projections and attention often land on GPU on M-series hardware because GPU offers better memory bandwidth at this scale.
Is ExecuTorch supported on Apple silicon?
Yes, on both iOS and macOS. The runtime ships as prebuilt .xcframework binaries for ARM64. ExecuTorch delegates Apple-platform acceleration to Core ML or MPS backends — so ExecuTorch support on Apple silicon is real, but it is mediated by those backends' operator coverage.
What is the difference between Core ML and ExecuTorch?
Core ML is a model format and system inference runtime — convert once, deploy as a first-class Apple artifact. ExecuTorch is a PyTorch-native edge-deployment pipeline that exports a .pte artifact and executes via a lightweight runtime with pluggable backends. On Apple platforms, ExecuTorch's acceleration comes from Core ML or MPS backends, not from a separate hardware pathway.
How do you convert a PyTorch model to Core ML?
Wrap the model to enforce static input shapes and disable dynamic features (KV cache in the baseline stage), trace with torch.jit.trace, then call ct.convert() with your deployment target and compute-unit preference. See the export code in the section above and the Apple on-device Llama walkthrough for the full multi-stage optimization path.
Does quantization improve LLM inference speed on Apple Neural Engine?
Yes, when the quantized operators are accelerated by the backend. Int8 weight-only quantization is broadly safe with coremltools. Apple's internal models use low-bit palettization and 2-bit quantization-aware training. For third-party models, validate that the quantization scheme you choose lowers cleanly to the intended accelerator before measuring throughput — quantization that routes to a CPU fallback reduces speed rather than increasing it.
Pro Tip: Both paths are documented authoritatively by their maintainers. For Core ML, the Apple on-device Llama walkthrough covers the full multi-stage export, quantization, and KV-cache optimization sequence. For ExecuTorch, the export API docs and the iOS/macOS integration guide are the authoritative references. Both documents receive updates with new backend and quantization features — pin to the version matching your PyTorch 2.4.4 and coremltools installation.
Sources and references
- Apple Machine Learning Research: Core ML On-Device Llama — Primary source for Llama-3.1-8B-Instruct Core ML export walkthrough, ~33 tokens/s benchmark on M1 Max, static-shape constraints, and memory-bandwidth analysis
- Apple Machine Learning Research: Deploying Transformers on the Apple Neural Engine — ANE operator eligibility, chip-generation requirements (A14+/M1+), and throughput vs hybrid-execution trade-offs
- Apple Machine Learning Research: Introducing Apple Foundation Models — Low-bit palettization for on-device inference
- Apple Machine Learning Research: Apple Foundation Models Tech Report 2025 — 2-bit quantization-aware training for on-device Apple silicon models
- ExecuTorch GitHub README — 50KB base footprint, 12+ hardware backends, supported platform list
- ExecuTorch Export API Documentation — torch.export.export, to_edge_transform_and_lower, .pte serialization pipeline
- ExecuTorch iOS and macOS Integration Guide — Objective-C/Swift/C++ APIs, Core ML and MPS backend support on Apple platforms
- ExecuTorch Apple Runtime Docs — .xcframework binary targets for ARM64 devices and simulators
- ExecuTorch Desktop Backends — Core ML as recommended Apple Silicon backend
- ExecuTorch Core ML Backend — Operator coverage, lowering behavior, and fallback handling for Apple deployments
Keywords: Core ML, coremltools, ExecuTorch, Apple Neural Engine, Llama-3.1-8B-Instruct, PyTorch 2.4.4, torch.export, to_edge_transform_and_lower, M1 Max, macOS 13, iOS 18, KV cache, static-shape tracing, int4 quantization, Apple silicon

