What Orion shows about Apple Neural Engine programmability
Bottom Line: The Orion paper (arXiv 2603.06728) demonstrates that the Apple Neural Engine can be programmed directly, bypassing Core ML, through the private _ANEClient and _ANECompiler APIs. The system lowers a graph IR through five optimization passes to ANE-native MIL, which _ANECompiler compiles to E5 microcode. It manages tensor exchange via IOSurface-backed zero-copy I/O and uses delta compilation to reduce per-step recompilation from 4,200 ms to 494 ms, yielding a 3.8× training speedup. The cost is a brittle dependency on private APIs, 20 documented MIL IR restrictions, and compile-time weight baking that breaks naive training loops.
The Orion paper is the most detailed public characterization of ANE programmability available as of mid-2026. Its contribution is not a faster inference backend in the conventional sense — it is a systems map of what the ANE actually enforces versus what Core ML politely hides from developers. Orion's runtime architecture exposes compile, eval, IOSurface I/O, program cache, zero-copy tensor exchange, and delta compilation as distinct, operable layers sitting above ANE private APIs and E5 microcode. That layering is exactly what practitioners need to understand before deciding whether to target the ANE directly or stay on the Core ML abstraction.
Why Core ML hides the ANE rather than exposing it directly
Core ML is a deliberate abstraction boundary, not a thin shim. The public Core ML compiler accepts models in .mlpackage format, selects a dispatch target (CPU, GPU, or ANE), and manages the compile-then-dispatch lifecycle without exposing which accelerator ran which op. That design choice protects Apple's ability to change the ANE microarchitecture across silicon generations without breaking developer binaries — but it removes per-op scheduling control, compilation timing visibility, and the ability to cache or delta-patch compiled programs directly.
As the Orion paper states, "The ANE operates on a compile-then-dispatch model. Programs are expressed in Apple's Model Intermediate Language (MIL), compiled to E5 microcode by _ANECompiler, and evaluated via _ANEClient." Core ML uses this same internal machinery but wraps it so that developers never interact with _ANECompiler or _ANEClient directly.
| Dimension | Core ML path | Private ANE path (Orion) |
|---|---|---|
| API stability | Public, versioned, supported | Private, undocumented, breakable per OS |
| Dispatch control | Framework-selected, opaque | Developer-explicit via _ANEClient |
| Compilation visibility | Hidden | Full access to compile latency and cache |
| Delta compilation | Not exposed | Orion implements directly |
| Zero-copy I/O | Framework-managed | IOSurface-backed, developer-managed |
| Weight mutability | Abstracted (re-export model) | Constrained by compile-time baking |
| Production risk | Low | High — API surface is private contract |
The private ANE path is not an officially supported alternative to Core ML. Orion's reliance on _ANEClient and _ANECompiler means any OS update can silently break the system. Core ML remains the only supported compiler/runtime boundary Apple provides developers.
Apple Neural Engine programming constraints from the MIL IR view
The most practically useful finding in Orion is its taxonomy of 20 restriction classes governing MIL IR programs that target ANE lowering. These span graph structure, operator support, memory layout, numerical behavior, and compile-time weight handling. Most of the pain in real model porting traces back to one or more of these restriction classes.
The single most consequential constraint is compile-time weight baking. As the Orion paper states, "Critically, the ANE bakes weights at compile time: weight tensors are embedded in the compiled program and cannot be mutated post-compilation." Any workload that updates parameters (fine-tuning, online learning, LoRA adapter swapping) requires recompiling the affected program regions, and without delta compilation that means a full recompile on every update. At roughly 4,200 ms per full recompile, even a shallow fine-tuning loop becomes dominated by compiler overhead rather than compute.
MIL IR operations and graph-shape restrictions
Not every MIL operation survives lowering to ANE execution. Orion frames this explicitly as a compiler-lowering constraint rather than a hardware execution property — meaning an op may be theoretically computable on the ANE's MACs, but the lowering pass in _ANECompiler either lacks the pattern match or rejects the graph shape. Graph structure restrictions form a distinct category within Orion's 20 restriction classes, alongside operator-level restrictions.
Pro Tip: When a model fails to target the ANE, the first diagnostic question is whether the failure is a lowering constraint (the compiler cannot express this op in E5 microcode) or a hardware execution constraint (the op is expressible but exceeds a resource limit). Orion's framing treats these as separate classes. A lowering failure typically surfaces at compile time; a hardware execution limit may not. Conflating the two leads to wasted effort patching ops that the hardware could run if the lowering pass were more complete.
Orion's compiler addresses this by running five optimization passes over the graph IR before code generation, restructuring the graph into forms that the lowering pass can accept. Ops that cannot be restructured into a legal ANE form are rejected, which surfaces as a compilation failure rather than a runtime error.
Memory layout, tensor packing, and weight baking rules
Compile-time weight baking is incompatible with any workload that mutates parameters per step unless a delta-compilation strategy patches only the changed regions. The IOSurface-backed zero-copy tensor I/O path imposes its own layout constraints: because the runtime shares memory between the host process and the ANE runtime through IOSurface, tensor buffers must conform to the memory layout the ANE expects. Arbitrary tensor strides or non-contiguous layouts that work on CPU or GPU may not satisfy the shared-memory geometry requirements.
Memory layout restrictions are part of the 20 MIL IR restriction classes, so layout problems in the graph itself surface during compilation rather than during inference. Successful compilation does not validate the buffers supplied later, however: a compiled program carries an implicit promise that its runtime tensors will use the compiled layout, and violating that promise produces incorrect results or crashes rather than a clean error.
Watch Out: Training loops that change weight shapes between steps — such as those that add adapter layers or resize projection matrices — will force full recompilation on each structural change. Even with delta compilation in place, structural shape changes fall outside the delta-reuse window and incur the full ~4,200 ms compile cost. Verify that your update strategy keeps graph structure and weight shapes constant if you need delta compilation to be effective.
Numerical behavior, dtype handling, and unsupported semantics
Numerical behavior and datatype handling form their own restriction category in Orion's 20-class taxonomy, separate from operator restrictions. This matters because a model can use only supported operators and still fail ANE lowering if those operators are parameterized with unsupported dtypes, precision modes, or numerical semantics.
Pro Tip: When porting models to ANE, dtype failures are among the most counterintuitive because they appear as compiler rejections on ops that are otherwise listed as supported. A conv2d with float32 weights may lower correctly; the same conv2d with bfloat16 intermediate tensors may not. Systematically testing each dtype variant of each op against the lowering pipeline, not just the op name, catches this class of failure before it blocks an entire porting effort.
Precision mismatches that go undetected at the MIL level can propagate silently into incorrect numerical outputs. Orion's paper treats numerical behavior restrictions as a separate class precisely because they are not always caught as hard compilation failures.
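The dtype-by-dtype testing described above can be organized as a small audit pass over a graph description before anything reaches the compiler. The sketch below is illustrative only: the graph format, the `audit_dtypes` helper, and the contents of `ASSUMED_SUPPORTED` are assumptions made for this example, not a list taken from the Orion paper. In practice the table would be filled in from your own lowering tests.

```python
# Assumption for illustration: the real set of lowerable dtypes per op is not
# given in the text and has to be established empirically, one variant at a time.
ASSUMED_SUPPORTED = {
    "conv2d": {"float16", "float32"},
    "matmul": {"float16"},
}


def audit_dtypes(graph: list) -> list:
    """Return (node name, reason) for every node that needs a lowering test."""
    findings = []
    for node in graph:
        allowed = ASSUMED_SUPPORTED.get(node["op"])
        if allowed is None:
            findings.append((node["name"], f"op {node['op']} has no entry"))
        elif node["dtype"] not in allowed:
            findings.append(
                (node["name"], f"{node['op']} with {node['dtype']} is untested")
            )
    return findings


graph = [
    {"name": "stem", "op": "conv2d", "dtype": "float32"},
    {"name": "block1", "op": "conv2d", "dtype": "bfloat16"},
    {"name": "head", "op": "topk", "dtype": "float16"},
]
findings = audit_dtypes(graph)
for name, reason in findings:
    print(f"{name}: {reason}")
```

Keying the table on (op, dtype) pairs rather than op names alone is the point of the exercise: it flags `block1` even though `conv2d` itself is listed.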
| Restriction class group | What it covers | Why it matters for ANE targeting | Practical impact |
|---|---|---|---|
| Graph structure | Shape-legal subgraphs, branch patterns, topology constraints | Determines whether _ANECompiler can lower the MIL IR at all | Structural edits force recompilation or failure |
| Operator support | Individual op availability in the ANE lowering path | A graph can fail even if most ops are supported | Unsupported ops block direct execution |
| Memory layout | Tensor packing, strides, shared-buffer geometry | IOSurface-backed zero-copy I/O depends on compatible layout | Bad layouts fail at compile time or misbehave at runtime |
| Numerical behavior | Dtypes, precision modes, accumulation semantics | Supported ops can still fail on unsupported numeric variants | Requires dtype-by-dtype validation |
| Compile-time weight handling | Weight baking, mutation rules, patchability | Weights are embedded into the compiled program | Training requires delta compilation or full recompilation |
| Remaining Orion classes | Additional restrictions across lowering, resource use, and codegen | The 20-class taxonomy is broader than op support alone | Porting requires checking the whole constraint set, not one category |
Orion's compiler pipeline and delta compilation mechanism
Orion's compiler pipeline transforms the input graph IR into ANE-executable programs through five optimization passes before generating ANE-native MIL. The key design decision is that this pipeline does not just lower ops — it restructures the graph to fit the lowering constraints documented in Orion's 20-class taxonomy. Without the restructuring passes, a significantly larger fraction of model graphs would fail lowering outright.
The performance case for the pipeline centers on delta compilation. Naive training on the ANE fails economically because each weight update requires full recompilation: "Naïvely, this means every weight update during training requires full recompilation." At 4,200 ms per step, a 1,000-step fine-tuning run spends about 70 minutes in the compiler. Delta compilation reduces per-step recompilation to 494 ms, which brings total compiler overhead for the same run down to roughly 8 minutes. These totals are calculated from the paper's measured per-step times; they are not separately reported benchmarks.
| Metric | Full recompilation | Delta compilation |
|---|---|---|
| Per-step compile time | ~4,200 ms | ~494 ms |
| Compile-time speedup vs. full recompile | 1× (baseline) | ~8.5× (4,200 ÷ 494) |
| Training speedup (end-to-end) | 1× (baseline) | 3.8× |
| Applicable to structural changes | Yes (always recompiles) | No — degrades to full recompile |
| Applicable to weight-only updates | Yes (wasteful) | Yes — primary use case |
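The arithmetic behind these figures is easy to reproduce. The short calculation below uses only the two per-step compile times quoted from the paper; the 1,000-step run length is the illustrative value used in this article, not a benchmark configuration.

```python
# Per-step compile times quoted from the Orion paper, in milliseconds.
FULL_RECOMPILE_MS = 4200
DELTA_COMPILE_MS = 494


def compile_overhead_minutes(steps: int, per_step_ms: float) -> float:
    """Total time spent in the compiler over a run, in minutes."""
    return steps * per_step_ms / 1000 / 60


steps = 1000  # illustrative run length, not a reported benchmark
full_min = compile_overhead_minutes(steps, FULL_RECOMPILE_MS)
delta_min = compile_overhead_minutes(steps, DELTA_COMPILE_MS)
compile_ratio = FULL_RECOMPILE_MS / DELTA_COMPILE_MS

print(f"full recompile: {full_min:.1f} min")        # 70.0 min
print(f"delta compile:  {delta_min:.1f} min")       # 8.2 min
print(f"compile-time ratio: {compile_ratio:.1f}x")  # 8.5x
```

The 8.5× ratio applies to compile time alone. The 3.8× figure is the end-to-end training speedup, which is smaller because each step also includes work that delta compilation does not touch.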
The five optimization passes in Orion's lowering pipeline
Orion's abstract confirms the pipeline contains exactly five optimization passes: "Orion includes a compiler that lowers a graph IR through five optimization passes to ANE-native MIL." The abstract gives the count but does not name the passes, so the table below describes functional stages of the lowering process rather than the paper's internal pass names.
| Pass stage | Functional role |
|---|---|
| Graph normalization | Rewrites graph structure into forms the lowering pass accepts; eliminates patterns from the restriction classes |
| Operator fusion | Merges adjacent ops (e.g., conv + bias + activation) to reduce MIL instruction count and improve E5 dispatch efficiency |
| Memory layout assignment | Assigns tensor memory layouts compatible with IOSurface-backed zero-copy I/O and ANE packing requirements |
| Weight serialization | Embeds weight tensors into the compiled program; marks them as compile-time constants per the baking constraint |
| Code generation | Emits ANE-native MIL for dispatch to _ANECompiler → E5 microcode |
Pro Tip: The Core ML compiler performs structurally similar passes internally when lowering to ANE, but exposes none of the intermediate representation. Orion's value is making each pass inspectable — which means lowering failures can be attributed to a specific pass rather than diagnosed as an opaque Core ML rejection.
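To make the idea of pass-level attribution concrete, here is a minimal pass-manager sketch. It is not Orion's compiler: the three toy passes, the `LoweringError` type, and the placeholder set of allowed ops are all assumptions chosen to keep the example small. What it shows is the pattern of tagging a rejection with the pass that raised it.

```python
class LoweringError(Exception):
    """Compilation failure tagged with the pass that raised it."""

    def __init__(self, pass_name: str, reason: str):
        super().__init__(f"{pass_name}: {reason}")
        self.pass_name = pass_name


def normalize(graph):
    # Toy normalization: canonicalize op names.
    return [op.lower() for op in graph]


def fuse(graph):
    # Toy fusion: merge an adjacent conv + relu pair into one op.
    fused, i = [], 0
    while i < len(graph):
        if graph[i : i + 2] == ["conv", "relu"]:
            fused.append("conv_relu")
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused


def check_supported(graph):
    allowed = {"conv", "conv_relu", "matmul"}  # placeholder set, not the ANE's
    for op in graph:
        if op not in allowed:
            raise ValueError(f"no lowering for op '{op}'")
    return graph


PASSES = [("normalize", normalize), ("fuse", fuse), ("operator check", check_supported)]


def run_pipeline(graph, passes=PASSES):
    """Run passes in order; re-raise any rejection with the pass name attached."""
    for name, fn in passes:
        try:
            graph = fn(graph)
        except ValueError as exc:
            raise LoweringError(name, str(exc)) from exc
    return graph


lowered = run_pipeline(["Conv", "ReLU", "MatMul"])
print(lowered)  # ['conv_relu', 'matmul']

try:
    run_pipeline(["Conv", "TopK"])
except LoweringError as err:
    failed_pass = err.pass_name
    print(f"rejected in pass: {err}")
```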
How delta compilation reuses compiled structure across training steps
Delta compilation preserves the compiled program structure across consecutive training steps and patches only the regions that changed — primarily updated weight tensors. Because weights are compile-time baked, the mechanism must identify which compiled regions correspond to which weights, track which weights changed, and re-emit only the affected compiled artifacts without invalidating the unchanged portions of the program cache.
The 4,200 ms → 494 ms reduction confirms that the dominant cost in per-step recompilation is the structural compilation work, not the weight serialization itself. Delta compilation eliminates most of that structural work when the graph topology and tensor shapes are stable between steps.
Pro Tip: Delta compilation yields its roughly 8.5× reduction in per-step compile time only when the graph topology and weight shapes remain constant across steps, which is the typical case in SGD or Adam optimization. If training involves dynamic graph modifications (conditional blocks, variable-length sequences, architecture search), each modified region degrades toward full recompilation. Design training loops to isolate weight updates from structural changes to keep delta reuse rates high.
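The bookkeeping described in this section (map regions to weights, detect which weights changed, re-emit only those regions) can be modeled in a few lines. The sketch below is a toy model, not Orion's implementation: `DeltaTracker`, the region names, and the `compile_region` callback are hypothetical, and content hashing is just one plausible way to detect changed weights. It also treats any change in the set of regions as a structural change that invalidates everything.

```python
import hashlib


def fingerprint(weight: bytes) -> str:
    """Content hash used to detect whether a weight tensor changed."""
    return hashlib.sha256(weight).hexdigest()


class DeltaTracker:
    """Toy model of delta compilation: recompile only regions whose weights changed.

    `compile_region` is a hypothetical stand-in for the expensive compile step;
    it is not an Orion or Apple API.
    """

    def __init__(self, compile_region):
        self._compile_region = compile_region
        self._fingerprints = {}  # region name -> hash of last compiled weights
        self._artifacts = {}     # region name -> compiled artifact

    def step(self, weights: dict) -> list:
        """Recompile changed regions and return their names."""
        if self._fingerprints and set(weights) != set(self._fingerprints):
            # Structural change: the region set differs, so nothing is reused.
            self._fingerprints.clear()
            self._artifacts.clear()
        recompiled = []
        for name, blob in weights.items():
            digest = fingerprint(blob)
            if self._fingerprints.get(name) != digest:
                self._artifacts[name] = self._compile_region(name, blob)
                self._fingerprints[name] = digest
                recompiled.append(name)
        return recompiled


tracker = DeltaTracker(lambda name, blob: f"compiled:{name}:{len(blob)}")
first = tracker.step({"attn.q": b"\x00" * 8, "mlp.up": b"\x01" * 8})
second = tracker.step({"attn.q": b"\x00" * 8, "mlp.up": b"\x02" * 8})
third = tracker.step({"attn.q": b"\x00" * 8, "adapter": b"\x03" * 8})
print(first)   # both regions compile on the first step
print(second)  # only the region whose weights changed
print(third)   # region set changed, so everything recompiles
```

A real system would also need to include tensor shapes in the structural check; this sketch only compares region names and weight bytes.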
Runtime design: direct execution, program cache, and IOSurface zero-copy I/O
Orion's runtime sits between the compiler output and the ANE hardware, managing four distinct responsibilities: compilation, evaluation, tensor I/O, and program caching. The architecture surfaces these as separable layers, each with its own failure modes.
flowchart LR
subgraph Host["Host Process"]
GIR["Graph IR"]
PC["Program Cache"]
DC["Delta Compilation\nEngine"]
end
subgraph CoreRuntime["Orion Core Runtime"]
COMP["Compile\n(_ANECompiler)"]
EVAL["Eval\n(_ANEClient)"]
IO["IOSurface I/O\n(zero-copy)"]
CACHE["Program Cache\nLookup / Store"]
end
subgraph ANEStack["ANE Private Stack"]
ANEC["_ANECompiler"]
ACLNT["_ANEClient"]
E5["E5 Microcode\nExecution"]
end
GIR --> DC
DC --> COMP
PC <--> CACHE
COMP --> ANEC
ANEC --> CACHE
CACHE --> EVAL
EVAL --> ACLNT
ACLNT --> E5
IO -.->|"Shared IOSurface\nbuffer"| E5
E5 -.->|"Output tensor\n(zero-copy)"| IO
The program cache is the runtime's primary tool for amortizing compile cost. On the first execution of a program, compile latency dominates. Subsequent executions with unchanged program structure hit the cache and pay only dispatch overhead. Delta compilation feeds the cache by producing patched programs that are structurally compatible with cached entries.
Private API boundary: _ANEClient and _ANECompiler
Orion reaches the ANE by calling _ANECompiler to emit E5 microcode and _ANEClient to schedule execution. Both are private Apple frameworks — not documented, not ABI-stable, and not covered by any developer support contract. The paper confirms this directly: programs are "compiled to E5 microcode by _ANECompiler, and evaluated via _ANEClient."
The private API boundary is the mechanism Orion uses to bypass Core ML abstractions and access the compile-then-dispatch model directly. That same bypass is the production risk.
Pro Tip: Private API dependence means any macOS or iOS update can rename, restructure, or remove _ANEClient and _ANECompiler without notice. Orion's approach is appropriate for research and for closed-system deployments where the OS version is frozen, but it is not a viable foundation for App Store distribution or any software that needs to survive an OTA update. The Core ML compiler remains the only path with a supported ABI contract.
IOSurface-backed zero-copy tensor I/O
IOSurface is a macOS/iOS kernel object that exposes a shared memory region to multiple processes through shared handles. Orion maps input and output tensors onto IOSurface-backed buffers so that the host process and the ANE runtime share the same physical memory — eliminating the copy that would otherwise occur when staging tensors for accelerator dispatch.
Zero-copy eliminates transfer overhead but does not eliminate coordination overhead. The host and ANE runtime must agree on which process owns the buffer at any point in time, and the host must not read output tensors until the ANE has signaled completion.
Watch Out: IOSurface-backed zero-copy designs shift responsibility for buffer lifetime, ownership, and synchronization to the caller. If the host process releases or overwrites an input buffer before the ANE dispatch completes, the result is undefined behavior — typically silent data corruption rather than a crash. If the host reads an output tensor before the ANE signals completion, it reads stale or partially-written data. Neither failure produces a clean error in the private API path; both require explicit synchronization barriers around every dispatch.
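The ownership discipline described above can be illustrated without touching any Apple API. In the sketch below a `bytearray` stands in for an IOSurface-backed buffer and a background thread stands in for the accelerator; `SharedBuffer` and `fake_dispatch` are hypothetical names invented for this example. The part that carries over to a real zero-copy design is the ordering: write input, dispatch, wait on a completion signal, and only then read output or reuse input.

```python
import threading


class SharedBuffer:
    """Stand-in for an IOSurface-backed buffer; a plain bytearray here."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.done = threading.Event()  # set by the device side on completion


def fake_dispatch(inp: SharedBuffer, out: SharedBuffer) -> threading.Thread:
    """Hypothetical asynchronous dispatch: doubles each input byte into `out`."""

    def run():
        for i, b in enumerate(inp.data):
            out.data[i] = (b * 2) % 256
        out.done.set()  # completion signal; the host must not read before this

    out.done.clear()
    worker = threading.Thread(target=run)
    worker.start()
    return worker


inp, out = SharedBuffer(4), SharedBuffer(4)
inp.data[:] = bytes([1, 2, 3, 4])   # host owns the input until dispatch
worker = fake_dispatch(inp, out)    # ownership passes to the device side
if not out.done.wait(timeout=5.0):  # barrier: block until completion
    raise TimeoutError("dispatch did not signal completion")
worker.join()
result = bytes(out.data)            # safe to read only after the barrier
inp.data[:] = bytes(4)              # safe to reuse the input buffer now
print(list(result))                 # [2, 4, 6, 8]
```

Copying the output into `result` before reusing the buffers is a deliberate choice in this sketch: it ends the window in which the host depends on shared memory that the next dispatch may overwrite.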
Program cache behavior and recompilation limits
The program cache stores compiled ANE programs keyed by their structural identity. A cache hit on repeated calls with the same program topology pays only dispatch overhead, which is orders of magnitude lower than compile latency. The full recompilation cost of ~4,200 ms applies only on cache misses — either the first call for a given program or any call where the program's structure has changed.
| Execution scenario | Cache state | Expected latency source |
|---|---|---|
| First run (new program) | Miss | Full compile: ~4,200 ms + dispatch |
| Repeated inference (static graph) | Hit | Dispatch only: sub-millisecond |
| Weight update (delta compilation) | Partial hit | Delta recompile: ~494 ms + dispatch |
| Shape change or structural edit | Miss | Full compile: ~4,200 ms + dispatch |
| OS update invalidates cache | Miss | Full compile on next run |
Workloads with high cache hit rates pay minimal compiler overhead. Workloads that frequently miss the cache — due to dynamic shapes, frequent structural edits, or OS-level cache invalidation — pay the full ~4,200 ms penalty and lose most of the ANE's execution efficiency advantage.
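A cache keyed by structural identity can be sketched as follows. This is an illustration under stated assumptions, not Orion's cache: the graph representation, `structural_key`, and `ProgramCache` are hypothetical, and the key deliberately covers only topology, shapes, and dtypes. Because the ANE bakes weights into the compiled program, a structure-only key like this fits fixed-weight inference; a training setup would also have to patch or version the cached entry when weights change.

```python
import hashlib
import json


def structural_key(ops: list) -> str:
    """Hash topology, shapes, and dtypes; weight values are deliberately excluded."""
    canonical = [
        {
            "op": o["op"],
            "inputs": o["inputs"],
            "shape": list(o["shape"]),
            "dtype": o["dtype"],
        }
        for o in ops
    ]
    return hashlib.sha256(json.dumps(canonical, sort_keys=True).encode()).hexdigest()


class ProgramCache:
    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # hypothetical expensive compile step
        self._programs = {}
        self.hits = 0
        self.misses = 0

    def get(self, ops):
        key = structural_key(ops)
        if key in self._programs:
            self.hits += 1
        else:
            self.misses += 1
            self._programs[key] = self._compile_fn(ops)
        return self._programs[key]


def graph(seq_len):
    return [
        {"op": "matmul", "inputs": ["x", "w"], "shape": (seq_len, 64), "dtype": "float16"},
        {"op": "softmax", "inputs": ["matmul_0"], "shape": (seq_len, 64), "dtype": "float16"},
    ]


cache = ProgramCache(lambda ops: f"program[{len(ops)} ops]")
cache.get(graph(128))  # miss: first compile
cache.get(graph(128))  # hit: same structure
cache.get(graph(256))  # miss: shape changed
print(cache.hits, cache.misses)  # 1 2
```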
Where the runtime breaks on real models and training loops
The compiler and runtime design Orion describes performs well within a narrow operating envelope: static graph topology, stable weight shapes, no unusual numeric semantics, and updates handled via delta compilation. Outside that envelope, failure modes compound.
| Scenario | Per-step overhead | Root cause | Mitigation |
|---|---|---|---|
| Naive per-step training | ~4,200 ms | Full recompile per weight update | Delta compilation |
| Delta compilation (weight updates) | ~494 ms | Partial recompile of changed regions | Keep graph structure constant |
| Dynamic shapes (variable sequence length) | ~4,200 ms per shape change | Cache miss on shape mismatch | Pad to fixed shapes or bucket |
| Unsupported op in graph | Compile failure | MIL lowering restriction | Replace with supported op |
| Dtype mismatch | Compile failure or silent error | Numerical restriction class | Audit dtype chain before lowering |
| Structural graph edit per step | ~4,200 ms | Delta compilation cannot reuse | Freeze graph structure |
The 3.8× training speedup Orion reports assumes the delta-compilation path is active and the graph structure is stable. That speedup is real but conditional: it measures the gain over the naive full-recompile baseline, not against a GPU or CPU training baseline. Engineers evaluating whether ANE-side training is worthwhile need to compare 494 ms per step against their GPU step time — not against the unoptimized 4,200 ms ANE baseline.
| Benchmark | Measured value | Interpretation |
|---|---|---|
| Full recompilation per step | ~4,200 ms | Dominated by compile work, not compute |
| Delta compilation per step | ~494 ms | Reduced step cost when the graph topology and shapes are stable |
| End-to-end training speedup | 3.8× | Conditional on stable topology and delta reuse |
Compile-count ceilings, silent failures, and stateful runtime edge cases
The Orion paper's accessible sections characterize the structural failure modes of compile-time weight baking and cache invalidation without reporting a specific numeric limit on total compilations per process lifetime or session. No verified compile-count ceiling appears in the confirmed source text, so this article does not assert one.
Watch Out: The private API path through _ANEClient and _ANECompiler does not offer the same error-surface guarantees as a public framework. Failures in the private path may surface as generic breakage rather than clean exceptions. Any deployment that drives repeated compilation in a long-running process should include explicit health checks on program cache state and dispatch return codes rather than relying on exception-based error handling.
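One way to structure such a health check is a thin wrapper that refuses to trust a dispatch unless both the return code and the output pass a sanity test. The sketch below assumes a hypothetical `dispatch` callable that returns a `(code, outputs)` pair with zero meaning success. That convention is an assumption for this example and not the documented behavior of any Apple API.

```python
import math


class DispatchError(RuntimeError):
    """Raised when a dispatch fails or returns implausible output."""


def checked_dispatch(dispatch, inputs):
    """Call `dispatch` and validate both its return code and its output."""
    code, outputs = dispatch(inputs)
    if code != 0:
        raise DispatchError(f"dispatch returned code {code}")
    if not outputs or any(math.isnan(x) or math.isinf(x) for x in outputs):
        raise DispatchError("dispatch reported success but output is empty or non-finite")
    return outputs


def healthy(inputs):
    # Hypothetical well-behaved backend.
    return 0, [x * 2.0 for x in inputs]


def silently_broken(inputs):
    # Hypothetical backend that reports success but returns garbage.
    return 0, [float("nan")] * len(inputs)


good = checked_dispatch(healthy, [1.0, 2.0])
print(good)  # [2.0, 4.0]

try:
    checked_dispatch(silently_broken, [1.0, 2.0])
    caught = None
except DispatchError as err:
    caught = str(err)
    print(f"health check failed: {caught}")
```

A finiteness check catches only one kind of silent failure. Comparing a known input against a reference output computed on CPU is a stronger, if slower, probe.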
Model classes that are practical versus fragile on ANE
The 20 MIL IR restriction classes define a boundary between model graphs that lower reliably and those that fail outright or degrade to full recompilation.
| Model class | ANE path stability | Primary risk |
|---|---|---|
| Static-shape CNN inference (e.g., image classification) | Stable | Unusual activations may fail lowering |
| Fixed-shape transformer inference (fixed seq len) | Stable with tuning | Attention pattern may need restructuring |
| LLM inference with fixed KV-cache shape | Moderately stable | Large model may exceed compile-time budget |
| LoRA / adapter fine-tuning (fixed adapter shape) | Viable with delta compilation | Adapter weight updates must not change shapes |
| Dynamic-shape NLP (variable sequence length) | Fragile | Each new shape is a cache miss |
| Architecture search / NAS training loops | Not viable | Structural changes per step defeat delta compilation |
| Models with bfloat16 or custom numeric ops | Risky | Dtype/numerical restriction classes |
Static-shape inference workloads with conventional operator sets are the safest target. Training workloads are viable only when Orion's delta-compilation conditions hold. Dynamic graphs — especially those used in beam search with variable output lengths or architecture search — are structurally incompatible with the compile-then-dispatch model at practical latencies.
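For variable-length inputs, the usual workaround is to pad each sequence to one of a small number of fixed sizes so that only a handful of program shapes ever need compiling. The sketch below shows the bucketing logic; the bucket sizes and helper names are illustrative assumptions, and the padding cost in wasted compute is not modeled.

```python
import bisect

BUCKETS = [64, 128, 256, 512]  # illustrative bucket sizes, not from the paper


def bucket_for(length: int) -> int:
    """Return the smallest bucket that fits `length`."""
    i = bisect.bisect_left(BUCKETS, length)
    if i == len(BUCKETS):
        raise ValueError(f"sequence length {length} exceeds largest bucket")
    return BUCKETS[i]


def pad_to_bucket(tokens: list, pad_id: int = 0) -> list:
    """Right-pad `tokens` with `pad_id` up to its bucket size."""
    target = bucket_for(len(tokens))
    return tokens + [pad_id] * (target - len(tokens))


lengths = [10, 70, 64, 300, 120]
shapes = {bucket_for(n) for n in lengths}
print(sorted(shapes))  # [64, 128, 512]: three compiled shapes instead of five
```

Fewer, larger buckets mean fewer compiles but more padding per request, so the bucket list is a tuning knob rather than a fixed recipe.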
Practical implications for constrained-device engineers
The ANE direct path is worth targeting when compile-time budget, graph stability, and API-stability risk all align. When any of those three conditions breaks, Core ML or GPU execution is the safer choice.
Choose the direct ANE path (Orion-style) when:
- The graph is static-shape, operator-compliant per the 20 MIL IR restrictions, and already dispatches to the ANE when run through Core ML
- Training involves only weight updates on a frozen graph topology, making delta compilation effective
- The deployment environment is OS-locked (research system, controlled device fleet) and private API breakage is an acceptable maintenance cost
- The 3.8× training speedup is a meaningful improvement over the GPU alternative on the same device

Choose the Core ML compiler path when:
- The software ships through the App Store or must survive OS updates
- The model uses unusual operators or mixed dtypes that intersect with MIL lowering restrictions
- The team lacks the infrastructure to monitor private API breakage across OS releases
- Core ML's automatic accelerator dispatch already achieves acceptable latency

Choose GPU (Metal) execution when:
- Dynamic shapes or variable-length sequences make the ANE cache miss rate unacceptable
- The model requires bf16 or fp32 accumulation semantics the ANE cannot guarantee
- Training loop iteration speed is the binding constraint and the graph is too dynamic for delta compilation
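These criteria can be condensed into a small decision helper for design reviews. The sketch below is a deliberately simplified reading of the three lists: the `Workload` fields and the ordering of the checks are assumptions for illustration, and it omits the performance comparisons that only measurement can answer.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    static_shapes: bool
    ops_lower_cleanly: bool
    os_version_locked: bool
    ships_via_app_store: bool
    weight_only_updates: bool = True  # False if training edits graph structure


def choose_path(w: Workload) -> str:
    """Map a workload description to one of the three execution paths."""
    if not w.static_shapes or not w.weight_only_updates:
        return "gpu-metal"   # too dynamic for the cache or delta compilation
    if w.ships_via_app_store or not w.os_version_locked or not w.ops_lower_cleanly:
        return "core-ml"     # needs the supported, stable interface
    return "direct-ane"      # static, compliant, and OS-locked


research_rig = Workload(True, True, True, False)
shipping_app = Workload(True, True, False, True)
dynamic_nlp = Workload(False, True, True, False)
print(choose_path(research_rig))  # direct-ane
print(choose_path(shipping_app))  # core-ml
print(choose_path(dynamic_nlp))   # gpu-metal
```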
Porting checklist for unusual ops, dynamic shapes, and training updates
Orion's 20 restriction classes are the primary checklist surface for any ANE porting project. Before investing in throughput optimization, verify the following:
- Lowering compliance: Confirm every op in the graph has a known ANE lowering path. Test each op individually against _ANECompiler before testing the full graph.
- Memory layout compatibility: Verify tensor memory layouts satisfy ANE packing requirements at every graph boundary, especially at IOSurface exchange points.
- Dtype chain audit: Trace the complete dtype chain through the graph. Every intermediate tensor's dtype must fall within the numerical restriction classes.
- Shape stability: Confirm that all tensor shapes — including intermediate activations, not just inputs and outputs — are static. Any dynamic shape forces a cache miss.
- Recompilation budget: Measure full compile latency for your graph. If it exceeds acceptable step time, test delta compilation with representative weight-update deltas before committing to the ANE path.
Pro Tip: Testing lowering compliance, memory layout compatibility, and recompilation behavior in that order catches failures at the cheapest point. Throughput optimization on a graph that fails lowering or breaks under realistic update patterns is wasted effort. Confirm the pipeline end-to-end with real model weights and real update sizes before benchmarking peak inference throughput.
What hardware and model profiles are most likely to benefit
Apple chips from the A11 Bionic onward include Neural Engine hardware, but the practical target for Orion-style direct programming is current Apple silicon, where the ANE offers meaningful throughput.
| Workload profile | Hardware fit | Expected ANE benefit |
|---|---|---|
| Latency-sensitive static-shape inference | M4-class Mac | High — cache hits dominate, dispatch overhead low |
| Iterative weight updates (delta compilation) | M4-class Mac with locked OS | Moderate — 3.8× over naive recompile baseline |
| Large LLM inference (fixed KV-cache shape) | M4-class Mac | Moderate — depends on op coverage and memory fit |
| Dynamic-shape NLP / generative decoding | Any Apple silicon | Low — shape variability causes excessive cache misses |
| Large dynamic training graphs | Any Apple silicon | Not viable at current compile latencies |
The M4 Neural Engine's published 38 TOPS figure is a vendor marketing headline rather than a measured inference benchmark. Actual achieved throughput depends on op coverage, memory layout alignment, and compile-time budget — all of which are constrained by the 20 MIL IR restriction classes. Engineers should treat that headline as an upper bound that requires favorable graph conditions to approach.
Frequently asked questions about Apple Neural Engine internals
How does Apple Neural Engine work?
The ANE operates on a compile-then-dispatch model. Model programs are expressed in MIL IR, compiled to E5 microcode by _ANECompiler, and executed via _ANEClient. The hardware is a fixed-function accelerator optimized for matrix operations common in neural networks — convolutions, attention, linear layers. It does not execute arbitrary code; it runs compiled E5 microcode produced from legal MIL IR programs.
Can you run models directly on Apple Neural Engine without Core ML?
Yes, but only through private APIs. Orion demonstrates end-to-end direct execution via _ANEClient and _ANECompiler, bypassing the Core ML compiler entirely. This is not publicly supported, is not ABI-stable across OS releases, and is not appropriate for App Store distribution. Core ML is the only public, supported interface to ANE execution.
What is delta compilation in Apple Neural Engine?
Delta compilation, as implemented in Orion, identifies which regions of a compiled ANE program correspond to changed weights between training steps and recompiles only those regions — preserving the rest of the cached program structure. This reduces per-step recompilation from the full ~4,200 ms to ~494 ms, a reduction that produces a 3.8× end-to-end training speedup when graph topology and shapes are stable.
What are the limitations of Apple Neural Engine programming?
Orion characterizes 20 restriction classes covering operator support, graph structure, memory layout, numerical behavior, and compile-time weight baking. The most consequential single constraint is that weights are baked into the compiled program and cannot be mutated post-compilation, which forces recompilation for any parameter update. Unsupported ops, illegal graph shapes, dtype mismatches, and non-conforming memory layouts are additional failure classes.
Is Apple Neural Engine faster than CPU or GPU for inference?
For static-shape workloads with fully compliant MIL IR graphs, the ANE dispatches at sub-millisecond latency with high energy efficiency compared to CPU inference. Comparison against GPU (Metal) depends on model size, memory bandwidth, and op coverage. The ANE is not strictly faster than GPU for all workloads — it is more energy-efficient for workloads it can execute, but dynamic graphs or non-compliant op sets will execute faster on GPU due to the ANE's compile-time rigidity.
Sources & References
- Orion: Characterizing and Programming Apple's Neural Engine for LLM Training and Inference (arXiv 2603.06728). Primary source for all ANE constraint taxonomy, delta compilation benchmarks, and runtime architecture findings cited in this article
- Orion paper (HTML full text, arXiv 2603.06728v1). Source for _ANEClient/_ANECompiler API descriptions and compile-then-dispatch model characterization
- Orion paper (PDF, arXiv 2603.06728). Source for compile-time weight baking quote and runtime architecture figure
- Apple Core ML documentation. Official reference for the public Core ML compiler and supported model interface
Keywords: Apple Neural Engine, Core ML, MIL IR, _ANEClient, _ANECompiler, IOSurface, M4 Neural Engine, A11 Bionic, Apple silicon, LLM inference, delta compilation, program cache, zero-copy tensor I/O, model lowering, training loop recompilation


