AxiomLogica
Category

AI & ML

All about AI and machine learning: the latest articles and advances in the field.

DeepSeek-V3 and the case for auxiliary-loss-free MoE benchmarks
AI & ML · Featured

DeepSeek-V3 is benchmark-relevant not just because it is large, but because it combines auxiliary-loss-free load balancing, multi-token prediction, and FP8 training at scale. MLCommons now uses it as a pretraining benchmark with a 671B/37B MoE reference setup (671B total parameters, 37B activated per token), which makes it a meaningful test of modern sparse-training infrastructure rather than just another model card.

All articles

LongRoPE internals: how non-uniform RoPE rescaling reaches 2M tokens without retraining from scratch
AI & ML

LongRoPE exploits two non-uniformities in RoPE interpolation, one across RoPE dimensions and one across token positions, and uses an evolutionary search to find per-dimension, per-position rescaling factors. That enables an 8× extension without fine-tuning, followed by a progressive 256k→2048k extension path, but the model still needs short-context readjustment to recover original-window performance.

20 min read
DeepSeek-V3 auxiliary-loss-free load balancing: how the router changes MoE training
AI & ML

DeepSeek-V3 replaces the usual router auxiliary loss with a dynamically adjusted per-expert bias term. This preserves the load-balancing goal while avoiding the performance degradation the paper attributes to heavy auxiliary losses, but the benefit depends on sequence-wise balance and node-limited routing; it does not eliminate imbalance entirely.

22 min read
YaRN vs LongRoPE vs dynamic NTK scaling: which context-extension method should you choose in 2026?
AI & ML

LongRoPE pushes the ceiling to 2M tokens with a more complex search-and-progressive-extension pipeline, YaRN is validated in vLLM/Qwen deployment paths for practical length extrapolation, and dynamic NTK scaling is simpler to wire up. The real trade-off is not raw maximum length alone, though; it is how much short-context regression, fine-tuning, and framework-specific friction you are willing to accept.

23 min read
FSDP vs DeepSpeed in Accelerate: how to choose sharding, offload, and checkpointing settings
AI & ML

Accelerate maps FSDP FULL_SHARD to DeepSpeed ZeRO stage 3, but the two stacks diverge on offload and checkpointing. FSDP offload is all-or-nothing, while DeepSpeed can offload parameters and optimizer state separately and can even target NVMe. In the other direction, FSDP can checkpoint sharded state directly, whereas ZeRO-3 often needs a consolidation or post-conversion step, which changes the operational cost of saving 70B fine-tunes.

20 min read
How to extend a Llama or Qwen context window with YaRN in vLLM: a step-by-step deployment guide
AI & ML

vLLM’s Qwen deployment docs explicitly recommend RoPE scaling for context lengths beyond the pretrained 32,768-token limit and validate YaRN for length extrapolation. The scaling settings must match the model’s original max position embeddings and the sampling/runtime settings, however, or the model can silently degrade even though it accepts longer prompts.

18 min read
S-LoRA vs LoRAX vs vLLM PEFT: which multi-adapter serving stack fits your workload?
AI & ML

S-LoRA is optimized for high-scale multi-adapter serving through unified paging and heterogeneous batching, LoRAX is designed for thousands of adapters with dynamic loading and production features, and vLLM PEFT is the lighter-weight option when you want vLLM’s serving stack with adapter support but not the most aggressive multi-adapter specialization.

20 min read
