AI & ML

All about AI and machine learning: the latest articles and advances in the field.

All articles

Matryoshka representation learning for embeddings: how nested dimensions work in retrieval

Matryoshka representation learning trains embeddings so that prefix dimensions remain useful on their own, which enables truncation without retraining. The trade-off is that lower dimensions preserve less signal, so this article separates what the paper proves about truncation from what it does not prove about every downstream corpus.

19 min read

TensorRT-LLM large-scale expert parallelism: design choices for balancing MoE traffic

TensorRT-LLM’s large-scale expert parallelism adds online workload balancing and NVLink-aware communication kernels so MoE traffic can be redistributed dynamically across GPUs, but the architecture is tightly coupled to NVIDIA’s hardware, and the lower imbalance can come at the cost of extra scheduling and communication complexity.

22 min read

BGE-M3 and BGE Reranker in 2026: what the benchmarks say about dense, lexical, and multi-vector retrieval

BGE-M3 is designed as a single model that unifies dense, lexical, and multi-vector/ColBERT-style retrieval across 100+ languages and long inputs up to 8192 tokens — but its benchmark story is only meaningful if you read it alongside the reranker, because the model card shows reranking and multi-retrieval are complementary rather than interchangeable.

32 min read

Should you offload KV cache to host memory in production inference stacks?

Offloading KV cache to host memory can raise effective concurrency when HBM is the bottleneck, but it is best framed as a spend-shift decision: lower GPU-memory pressure and fewer OOMs versus higher TTFT and the hidden costs of extra system complexity, PCIe/NVLink traffic, and platform engineering time.

22 min read

How filtered vector search works under the hood

Filtered vector search is not one algorithm but a planner choice among pre-filtering, post-filtering, and inline filtering. Highly selective filters (few matching rows) favor pre-filtering, weakly selective filters (most rows match) favor post-filtering, and filters in between can use inline strategies. Stale selectivity estimates, however, can make the planner choose badly and hurt both recall and latency.

24 min read

When does pgvector make sense instead of a dedicated vector database?

pgvector is the right default when you already run PostgreSQL and need vector search joined to relational data, but the cited guidance says dedicated vector databases become worth evaluating around 50M+ vectors or when you need extremely low latency or built-in hybrid search.

21 min read

Qdrant vs pgvector vs pgvectorscale for billion-vector filtering workloads

On a 50M-vector benchmark, pgvectorscale/Postgres delivered 11.4x higher throughput than Qdrant at 99% recall (471.57 QPS vs 41.47 QPS) while Qdrant kept lower tail latency, but the result is workload-dependent and the Tiger Data comparison notes index build speed and operational trade-offs still matter.

16 min read

Should you standardize on smaller embedding dimensions for RAG retrieval costs in 2026?

Smaller embedding dimensions can materially reduce vector storage and index cost — for large corpora the difference between 3072-dim float32 and compressed 1024-dim representations can exceed 100GB — but the savings only matter if your recall loss stays inside the business tolerance for the workload.

18 min read

MoE++ with zero-computation experts: how the routing and gating residuals work

MoE++ adds zero-computation experts (zero, copy, constant) so tokens can discard, skip, or replace the MoE path, while gating residuals inject the previous layer’s routing signal to stabilize expert selection — but the design only pays off when FFN experts are the real bottleneck and zero-cost experts are deployed locally on every GPU to avoid communication overhead.

21 min read

Should you ship on-device or in the cloud for compressed LLM inference on consumer hardware?

Hybrid edge-cloud routing can cut inference cost dramatically because local queries avoid API spend, latency, and data egress, but the business case only holds when the on-device model can service the majority of traffic — otherwise the infra and platform overhead wipe out the savings.

17 min read

Should enterprises migrate from naive RAG to modular or GraphRAG architectures?

Naive RAG is fast and cheap for localized FAQ-style queries, but GraphRAG and modular RAG become the better investment when questions require multi-hop reasoning, cross-document relationships, or stronger governance — the catch is that GraphRAG adds ontology/graph-maintenance overhead and can be slower to operate.

24 min read

How to enable FP8 KV cache quantization in vLLM without breaking prefix caching

vLLM’s FP8 KV cache can coexist with prefix caching because automatic prefix cache keys are still block-hash based, but on ROCm/W7900 the combination has a documented crash path, so this article walks through the exact FP8 calibration path and the validation checks that prove prefix cache reuse still works.

18 min read
