Grouped-query attention vs multi-query attention vs multi-head attention: KV-cache trade-offs for custom LLM architectures
GQA shrinks the KV cache by having each group of query heads share a single K/V head, which cuts inference memory bandwidth relative to MHA while preserving more quality than MQA. The right group size, however, depends on the latency budget, the context length, and how close the model must stay to full multi-head capacity.
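To make the memory side of the trade-off concrete, here is a minimal sketch of the usual back-of-the-envelope KV-cache estimate. The model shape (32 layers, 32 query heads, head dimension 128, 8192-token context, fp16) and the helper name `kv_cache_bytes` are illustrative assumptions, not values taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one forward context.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, chosen only for illustration.
config = dict(n_layers=32, head_dim=128, seq_len=8192)
n_query_heads = 32

sizes = {
    "MHA": kv_cache_bytes(n_kv_heads=n_query_heads, **config),  # one K/V head per query head
    "GQA-8": kv_cache_bytes(n_kv_heads=8, **config),            # 8 K/V heads, 4 query heads each
    "MQA": kv_cache_bytes(n_kv_heads=1, **config),              # a single shared K/V head
}

for name, size in sizes.items():
    print(f"{name:6s} {size / 2**20:8.0f} MiB per sequence")
```

Under these assumptions the cache scales linearly with the number of K/V heads, so the choice of group size maps directly onto per-sequence memory. The estimate ignores allocator overhead, paging, and cache quantization, all of which change the absolute numbers in practice.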