From 0e0a638c3b1e239ec4eaee5b4c15808768689eb0 Mon Sep 17 00:00:00 2001
From: Bram Wasti
Date: Fri, 31 Oct 2025 17:22:19 -0400
Subject: [PATCH] Batch invariance doc (#27839)

Signed-off-by: Bram Wasti
Signed-off-by: Bram Wasti
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 docs/features/batch_invariance.md | 133 ++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 docs/features/batch_invariance.md

diff --git a/docs/features/batch_invariance.md b/docs/features/batch_invariance.md
new file mode 100644
index 0000000000000..b196db9d9c25c
--- /dev/null
+++ b/docs/features/batch_invariance.md
@@ -0,0 +1,133 @@

# Batch Invariance

!!! note
    Batch invariance is currently in beta. Some features are still under active development.
    Track progress and planned improvements at <https://github.com/vllm-project/vllm/issues/27433>.

This document shows how to enable batch invariance in vLLM. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch. A short end-to-end illustration of this guarantee follows the motivation list below.

## Motivation

Batch invariance is crucial for several use cases:

- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.
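As a concrete illustration, the sketch below generates the same prompt once on its own and once as part of a larger batch, then checks that the two completions match. This is a minimal example rather than an official test: it reuses the offline API, model, and prompts from the examples later in this document, and it assumes the hardware requirements below are met.

```python
import os

# Enable batch invariance before vLLM is initialized
# (see "Enabling Batch Invariance" below).
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64, seed=42)

prompt = "The future of AI is"
other_prompts = ["Machine learning enables", "Deep learning models can"]

# Generate the prompt by itself, then again as the first request of a batch.
solo = llm.generate([prompt], sampling_params)[0].outputs[0].text
batched = llm.generate([prompt] + other_prompts, sampling_params)[0].outputs[0].text

# With batch invariance enabled, the two completions are expected to be
# token-for-token identical.
assert solo == batched
```

Without `VLLM_BATCH_INVARIANT=1`, the two completions can diverge, because batching changes the numerics of the underlying kernels.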
## Hardware Requirements

Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher:

- **Hopper (H-series)**: H100, H200
- **Blackwell (B-series)**: B100, B200

## Enabling Batch Invariance

Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:

```bash
export VLLM_BATCH_INVARIANT=1
```

### Online Inference (Server Mode)

To start a vLLM server with batch invariance enabled:

```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Then use the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)

print(response.choices[0].text)
```

### Offline Inference

For offline batch inference with batch invariance:

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```

## Tested Models

Batch invariance has been tested and verified on the following models:

- **DeepSeek series**: `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- **Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
- **Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- **Llama 3**: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).

## Implementation Details

When batch invariance is enabled, vLLM:

1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism, such as custom all-reduce operations in tensor-parallel mode (a tensor-parallel launch example is shown at the end of this document)

!!! note
    Enabling batch invariance may reduce performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.

## Future Improvements

The batch invariance feature is under active development. Planned improvements include:

- Support for additional GPU architectures
- Expanded model coverage
- Performance optimizations
- Additional testing and validation

For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).
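As noted in the implementation details above, enabling batch invariance disables custom all-reduce in tensor-parallel deployments; no launch changes are required beyond setting the environment variable. A minimal multi-GPU serving sketch (the model choice and GPU count are illustrative only):

```bash
# Tensor-parallel serving with batch invariance enabled.
# Custom all-reduce is turned off automatically when VLLM_BATCH_INVARIANT=1 is set.
VLLM_BATCH_INVARIANT=1 vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8
```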