From 0e0a638c3b1e239ec4eaee5b4c15808768689eb0 Mon Sep 17 00:00:00 2001
From: Bram Wasti
Date: Fri, 31 Oct 2025 17:22:19 -0400
Subject: [PATCH] Batch invariance doc (#27839)

Signed-off-by: Bram Wasti
Signed-off-by: Bram Wasti
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 docs/features/batch_invariance.md | 133 ++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)
 create mode 100644 docs/features/batch_invariance.md

diff --git a/docs/features/batch_invariance.md b/docs/features/batch_invariance.md
new file mode 100644
index 0000000000000..b196db9d9c25c
--- /dev/null
+++ b/docs/features/batch_invariance.md
@@ -0,0 +1,133 @@

# Batch Invariance

!!! note
    Batch invariance is currently in beta. Some features are still under active development.
    Track progress and planned improvements at <https://github.com/vllm-project/vllm/issues/27433>.

This document shows how to enable batch invariance in vLLM. Batch invariance ensures that the output of a model is deterministic and independent of the batch size or the order of requests in a batch. A short end-to-end illustration of this guarantee follows the motivation list below.

## Motivation

Batch invariance is crucial for several use cases:

- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.
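As a concrete illustration, the sketch below generates the same prompt once on its own and once as part of a larger batch, then checks that the two completions match. This is a minimal example rather than an official test: it reuses the offline API, model, and prompts from the examples later in this document, and it assumes the hardware requirements below are met.

```python
import os

# Enable batch invariance before vLLM is initialized
# (see "Enabling Batch Invariance" below).
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64, seed=42)

prompt = "The future of AI is"
other_prompts = ["Machine learning enables", "Deep learning models can"]

# Generate the prompt by itself, then again as the first request of a batch.
solo = llm.generate([prompt], sampling_params)[0].outputs[0].text
batched = llm.generate([prompt] + other_prompts, sampling_params)[0].outputs[0].text

# With batch invariance enabled, the two completions are expected to be
# token-for-token identical.
assert solo == batched
```

Without `VLLM_BATCH_INVARIANT=1`, the two completions can diverge, because batching changes the numerics of the underlying kernels.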
## Hardware Requirements

Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher:

- **Hopper (H-series)**: H100, H200
- **Blackwell (B-series)**: B100, B200

## Enabling Batch Invariance

Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:

```bash
export VLLM_BATCH_INVARIANT=1
```

### Online Inference (Server Mode)

To start a vLLM server with batch invariance enabled:

```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Then use the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)

print(response.choices[0].text)
```

### Offline Inference

For offline batch inference with batch invariance:

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```

## Tested Models

Batch invariance has been tested and verified on the following models:

- **DeepSeek series**: `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- **Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
- **Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- **Llama 3**: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).

## Implementation Details

When batch invariance is enabled, vLLM:

1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism, such as custom all-reduce operations in tensor-parallel mode (a tensor-parallel launch example is shown at the end of this document)

!!! note
    Enabling batch invariance may reduce performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.

## Future Improvements

The batch invariance feature is under active development. Planned improvements include:

- Support for additional GPU architectures
- Expanded model coverage
- Performance optimizations
- Additional testing and validation

For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).
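As noted in the implementation details above, enabling batch invariance disables custom all-reduce in tensor-parallel deployments; no launch changes are required beyond setting the environment variable. A minimal multi-GPU serving sketch (the model choice and GPU count are illustrative only):

```bash
# Tensor-parallel serving with batch invariance enabled.
# Custom all-reduce is turned off automatically when VLLM_BATCH_INVARIANT=1 is set.
VLLM_BATCH_INVARIANT=1 vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8
```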