# Batch Invariance

!!! note
    Batch invariance is currently in beta. Some features are still under active development.
    Track progress and planned improvements at <https://github.com/vllm-project/vllm/issues/27433>

This document shows how to enable batch invariance in vLLM. Batch invariance ensures that a model's output is deterministic and does not depend on the batch size or the order of requests within a batch.

## Motivation

Batch invariance is crucial for several use cases:

- **Framework debugging**: Deterministic outputs make it easier to debug issues in the inference framework, as the same input will always produce the same output regardless of batching.
- **Model debugging**: Helps identify issues in model implementations by ensuring consistent behavior across different batch configurations.
- **Reinforcement Learning (RL)**: RL training often requires deterministic rollouts for reproducibility and stable training.
- **Large-scale inference systems**: Systems that use vLLM as a component benefit from deterministic behavior for testing, validation, and consistency guarantees.

## Hardware Requirements

Batch invariance currently requires NVIDIA GPUs with compute capability 9.0 or higher:

- **H-series**: H100, H200
- **B-series**: B100, B200

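If you are not sure which compute capability your GPU reports, the short check below reads it with PyTorch (already a vLLM dependency). This is a convenience snippet rather than part of vLLM's API, and it assumes a CUDA-capable device is visible.

```python
import torch

# Compute capability of the first visible CUDA device, e.g. (9, 0) for H100.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")

if (major, minor) < (9, 0):
    print("This GPU does not meet the compute capability 9.0 requirement.")
```
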
## Enabling Batch Invariance

Batch invariance can be enabled by setting the `VLLM_BATCH_INVARIANT` environment variable to `1`:

```bash
export VLLM_BATCH_INVARIANT=1
```

### Online Inference (Server Mode)

To start a vLLM server with batch invariance enabled:

```bash
VLLM_BATCH_INVARIANT=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Then use the OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# These requests will produce deterministic outputs
# regardless of batch size or order
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt="The future of AI is",
    max_tokens=100,
    temperature=0.7,
    seed=42,
)

print(response.choices[0].text)
```

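As a quick sanity check against a running server, you can send the same seeded request twice, ideally while other traffic is in flight, and compare the completions. The sketch below only reuses the client calls shown above; `sample_once` is just a helper name for this illustration. With batch invariance enabled, the two responses should match.

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def sample_once() -> str:
    # Identical seeded request to the one in the example above.
    response = client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        prompt="The future of AI is",
        max_tokens=100,
        temperature=0.7,
        seed=42,
    )
    return response.choices[0].text

# With batch invariance enabled, repeated seeded requests should return
# identical text even when the server batches them with other traffic.
print(sample_once() == sample_once())
```
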
### Offline Inference

For offline batch inference with batch invariance:

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
    "Machine learning enables",
    "Deep learning models can",
]

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100,
    seed=42,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,
)

# Outputs will be deterministic regardless of batch size
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated: {generated_text!r}\n")
```

## Tested Models

Batch invariance has been tested and verified on the following models:

- **DeepSeek series**: `deepseek-ai/DeepSeek-V3`, `deepseek-ai/DeepSeek-V3-0324`, `deepseek-ai/DeepSeek-R1`, `deepseek-ai/DeepSeek-V3.1`
- **Qwen3 (Dense)**: `Qwen/Qwen3-1.7B`, `Qwen/Qwen3-8B`
- **Qwen3 (MoE)**: `Qwen/Qwen3-30B-A3B`, `Qwen/Qwen3-Next-80B-A3B-Instruct`
- **Llama 3**: `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`

Other models may also work, but these have been explicitly validated. If you encounter issues with a specific model, please report them on the [GitHub issue tracker](https://github.com/vllm-project/vllm/issues/new/choose).

## Implementation Details

When batch invariance is enabled, vLLM:

1. Uses deterministic kernel implementations for attention and other operations
2. Ensures consistent numerical behavior across different batch sizes
3. Disables certain optimizations that may introduce non-determinism (such as custom all-reduce operations in tensor parallel mode)

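To observe the guarantee directly, you can compare what a prompt generates when sent on its own against what it generates as part of a larger batch. The sketch below is illustrative only; it reuses the offline API and model from the example above, with greedy sampling so that any divergence can only come from batching.

```python
import os
os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=50)

probe = "The future of AI is"
filler = ["Machine learning enables", "Deep learning models can"]

# Generate once with the probe prompt on its own...
solo = llm.generate([probe], params)[0].outputs[0].token_ids

# ...and once with the probe mixed into a larger batch.
batched = llm.generate([probe] + filler, params)[0].outputs[0].token_ids

# With batch invariance enabled, the two token sequences should be identical.
print(list(solo) == list(batched))
```
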
!!! note
    Enabling batch invariance may impact performance compared to the default non-deterministic mode. This trade-off is intentional to guarantee reproducibility.

## Future Improvements

The batch invariance feature is under active development. Planned improvements include:

- Support for additional GPU architectures
- Expanded model coverage
- Performance optimizations
- Additional testing and validation

For the latest status and to contribute ideas, see the [tracking issue](https://github.com/vllm-project/vllm/issues/27433).