mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2025-12-20 07:15:01 +08:00
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com> Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> Signed-off-by: herotai214 <herotai214@gmail.com> Signed-off-by: Khuong Le <khuong.le.manh@huawei.com> Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com> Co-authored-by: n00909098 <nguyen.kha.long@huawei.com> Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com> Co-authored-by: herotai214 <herotai214@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Khuong Le <khuong.le.manh@huawei.com> Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
172 lines
4.8 KiB
Markdown
172 lines
4.8 KiB
Markdown
# EPD Correctness Test
|
|
|
|
This test verifies that EPD (Encoder-Prefill-Decode) disaggregation produces identical outputs to a baseline single instance.
|
|
|
|
## What It Tests
|
|
|
|
- **Baseline**: Single vLLM instance serving a multimodal model
|
|
- **EPD (1E+1PD)**: 1 Encoder + 1 Prefill-Decode instance
|
|
- **Baseline (1P+1D)**: 1 Prefill + 1 Decode instance
|
|
- **EPD (1E+1P+1D)**: 1 Encoder + 1 Prefill + 1 Decode instance
|
|
|
|
The test ensures that disaggregated encoding produces **identical** outputs to the baseline.
|
|
|
|
Note that currently PD disaggregation set up may give slightly different results from a single instance. Therefore, we need the result from 1P+1D as the baseline for 1E+1P+1D
|
|
|
|
Please refer to [Disaggregated Encoder Feature](../../../docs/features/disagg_encoder.md) for the detailed explanation for the EPD features.
|
|
|
|
## Files
|
|
|
|
- `run_epd_correctness_test.sh` - Main test script (starts all instances and runs tests)
|
|
- `test_epd_correctness.py` - Python test script (compares outputs)
|
|
|
|
## Usage
|
|
|
|
### Multimodal Prompts (Default)
|
|
|
|
```bash
|
|
cd vllm
|
|
./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
|
|
```
|
|
|
|
This runs the test with actual multimodal (image) prompts.
|
|
|
|
### Text-Only Prompts
|
|
|
|
```bash
|
|
cd vllm
|
|
USE_MM_PROMPTS=0 ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
|
|
```
|
|
|
|
This runs a quick test with text-only prompts to verify the setup works.
|
|
|
|
### Custom Configuration
|
|
|
|
```bash
|
|
# Use specific GPUs
|
|
GPU_E=0 GPU_PD=1 GPU_P=1 GPU_D=2 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
|
|
|
|
# Use specific ports
|
|
ENDPOINT_PORT=10001 bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
|
|
|
|
# Use specific model
|
|
MODEL="Qwen/Qwen2.5-VL-3B-Instruct" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
|
|
|
|
# Use specific storage path
|
|
EC_SHARED_STORAGE_PATH="/tmp/my_ec_cache" bash ./tests/v1/ec_connector/integration/run_epd_correctness_test.sh
|
|
```
|
|
|
|
## How It Works
|
|
|
|
### Step 1: Baseline
|
|
|
|
1. Start single vLLM instance on GPU
|
|
2. Run test prompts (multimodal or text-only)
|
|
3. Save outputs to `.vllm_epd_baseline.txt`
|
|
4. Shutdown instance
|
|
|
|
### Step 2: EPD (1E + 1PD)
|
|
|
|
1. Clear encoder cache storage
|
|
2. Start instances and proxy
|
|
3. Run same test prompts
|
|
4. Assert outputs match baseline exactly
|
|
5. Shutdown instances
|
|
|
|
### Step 3: EPD (1E + 1P + 1D)
|
|
|
|
1. Clear encoder cache storage
|
|
2. Start instances and proxy
|
|
3. Run same test prompts
|
|
4. Assert outputs match baseline exactly
|
|
5. Shutdown instances
|
|
|
|
## Test Scenarios
|
|
|
|
### Multimodal Prompts (--use_mm_prompts)
|
|
|
|
Tests encoder cache transfer:
|
|
|
|
- Single image query
|
|
- Multiple images in one request
|
|
- Mixed image and text
|
|
- Image with detailed questions
|
|
|
|
### Text-Only Prompts (default)
|
|
|
|
Quick sanity check:
|
|
|
|
- Simple text queries
|
|
- Text-only explanations
|
|
- Verifies proxy routing works
|
|
|
|
## Expected Behavior
|
|
|
|
### ✅ Test Passes When
|
|
|
|
- All disagg outputs match baseline outputs exactly
|
|
- No errors during instance startup
|
|
- Encoder cache is properly saved and loaded
|
|
- Proxy correctly routes requests
|
|
|
|
### ❌ Test Fails When
|
|
|
|
- Outputs differ between baseline and disagg
|
|
- Server startup fails
|
|
- Encoder cache not found (should fallback to local execution)
|
|
- Proxy routing errors
|
|
|
|
## Notes
|
|
|
|
- The test uses deterministic generation (`temperature=0.0`, `seed=42`)
|
|
- Encoder cache should enable exact output reproduction
|
|
- Test cleans up all instances and cache files after completion
|
|
- Safe to run multiple times (idempotent)
|
|
- We setup the PD disagg part with NixlConnector. Please read details about EPD in `examples/online_serving/disaggregated_encoder/README.md`
|
|
|
|
## Requirements
|
|
|
|
- Multiple GPUs (3 for 1E+1P+1D, 2 for 1E+1PD, 1 for baseline)
|
|
- 1E+1P+1D is runnable with 2 GPU by assign E and P on the same GPU now.
|
|
- Multimodal model (e.g., Qwen2.5-VL-3B-Instruct)
|
|
- Internet access (for accessing vllm test images)
|
|
|
|
## Debugging
|
|
|
|
### Check Logs
|
|
|
|
Logs and baseline output are saved in `/tmp/` by default.
|
|
Can be customized by changing the environment variables.
|
|
|
|
### Check Encoder Cache
|
|
|
|
```bash
|
|
# Verify cache files are created
|
|
ls -la $EC_SHARED_STORAGE_PATH/
|
|
|
|
# Should see directories with mm_hash names
|
|
# Each containing encoder_cache.safetensors
|
|
```
|
|
|
|
### Manual Testing
|
|
|
|
Run individual components:
|
|
|
|
```bash
|
|
# Baseline only
|
|
python test_epd_correctness.py \
|
|
--service_url http://localhost:8000 \
|
|
--model_name Qwen/Qwen2.5-VL-3B-Instruct \
|
|
--mode baseline \
|
|
--baseline_file test_output.txt \
|
|
--use_mm_prompts
|
|
|
|
# Disagg only (requires baseline output file!)
|
|
python test_epd_correctness.py \
|
|
--service_url http://localhost:8000 \
|
|
--model_name Qwen/Qwen2.5-VL-3B-Instruct \
|
|
--mode disagg \
|
|
--baseline_file test_output.txt \
|
|
--use_mm_prompts
|
|
```
|