diff --git a/.buildkite/nightly-benchmarks/README.md b/.buildkite/nightly-benchmarks/README.md index ae42f70077cec..3721d3d1d6749 100644 --- a/.buildkite/nightly-benchmarks/README.md +++ b/.buildkite/nightly-benchmarks/README.md @@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc ## Trigger the benchmark Performance benchmark will be triggered when: + - A PR being merged into vllm. - Every commit for those PRs with `perf-benchmarks` label AND `ready` label. @@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh ``` Runtime environment variables: + - `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0. - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file). - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file). @@ -46,12 +48,14 @@ Runtime environment variables: - `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string. Nightly benchmark will be triggered when: + - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label. ## Performance benchmark details See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases. > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead. 
+> ### Latency test Here is an example of one test inside `latency-tests.json`: @@ -100,7 +104,6 @@ We test the throughput by using `vllm bench serve` with request rate = inf to co "tensor_parallel_size": 1, "swap_space": 16, "disable_log_stats": "", - "disable_log_requests": "", "load_format": "dummy" }, "client_parameters": { @@ -149,6 +152,7 @@ Here is an example using the script to compare result_a and result_b without det Here is an example using the script to compare result_a and result_b with detail test name. `python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json` + | | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio | |---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------| | 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 | diff --git a/.buildkite/nightly-benchmarks/nightly-annotation.md b/.buildkite/nightly-benchmarks/nightly-annotation.md index ef11c040057c8..466def07b6f1f 100644 --- a/.buildkite/nightly-benchmarks/nightly-annotation.md +++ b/.buildkite/nightly-benchmarks/nightly-annotation.md @@ -1,3 +1,4 @@ +# Nightly benchmark annotation ## Description @@ -13,15 +14,15 @@ Please download the visualization scripts in the post - Find the docker we use in `benchmarking pipeline` - Deploy the docker, and inside the docker: - - Download `nightly-benchmarks.zip`. - - In the same folder, run the following code: + - Download `nightly-benchmarks.zip`. 
+ - In the same folder, run the following code: - ```bash - export HF_TOKEN= - apt update - apt install -y git - unzip nightly-benchmarks.zip - VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh - ``` + ```bash + export HF_TOKEN= + apt update + apt install -y git + unzip nightly-benchmarks.zip + VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh + ``` And the results will be inside `./benchmarks/results`. diff --git a/.buildkite/nightly-benchmarks/nightly-descriptions.md b/.buildkite/nightly-benchmarks/nightly-descriptions.md index 5f003f42f07c0..8afde017d383e 100644 --- a/.buildkite/nightly-benchmarks/nightly-descriptions.md +++ b/.buildkite/nightly-benchmarks/nightly-descriptions.md @@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/ ## Setup - Docker images: - - vLLM: `vllm/vllm-openai:v0.6.2` - - SGLang: `lmsysorg/sglang:v0.3.2-cu121` - - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` - - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` - - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.* - - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. + - vLLM: `vllm/vllm-openai:v0.6.2` + - SGLang: `lmsysorg/sglang:v0.3.2-cu121` + - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` + - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` + - *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.* + - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware - - 8x Nvidia A100 GPUs + - 8x Nvidia A100 GPUs - Workload: - - Dataset - - ShareGPT dataset - - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output) - - Decode-heavy dataset (in average 462 input tokens, 256 output tokens) - - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. - - Models: llama-3 8B, llama-3 70B. - - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). - - Average QPS (query per second): 2, 4, 8, 16, 32 and inf. - - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed. - - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). + - Dataset + - ShareGPT dataset + - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens) + - Decode-heavy dataset (on average 462 input tokens, 256 output tokens) + - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. + - Models: llama-3 8B, llama-3 70B. + - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). + - Average QPS (query per second): 2, 4, 8, 16, 32 and inf. + - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, but all with a fixed random seed. + - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
## Known issues diff --git a/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md b/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md index a1f8441ccdac8..8bb16bd3cf373 100644 --- a/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md +++ b/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md @@ -1,3 +1,4 @@ +# Performance benchmarks descriptions ## Latency tests diff --git a/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py b/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py index 05623879c0c2c..554256b4bdb8b 100644 --- a/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py +++ b/.buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py @@ -44,6 +44,7 @@ serving_column_mapping = { "test_name": "Test name", "gpu_type": "GPU", "completed": "# of req.", + "max_concurrency": "# of max concurrency.", "request_throughput": "Tput (req/s)", "total_token_throughput": "Total Token Tput (tok/s)", "output_throughput": "Output Tput (tok/s)", diff --git a/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh b/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh index b515ee43934d1..2c57666a81aa3 100644 --- a/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh +++ b/.buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh @@ -33,7 +33,7 @@ check_gpus() { check_cpus() { # check the number of CPUs and NUMA Node and GPU type. - declare -g numa_count=$(python3 -c "from numa import info;numa_size = info.get_num_configured_nodes(); print(numa_size)") + declare -g numa_count=$(lscpu | grep "NUMA node(s):" | awk '{print $3}') if [[ $numa_count -gt 0 ]]; then echo "NUMA found." 
echo $numa_count diff --git a/.buildkite/nightly-benchmarks/tests/genai-perf-tests.json b/.buildkite/nightly-benchmarks/tests/genai-perf-tests.json index edbe9f2df0ce0..f26ae7634f3d9 100644 --- a/.buildkite/nightly-benchmarks/tests/genai-perf-tests.json +++ b/.buildkite/nightly-benchmarks/tests/genai-perf-tests.json @@ -11,7 +11,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, diff --git a/.buildkite/nightly-benchmarks/tests/nightly-tests.json b/.buildkite/nightly-benchmarks/tests/nightly-tests.json index fda1a7a3ec53c..41b4a4008801d 100644 --- a/.buildkite/nightly-benchmarks/tests/nightly-tests.json +++ b/.buildkite/nightly-benchmarks/tests/nightly-tests.json @@ -35,7 +35,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, @@ -90,7 +89,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, @@ -145,7 +143,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, @@ -197,7 +194,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, @@ -251,7 +247,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, @@ -305,7 +300,6 @@ }, "vllm_server_parameters": { "disable_log_stats": "", - "disable_log_requests": "", "gpu_memory_utilization": 0.9, "num_scheduler_steps": 10, "max_num_seqs": 512, diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json 
b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json new file mode 100644 index 0000000000000..dd0e24edff98d --- /dev/null +++ b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc2.json @@ -0,0 +1,203 @@ +[ + { + "test_name": "serving_llama8B_tp1_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp2_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + 
"num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp4_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 4, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp1_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_tp2_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + 
"VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_tp4_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 4, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + } +] diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json new file mode 100644 index 0000000000000..f1bda65a7590b --- /dev/null +++ 
b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu-snc3.json @@ -0,0 +1,205 @@ +[ + { + "test_name": "serving_llama8B_pp1_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_pp3_sharegpt", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_tp2pp3_sharegpt", + "qps_list": [1, 4, 16, "inf"], +
"server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "sharegpt", + "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", + "max_concurrency": 60, + "num_prompts": 200 + } + }, + { + "test_name": "serving_llama8B_pp1_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 1, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_pp3_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + 
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + }, + { + "test_name": "serving_llama8B_tp2pp3_random_128_128", + "qps_list": [1, 4, 16, "inf"], + "server_environment_variables": { + "VLLM_RPC_TIMEOUT": 100000, + "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, + "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, + "VLLM_CPU_KVCACHE_SPACE": 40 + }, + "server_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "tensor_parallel_size": 2, + "pipeline_parallel_size": 3, + "dtype": "bfloat16", + "distributed_executor_backend": "mp", + "block_size": 128, + "trust_remote_code": "", + "enable_chunked_prefill": "", + "disable_log_stats": "", + "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, + "load_format": "dummy" + }, + "client_parameters": { + "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "backend": "vllm", + "dataset_name": "random", + "random-input-len": 128, + "random-output-len": 128, + "ignore-eos": "", + "max_concurrency": 1000, + "num_prompts": 1000 + } + } +] diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json index 22f71c993ff33..f150b9abeea45 100644 --- a/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json +++
b/.buildkite/nightly-benchmarks/tests/serving-tests-cpu.json @@ -6,6 +6,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -16,8 +17,9 @@ "block_size": 128, "trust_remote_code": "", "disable_log_stats": "", - "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -36,6 +38,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -46,8 +49,9 @@ "block_size": 128, "trust_remote_code": "", "disable_log_stats": "", - "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -66,6 +70,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -76,8 +81,9 @@ "block_size": 128, "trust_remote_code": "", "disable_log_stats": "", - "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -96,6 +102,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + "VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -107,8 +114,9 @@ "trust_remote_code": "", "enable_chunked_prefill": "", "disable_log_stats": "", - "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { @@ -129,6 +137,7 @@ "VLLM_RPC_TIMEOUT": 100000, "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1, "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120, + 
"VLLM_CPU_SGL_KERNEL": 1, "VLLM_CPU_KVCACHE_SPACE": 40 }, "server_parameters": { @@ -140,8 +149,9 @@ "trust_remote_code": "", "enable_chunked_prefill": "", "disable_log_stats": "", - "disable_log_requests": "", "enforce_eager": "", + "max_num_batched_tokens": 2048, + "max_num_seqs": 256, "load_format": "dummy" }, "client_parameters": { diff --git a/.buildkite/nightly-benchmarks/tests/serving-tests.json b/.buildkite/nightly-benchmarks/tests/serving-tests.json index 13fd5aa8db97b..a6d4141d5c2dc 100644 --- a/.buildkite/nightly-benchmarks/tests/serving-tests.json +++ b/.buildkite/nightly-benchmarks/tests/serving-tests.json @@ -7,7 +7,6 @@ "tensor_parallel_size": 1, "swap_space": 16, "disable_log_stats": "", - "disable_log_requests": "", "load_format": "dummy" }, "client_parameters": { @@ -26,7 +25,6 @@ "tensor_parallel_size": 4, "swap_space": 16, "disable_log_stats": "", - "disable_log_requests": "", "load_format": "dummy" }, "client_parameters": { @@ -45,7 +43,6 @@ "tensor_parallel_size": 2, "swap_space": 16, "disable_log_stats": "", - "disable_log_requests": "", "load_format": "dummy" }, "client_parameters": { @@ -60,8 +57,7 @@ "test_name": "serving_llama70B_tp4_sharegpt_specdecode", "qps_list": [2], "server_parameters": { - "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", - "disable_log_requests": "", + "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "tensor_parallel_size": 4, "swap_space": 16, "speculative_config": { diff --git a/.buildkite/scripts/hardware_ci/run-gh200-test.sh b/.buildkite/scripts/hardware_ci/run-gh200-test.sh index 8c64e14606d3b..f69e4b06680f5 100644 --- a/.buildkite/scripts/hardware_ci/run-gh200-test.sh +++ b/.buildkite/scripts/hardware_ci/run-gh200-test.sh @@ -16,8 +16,7 @@ DOCKER_BUILDKIT=1 docker build . 
\ --build-arg max_jobs=66 \ --build-arg nvcc_threads=2 \ --build-arg RUN_WHEEL_CHECK=false \ - --build-arg torch_cuda_arch_list="9.0+PTX" \ - --build-arg vllm_fa_cmake_gpu_arches="90-real" + --build-arg torch_cuda_arch_list="9.0+PTX" # Setup cleanup remove_docker_container() { docker rm -f gh200-test || true; } diff --git a/.buildkite/scripts/tpu/run_bm.sh b/.buildkite/scripts/tpu/run_bm.sh index beecaf7a740ae..b1e17b438578d 100755 --- a/.buildkite/scripts/tpu/run_bm.sh +++ b/.buildkite/scripts/tpu/run_bm.sh @@ -44,7 +44,6 @@ echo VLLM_USE_V1=1 vllm serve $MODEL \ --seed 42 \ - --disable-log-requests \ --max-num-seqs $MAX_NUM_SEQS \ --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \ --tensor-parallel-size $TENSOR_PARALLEL_SIZE \ diff --git a/.buildkite/test-pipeline.yaml b/.buildkite/test-pipeline.yaml index 948ce9e8667f5..e139c6b30586e 100644 --- a/.buildkite/test-pipeline.yaml +++ b/.buildkite/test-pipeline.yaml @@ -82,7 +82,7 @@ steps: - bash standalone_tests/python_only_compile.sh - label: Basic Correctness Test # 30min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] fast_check: true torch_nightly: true source_file_dependencies: @@ -99,7 +99,7 @@ steps: - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py - label: Chunked Prefill Test - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/ - tests/basic_correctness/test_chunked_prefill @@ -108,7 +108,7 @@ steps: - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py - label: Core Test # 10min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] fast_check: true source_file_dependencies: - vllm/core @@ -128,11 +128,10 @@ steps: - tests/entrypoints/offline_mode commands: - export VLLM_WORKER_MULTIPROC_METHOD=spawn - - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py 
--ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py + - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_collective_rpc.py - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process - pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process - - VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process - VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests - label: Entrypoints Test (API Server) # 40min @@ -210,7 +209,7 @@ steps: - pytest -v -s distributed/test_eplb_execute.py - label: Metrics, Tracing Test # 10min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] num_gpus: 2 source_file_dependencies: - vllm/ @@ -229,7 +228,7 @@ steps: ##### 1 GPU test ##### - label: Regression Test # 5min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/ - tests/test_regression @@ -281,7 +280,7 @@ steps: - pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine - label: Examples Test # 25min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/examples" source_file_dependencies: - vllm/entrypoints @@ -306,7 +305,7 @@ steps: - VLLM_USE_V1=0 python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2 - label: Prefix Caching Test # 9min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] 
source_file_dependencies: - vllm/ - tests/prefix_caching @@ -315,7 +314,7 @@ steps: - label: Platform Tests (CUDA) - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/ - tests/cuda @@ -354,9 +353,10 @@ steps: - pytest -v -s compile/test_silu_mul_quant_fusion.py - pytest -v -s compile/test_sequence_parallelism.py - pytest -v -s compile/test_async_tp.py + - pytest -v -s compile/test_fusion_all_reduce.py - label: PyTorch Fullgraph Smoke Test # 9min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] torch_nightly: true source_file_dependencies: - vllm/ @@ -369,7 +369,7 @@ steps: - pytest -v -s compile/piecewise/test_full_cudagraph.py - label: PyTorch Fullgraph Test # 18min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] torch_nightly: true source_file_dependencies: - vllm/ @@ -378,7 +378,7 @@ steps: - pytest -v -s compile/test_full_graph.py - label: Kernels Core Operation Test - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - csrc/ - tests/kernels/core @@ -403,20 +403,21 @@ steps: - vllm/model_executor/layers/quantization - tests/kernels/quantization commands: - - pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT + - pytest -v -s kernels/quantization --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT parallelism: 2 -- label: Kernels MoE Test +- label: Kernels MoE Test %N mirror_hardwares: [amdexperimental] source_file_dependencies: - csrc/moe/ - tests/kernels/moe - vllm/model_executor/layers/fused_moe/ commands: - - pytest -v -s kernels/moe + - pytest -v -s kernels/moe --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT + parallelism: 2 - label: Kernels Mamba Test - mirror_hardwares: [amdexperimental, amdproduction] + 
mirror_hardwares: [amdexperimental] source_file_dependencies: - csrc/mamba/ - tests/kernels/mamba @@ -424,7 +425,7 @@ steps: - pytest -v -s kernels/mamba - label: Tensorizer Test # 11min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] soft_fail: true source_file_dependencies: - vllm/model_executor/model_loader @@ -437,7 +438,7 @@ steps: - pytest -v -s entrypoints/openai/test_tensorizer_entrypoint.py - label: Model Executor Test - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/model_executor - tests/model_executor @@ -447,7 +448,7 @@ steps: - pytest -v -s model_executor - label: Benchmarks # 9min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/.buildkite" source_file_dependencies: - benchmarks/ @@ -455,7 +456,7 @@ steps: - bash scripts/run-benchmarks.sh - label: Benchmarks CLI Test # 10min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/ - tests/benchmarks/ @@ -494,7 +495,7 @@ steps: - pytest -s entrypoints/openai/correctness/ - label: Encoder Decoder tests # 5min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] source_file_dependencies: - vllm/ - tests/encoder_decoder @@ -502,7 +503,7 @@ steps: - pytest -v -s encoder_decoder - label: OpenAI-Compatible Tool Use # 20 min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] fast_check: false source_file_dependencies: - vllm/ @@ -580,7 +581,8 @@ steps: - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git - pip freeze | grep -E 'torch' - pytest -v -s models/multimodal/processing - - pytest -v -s --ignore models/multimodal/generation/test_whisper.py models/multimodal -m core_model + - pytest -v -s --ignore models/multimodal/generation/test_whisper.py --ignore 
models/multimodal/test_tensor_schema.py models/multimodal -m core_model + - pytest -v -s models/multimodal/test_tensor_schema.py -m core_model # Needs mp_method="spawn" - cd .. && pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work - label: Multi-Modal Models Test (Extended) 1 @@ -623,7 +625,7 @@ steps: # This test is used only in PR development phase to test individual models and should never run on main - label: Custom Models Test - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] optional: true commands: - echo 'Testing custom models...' @@ -643,11 +645,40 @@ steps: - python3 examples/offline_inference/audio_language.py --model-type whisper - python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl +- label: Blackwell Test + working_dir: "/vllm-workspace/" + gpu: b200 + # optional: true + source_file_dependencies: + - csrc/quantization/fp4/ + - csrc/attention/mla/ + - csrc/quantization/cutlass_w8a8/moe/ + - vllm/model_executor/layers/fused_moe/cutlass_moe.py + - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py + - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py + - vllm/v1/attention/backends/flashinfer.py + - vllm/compilation/fusion.py + commands: + - nvidia-smi + - python3 examples/offline_inference/basic/chat.py + # Attention + # num_heads2 broken by https://github.com/flashinfer-ai/flashinfer/issues/1353 + - pytest -v -s tests/kernels/attention/test_flashinfer.py -k 'not num_heads2' + - pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_attention.py + - pytest -v -s tests/kernels/test_cutlass_mla_decode.py + # Quantization + - pytest -v -s tests/kernels/quantization/test_cutlass_scaled_mm.py -k 'fp8' + - pytest -v -s tests/kernels/quantization/test_nvfp4_quant.py + - pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py + - pytest -v -s 
tests/kernels/moe/test_nvfp4_moe.py + # Fusion + - pytest -v -s tests/compile/test_fusion_all_reduce.py + ##### 1 GPU test ##### ##### multi gpus test ##### - label: Distributed Comm Ops Test # 7min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/tests" num_gpus: 2 source_file_dependencies: @@ -718,7 +749,6 @@ steps: # this test fails consistently. # TODO: investigate and fix - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py - - VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown - pytest -v -s models/multimodal/generation/test_maverick.py @@ -744,7 +774,7 @@ steps: - pytest -v -s plugins/lora_resolvers # unit tests for in-tree lora resolver plugins - label: Multi-step Tests (4 GPUs) # 36min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/tests" num_gpus: 4 source_file_dependencies: @@ -765,7 +795,7 @@ steps: - pytest -v -s multi_step/test_correctness_llm.py - label: Pipeline Parallelism Test # 45min - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] working_dir: "/vllm-workspace/tests" num_gpus: 4 source_file_dependencies: @@ -779,7 +809,7 @@ steps: - pytest -v -s distributed/test_pipeline_parallel.py - label: LoRA TP Test (Distributed) - mirror_hardwares: [amdexperimental, amdproduction] + mirror_hardwares: [amdexperimental] num_gpus: 4 source_file_dependencies: - vllm/lora @@ -792,6 +822,7 @@ steps: # requires multi-GPU testing for validation. 
- pytest -v -s -x lora/test_chatglm3_tp.py - pytest -v -s -x lora/test_llama_tp.py + - pytest -v -s -x lora/test_multi_loras_with_tp.py - label: Weight Loading Multiple GPU Test # 33min diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 2441055371663..5bc944296763d 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -10,7 +10,6 @@ /vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill /vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill /vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth -/vllm/model_executor/guided_decoding @mgoin @russellb @aarnphm /vllm/multimodal @DarkLight1337 @ywang96 /vllm/vllm_flash_attn @LucasWilkinson /vllm/lora @jeejeelee @@ -35,9 +34,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson /tests/distributed/test_pipeline_parallel.py @youkaichao /tests/distributed/test_same_node.py @youkaichao /tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo @aarnphm -/tests/entrypoints/llm/test_guided_generate.py @mgoin @russellb @aarnphm /tests/kernels @tlrmchlsmth @WoosukKwon -/tests/model_executor/test_guided_processors.py @mgoin @russellb /tests/models @DarkLight1337 @ywang96 /tests/multi_step @alexm-redhat @comaniac /tests/multimodal @DarkLight1337 @ywang96 @@ -64,3 +61,15 @@ mkdocs.yaml @hmellor /vllm/v1/worker/^xpu @jikunshang /vllm/platforms/xpu.py @jikunshang /docker/Dockerfile.xpu @jikunshang + +# Qwen-specific files +/vllm/attention/backends/dual_chunk_flash_attn.py @sighingnow +/vllm/model_executor/models/qwen* @sighingnow + +# Mistral-specific files +/vllm/model_executor/models/mistral*.py @patrickvonplaten +/vllm/model_executor/models/mixtral*.py @patrickvonplaten +/vllm/model_executor/models/voxtral*.py @patrickvonplaten +/vllm/model_executor/models/pixtral*.py @patrickvonplaten +/vllm/transformers_utils/configs/mistral.py @patrickvonplaten +/vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten diff --git 
a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 017ec7ca82da7..d4aceab4472fa 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,4 +1,5 @@ -## Essential Elements of an Effective PR Description Checklist +# Essential Elements of an Effective PR Description Checklist + - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)". - [ ] The test plan, such as providing test command. - [ ] The test results, such as pasting the results comparison before and after, or e2e results @@ -14,5 +15,4 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE B ## (Optional) Documentation Update - **BEFORE SUBMITTING, PLEASE READ ** (anything written below this line will be removed by GitHub Actions) diff --git a/.github/mergify.yml b/.github/mergify.yml index 5c878ac02069f..d8ae509e0ac30 100644 --- a/.github/mergify.yml +++ b/.github/mergify.yml @@ -149,9 +149,6 @@ pull_request_rules: - files=examples/offline_inference/structured_outputs.py - files=examples/online_serving/openai_chat_completion_structured_outputs.py - files=examples/online_serving/openai_chat_completion_structured_outputs_with_reasoning.py - - files~=^vllm/model_executor/guided_decoding/ - - files=tests/model_executor/test_guided_processors.py - - files=tests/entrypoints/llm/test_guided_generate.py - files~=^tests/v1/structured_output/ - files=tests/v1/entrypoints/llm/test_guided_generate.py - files~=^vllm/v1/structured_output/ diff --git a/.github/workflows/lint-and-deploy.yaml b/.github/workflows/lint-and-deploy.yaml index d5736c0aee208..2b1086b7faf43 100644 --- a/.github/workflows/lint-and-deploy.yaml +++ b/.github/workflows/lint-and-deploy.yaml @@ -2,12 +2,16 @@ name: Lint and Deploy Charts on: pull_request +concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true + permissions: contents: read jobs: lint-and-deploy: - runs-on: ubuntu-24.04-arm + 
runs-on: ubuntu-latest steps: - name: Checkout uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 diff --git a/.github/workflows/matchers/markdownlint.json b/.github/workflows/matchers/markdownlint.json new file mode 100644 index 0000000000000..fe094a9badb25 --- /dev/null +++ b/.github/workflows/matchers/markdownlint.json @@ -0,0 +1,17 @@ +{ + "problemMatcher": [ + { + "owner": "markdownlint", + "pattern": [ + { + "regexp": "^([^:]*):(\\d+):?(\\d+)?\\s([\\w-\\/]*)\\s(.*)$", + "file": 1, + "line": 2, + "column": 3, + "code": 4, + "message": 5 + } + ] + } + ] +} \ No newline at end of file diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml index 8e694d18134ef..195579f206a2f 100644 --- a/.github/workflows/pre-commit.yml +++ b/.github/workflows/pre-commit.yml @@ -5,6 +5,10 @@ on: push: branches: [main] +concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: ${{ github.event_name == 'pull_request' }} + permissions: contents: read @@ -17,6 +21,7 @@ jobs: with: python-version: "3.12" - run: echo "::add-matcher::.github/workflows/matchers/actionlint.json" + - run: echo "::add-matcher::.github/workflows/matchers/markdownlint.json" - run: echo "::add-matcher::.github/workflows/matchers/mypy.json" - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1 with: diff --git a/.github/workflows/scripts/build.sh b/.github/workflows/scripts/build.sh index 0f010832b465d..c69ebbb42da5a 100644 --- a/.github/workflows/scripts/build.sh +++ b/.github/workflows/scripts/build.sh @@ -15,7 +15,6 @@ $python_executable -m pip install -r requirements/build.txt -r requirements/cuda export MAX_JOBS=1 # Make sure release wheels are built for the following architectures export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX" -export VLLM_FA_CMAKE_GPU_ARCHES="80-real;90-real" bash tools/check_repo.sh diff --git a/.markdownlint.yaml b/.markdownlint.yaml new file mode 100644 index 
0000000000000..c86fed9555d62 --- /dev/null +++ b/.markdownlint.yaml @@ -0,0 +1,13 @@ +MD007: + indent: 4 +MD013: false +MD024: + siblings_only: true +MD033: false +MD042: false +MD045: false +MD046: false +MD051: false +MD052: false +MD053: false +MD059: false diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5197820fb4020..612b290e88d46 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -35,12 +35,12 @@ repos: exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*' types_or: [c++, cuda] args: [--style=file, --verbose] -- repo: https://github.com/jackdewinter/pymarkdown - rev: v0.9.29 +- repo: https://github.com/igorshubovych/markdownlint-cli + rev: v0.45.0 hooks: - - id: pymarkdown + - id: markdownlint exclude: '.*\.inc\.md' - args: [fix] + stages: [manual] # Only run in CI - repo: https://github.com/rhysd/actionlint rev: v1.7.7 hooks: diff --git a/CMakeLists.txt b/CMakeLists.txt index 664fb6a0ee9f0..e2cc0ccdef515 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -243,6 +243,7 @@ set(VLLM_EXT_SRC "csrc/sampler.cu" "csrc/cuda_view.cu" "csrc/quantization/gptq/q_gemm.cu" + "csrc/quantization/compressed_tensors/int8_quant_kernels.cu" "csrc/quantization/fp8/common.cu" "csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu" "csrc/quantization/gguf/gguf_kernel.cu" @@ -296,8 +297,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") "csrc/sparse/cutlass/sparse_scaled_mm_entry.cu" "csrc/cutlass_extensions/common.cpp" "csrc/attention/mla/cutlass_mla_entry.cu" - "csrc/quantization/fp8/per_token_group_quant.cu" - "csrc/quantization/compressed_tensors/int8_quant_kernels.cu") + "csrc/quantization/fp8/per_token_group_quant.cu") set_gencode_flags_for_srcs( SRCS "${VLLM_EXT_SRC}" @@ -529,6 +529,25 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") endif() endif() + # The nvfp4_scaled_mm_sm120 kernels for Geforce Blackwell SM120 require + # CUDA 12.8 or later + 
cuda_archs_loose_intersection(FP4_ARCHS "12.0;12.0a" "${CUDA_ARCHS}") + if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND FP4_ARCHS) + set(SRCS + "csrc/quantization/fp4/nvfp4_quant_kernels.cu" + "csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu") + set_gencode_flags_for_srcs( + SRCS "${SRCS}" + CUDA_ARCHS "${FP4_ARCHS}") + list(APPEND VLLM_EXT_SRC "${SRCS}") + list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4_SM120=1") + message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}") + else() + message(STATUS "Not building NVFP4 as no compatible archs were found.") + # clear FP4_ARCHS + set(FP4_ARCHS) + endif() + # FP4 Archs and flags cuda_archs_loose_intersection(FP4_ARCHS "10.0a" "${CUDA_ARCHS}") if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND FP4_ARCHS) @@ -541,7 +560,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA") SRCS "${SRCS}" CUDA_ARCHS "${FP4_ARCHS}") list(APPEND VLLM_EXT_SRC "${SRCS}") - list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4=1") + list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4_SM100=1") list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1") message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}") else() diff --git a/README.md b/README.md index dc2f0afbe3538..5348405b72d2c 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,4 @@ +

@@ -16,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone --- *Latest News* 🔥 + - [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing). - [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/). - [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing). @@ -46,6 +48,7 @@ Easy, fast, and cheap LLM serving for everyone --- + ## About vLLM is a fast and easy-to-use library for LLM inference and serving. @@ -75,6 +78,7 @@ vLLM is flexible and easy to use with: - Multi-LoRA support vLLM seamlessly supports most popular open-source models on HuggingFace, including: + - Transformer-like LLMs (e.g., Llama) - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3) - Embedding Models (e.g., E5-Mistral) @@ -91,6 +95,7 @@ pip install vllm ``` Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more. + - [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html) - [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) - [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) @@ -107,6 +112,7 @@ vLLM is a community project. 
Our compute resources for development and testing a Cash Donations: + - a16z - Dropbox - Sequoia Capital @@ -114,6 +120,7 @@ Cash Donations: - ZhenFund Compute Resources: + - AMD - Anyscale - AWS diff --git a/RELEASE.md b/RELEASE.md index 9352e7ef706c6..db0d51afc7be1 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -60,9 +60,10 @@ Please note: **No feature work allowed for cherry picks**. All PRs that are cons Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI. **Current Coverage:** + * Models: Llama3, Llama4, and Mixtral * Hardware: NVIDIA H100 and AMD MI300x -* *Note: Coverage may change based on new model releases and hardware availability* +* _Note: Coverage may change based on new model releases and hardware availability_ **Performance Validation Process:** @@ -71,11 +72,13 @@ Request write access to the [pytorch/pytorch-integration-testing](https://github **Step 2: Review Benchmark Setup** Familiarize yourself with the benchmark configurations: + * [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda) * [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm) **Step 3: Run the Benchmark** Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure: + * **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`) * **vLLM commit**: Set to the RC commit hash diff --git a/SECURITY.md b/SECURITY.md index 6053cfb41f35b..414669fb3712e 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -1,13 +1,45 @@ # Security Policy -## Reporting a Vulnerability +## Reporting security issues -If you believe you have found a security vulnerability in vLLM, we encourage you to 
let us know right away. We will investigate all legitimate reports and do our best to quickly fix the problem. +Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new). -Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new). Reports will then be triaged by the [vulnerability management team](https://docs.vllm.ai/en/latest/contributing/vulnerability_management.html). +## Issue triage ---- +Reports will then be triaged by the [vulnerability management team](https://docs.vllm.ai/en/latest/contributing/vulnerability_management.html). + +## Threat model Please see the [Security Guide in the vLLM documentation](https://docs.vllm.ai/en/latest/usage/security.html) for more information on vLLM's security assumptions and recommendations. Please see [PyTorch's Security Policy](https://github.com/pytorch/pytorch/blob/main/SECURITY.md) for more information and recommendations on how to securely interact with models. + +## Issue severity + +We will determine the risk of each issue, taking into account our experience dealing with past issues, versions affected, common defaults, and use cases. We use the following severity categories: + +### CRITICAL Severity + +Vulnerabilities that allow remote attackers to execute arbitrary code, take full control of the system, or significantly compromise confidentiality, integrity, or availability without any interaction or privileges needed. Examples include remote code execution over the network and deserialization issues that allow exploit chains. These are generally issues rated CVSS ≥ 9.0. + +### HIGH Severity + +Serious security flaws that allow elevated impact, such as RCE in specific, limited contexts or significant data loss, but require advanced conditions or some trust. Examples include RCE in advanced deployment modes (e.g.
multi-node), or high-impact issues where some sort of privileged network access is required. These issues typically have CVSS scores between 7.0 and 8.9. + +### MODERATE Severity + +Vulnerabilities that cause denial of service or partial disruption, but do not allow arbitrary code execution or data breach and have limited impact. These issues have a CVSS rating between 4.0 and 6.9. + +### LOW Severity + +Minor issues such as informational disclosures, logging errors, non-exploitable flaws, or weaknesses that require local or high-privilege access and offer negligible impact. Examples include side-channel attacks or hash collisions. These issues often have CVSS scores less than 4.0. + +## Prenotification policy + +For certain security issues of CRITICAL, HIGH, or MODERATE severity level, we may prenotify certain organizations or vendors that ship vLLM. The purpose of this prenotification is to allow for a coordinated release of fixes for severe issues. + +* This prenotification will be in the form of a private email notification. It may also include adding security contacts to the GitHub security advisory, typically a few days before release. + +* If you wish to be added to the prenotification group, please send an email copying all the members of the [vulnerability management team](https://docs.vllm.ai/en/latest/contributing/vulnerability_management.html). Each vendor contact will be analyzed on a case-by-case basis. + +* We may withdraw organizations from receiving future prenotifications if they release fixes or any other information about issues before they are public. Group membership may also change based on policy refinements for who may be included. diff --git a/benchmarks/README.md b/benchmarks/README.md index 3b10963c3e014..d6442a4fc3872 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive datasets supported on vLLM.
It’s a living document, updated as new features and datasets become available. -**Dataset Overview** +## Dataset Overview @@ -81,16 +81,17 @@ become available. **Note**: HuggingFace dataset's `dataset-name` should be set to `hf` ---- +## 🚀 Example - Online Benchmark +
 <details>
-<summary>🚀 Example - Online Benchmark</summary>
+<summary>Show more</summary>
First start serving your model ```bash -vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests +vllm serve NousResearch/Hermes-3-Llama-3.1-8B ``` Then run the benchmarking script @@ -109,7 +110,7 @@ vllm bench serve \ If successful, you will see the following output -``` +```text ============ Serving Benchmark Result ============ Successful requests: 10 Benchmark duration (s): 5.78 @@ -133,11 +134,11 @@ P99 ITL (ms): 8.39 ================================================== ``` -**Custom Dataset** +### Custom Dataset If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl -``` +```json {"prompt": "What is the capital of India?"} {"prompt": "What is the capital of Iran?"} {"prompt": "What is the capital of China?"} @@ -145,7 +146,7 @@ If the dataset you want to benchmark is not supported yet in vLLM, even then you ```bash # start server -VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests +VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct ``` ```bash @@ -166,11 +167,11 @@ vllm bench serve --port 9001 --save-result --save-detailed \ You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`. 
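As a quick companion to the `CustomDataset` instructions in the hunk above, here is a minimal sketch that generates a conforming `.jsonl` file with one JSON object per line, each carrying the required "prompt" field; the output path and prompts are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

# Example prompts; any strings work, one per request.
prompts = [
    "What is the capital of India?",
    "What is the capital of Iran?",
    "What is the capital of China?",
]

# Write one JSON object per line, each with a "prompt" field,
# which is the shape CustomDataset expects.
path = Path(tempfile.gettempdir()) / "data.jsonl"
with path.open("w") as f:
    for p in prompts:
        f.write(json.dumps({"prompt": p}) + "\n")

# Sanity-check: every line parses and carries a "prompt" key.
rows = [json.loads(line) for line in path.read_text().splitlines()]
```

The resulting file can then be passed to the benchmark via `--dataset-path` as shown in the commands above.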
-**VisionArena Benchmark for Vision Language Models** +### VisionArena Benchmark for Vision Language Models ```bash # need a model with vision capability here -vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests +vllm serve Qwen/Qwen2-VL-7B-Instruct ``` ```bash @@ -184,7 +185,7 @@ vllm bench serve \ --num-prompts 1000 ``` -**InstructCoder Benchmark with Speculative Decoding** +### InstructCoder Benchmark with Speculative Decoding ``` bash VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \ @@ -201,13 +202,13 @@ vllm bench serve \ --num-prompts 2048 ``` -**Other HuggingFaceDataset Examples** +### Other HuggingFaceDataset Examples ```bash -vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests +vllm serve Qwen/Qwen2-VL-7B-Instruct ``` -**`lmms-lab/LLaVA-OneVision-Data`** +`lmms-lab/LLaVA-OneVision-Data`: ```bash vllm bench serve \ @@ -221,7 +222,7 @@ vllm bench serve \ --num-prompts 10 ``` -**`Aeala/ShareGPT_Vicuna_unfiltered`** +`Aeala/ShareGPT_Vicuna_unfiltered`: ```bash vllm bench serve \ @@ -234,7 +235,7 @@ vllm bench serve \ --num-prompts 10 ``` -**`AI-MO/aimo-validation-aime`** +`AI-MO/aimo-validation-aime`: ``` bash vllm bench serve \ @@ -245,7 +246,7 @@ vllm bench serve \ --seed 42 ``` -**`philschmid/mt-bench`** +`philschmid/mt-bench`: ``` bash vllm bench serve \ @@ -255,7 +256,7 @@ vllm bench serve \ --num-prompts 80 ``` -**Running With Sampling Parameters** +### Running With Sampling Parameters When using OpenAI-compatible backends such as `vllm`, optional sampling parameters can be specified. Example client command: @@ -273,25 +274,29 @@ vllm bench serve \ --num-prompts 10 ``` -**Running With Ramp-Up Request Rate** +### Running With Ramp-Up Request Rate The benchmark tool also supports ramping up the request rate over the duration of the benchmark run. This can be useful for stress testing the server or finding the maximum throughput that it can handle, given some latency budget. 
Two ramp-up strategies are supported: + - `linear`: Increases the request rate linearly from a start value to an end value. - `exponential`: Increases the request rate exponentially. The following arguments can be used to control the ramp-up: + - `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`). - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark. - `--ramp-up-end-rps`: The request rate at the end of the benchmark.
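The two ramp-up strategies described above amount to interpolating the request rate between `--ramp-up-start-rps` and `--ramp-up-end-rps` over the run. The sketch below illustrates the idea only; it is not necessarily the exact schedule `vllm bench serve` implements:

```python
def ramp_rate(strategy: str, start_rps: float, end_rps: float, progress: float) -> float:
    """Request rate at a given point of the run, with progress in [0, 1].

    linear: constant additive increase from start_rps to end_rps.
    exponential: multiplicative (geometric) growth; requires start_rps > 0.
    """
    if strategy == "linear":
        return start_rps + (end_rps - start_rps) * progress
    if strategy == "exponential":
        return start_rps * (end_rps / start_rps) ** progress
    raise ValueError(f"unknown strategy: {strategy}")

print(ramp_rate("linear", 1, 9, 0.5))        # arithmetic midpoint: 5.0
print(ramp_rate("exponential", 1, 16, 0.5))  # geometric midpoint: 4.0
```

The exponential form keeps early load gentle and back-loads the pressure, which is why it is useful for probing the server's saturation point.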
 </details>
+## 📈 Example - Offline Throughput Benchmark
+
 <details>
-<summary>📈 Example - Offline Throughput Benchmark</summary>
+<summary>Show more</summary>
@@ -305,15 +310,15 @@ vllm bench throughput \ If successful, you will see the following output -``` +```text Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s Total num prompt tokens: 5014 Total num output tokens: 1500 ``` -**VisionArena Benchmark for Vision Language Models** +### VisionArena Benchmark for Vision Language Models -``` bash +```bash vllm bench throughput \ --model Qwen/Qwen2-VL-7B-Instruct \ --backend vllm-chat \ @@ -325,13 +330,13 @@ vllm bench throughput \ The `num prompt tokens` now includes image token counts -``` +```text Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s Total num prompt tokens: 14527 Total num output tokens: 1280 ``` -**InstructCoder Benchmark with Speculative Decoding** +### InstructCoder Benchmark with Speculative Decoding ``` bash VLLM_WORKER_MULTIPROC_METHOD=spawn \ @@ -349,15 +354,15 @@ vllm bench throughput \ "prompt_lookup_min": 2}' ``` -``` +```text Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s Total num prompt tokens: 261136 Total num output tokens: 204800 ``` -**Other HuggingFaceDataset Examples** +### Other HuggingFaceDataset Examples -**`lmms-lab/LLaVA-OneVision-Data`** +`lmms-lab/LLaVA-OneVision-Data`: ```bash vllm bench throughput \ @@ -370,7 +375,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**`Aeala/ShareGPT_Vicuna_unfiltered`** +`Aeala/ShareGPT_Vicuna_unfiltered`: ```bash vllm bench throughput \ @@ -382,7 +387,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**`AI-MO/aimo-validation-aime`** +`AI-MO/aimo-validation-aime`: ```bash vllm bench throughput \ @@ -394,7 +399,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**Benchmark with LoRA Adapters** +Benchmark with LoRA adapters: ``` bash # download dataset @@ -413,20 +418,22 @@ vllm bench throughput \
 </details>
+## 🛠️ Example - Structured Output Benchmark
+
 <details>
-<summary>🛠️ Example - Structured Output Benchmark</summary>
+<summary>Show more</summary>
Benchmark the performance of structured output generation (JSON, grammar, regex). -**Server Setup** +### Server Setup ```bash -vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests +vllm serve NousResearch/Hermes-3-Llama-3.1-8B ``` -**JSON Schema Benchmark** +### JSON Schema Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -438,7 +445,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Grammar-based Generation Benchmark** +### Grammar-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -450,7 +457,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Regex-based Generation Benchmark** +### Regex-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -461,7 +468,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Choice-based Generation Benchmark** +### Choice-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -472,7 +479,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**XGrammar Benchmark Dataset** +### XGrammar Benchmark Dataset ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -485,14 +492,16 @@ python3 benchmarks/benchmark_serving_structured_output.py \
 </details>
+## 📚 Example - Long Document QA Benchmark
+
 <details>
-<summary>📚 Example - Long Document QA Benchmark</summary>
+<summary>Show more</summary>
Benchmark the performance of long document question-answering with prefix caching. -**Basic Long Document QA Test** +### Basic Long Document QA Test ```bash python3 benchmarks/benchmark_long_document_qa_throughput.py \ @@ -504,7 +513,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \ --repeat-count 5 ``` -**Different Repeat Modes** +### Different Repeat Modes ```bash # Random mode (default) - shuffle prompts randomly @@ -537,14 +546,16 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
 </details>
+## 🗂️ Example - Prefix Caching Benchmark
+
 <details>
-<summary>🗂️ Example - Prefix Caching Benchmark</summary>
+<summary>Show more</summary>
Benchmark the efficiency of automatic prefix caching. -**Fixed Prompt with Prefix Caching** +### Fixed Prompt with Prefix Caching ```bash python3 benchmarks/benchmark_prefix_caching.py \ @@ -555,7 +566,7 @@ python3 benchmarks/benchmark_prefix_caching.py \ --input-length-range 128:256 ``` -**ShareGPT Dataset with Prefix Caching** +### ShareGPT Dataset with Prefix Caching ```bash # download dataset @@ -572,14 +583,16 @@ python3 benchmarks/benchmark_prefix_caching.py \
 </details>
+## ⚡ Example - Request Prioritization Benchmark
+
 <details>
-<summary>⚡ Example - Request Prioritization Benchmark</summary>
+<summary>Show more</summary>
Benchmark the performance of request prioritization in vLLM. -**Basic Prioritization Test** +### Basic Prioritization Test ```bash python3 benchmarks/benchmark_prioritization.py \ @@ -590,7 +603,7 @@ python3 benchmarks/benchmark_prioritization.py \ --scheduling-policy priority ``` -**Multiple Sequences per Prompt** +### Multiple Sequences per Prompt ```bash python3 benchmarks/benchmark_prioritization.py \ diff --git a/benchmarks/auto_tune/README.md b/benchmarks/auto_tune/README.md index c479ff1aa29c0..9aad51df6e003 100644 --- a/benchmarks/auto_tune/README.md +++ b/benchmarks/auto_tune/README.md @@ -3,6 +3,7 @@ This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate. ## Table of Contents + - [Prerequisites](#prerequisites) - [Configuration](#configuration) - [How to Run](#how-to-run) @@ -52,7 +53,7 @@ You must set the following variables at the top of the script before execution. 1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section. 2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost. -``` +```bash cd bash auto_tune.sh ``` @@ -64,6 +65,7 @@ bash auto_tune.sh Here are a few examples of how to configure the script for different goals: ### 1. Maximize Throughput (No Latency Constraint) + - **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens. - **Configuration**: @@ -76,6 +78,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number ``` #### 2. 
Maximize Throughput with a Latency Requirement + - **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms. - **Configuration**: @@ -88,6 +91,7 @@ MAX_LATENCY_ALLOWED_MS=500 ``` #### 3. Maximize Throughput with Prefix Caching and Latency Requirements + - **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms. - **Configuration**: @@ -109,7 +113,7 @@ After the script finishes, you will find the results in a new, timestamped direc - **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found. -``` +```text # Example result.txt content hash:a1b2c3d4... max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8 diff --git a/benchmarks/auto_tune/auto_tune.sh b/benchmarks/auto_tune/auto_tune.sh index 3cd8580e065dd..82c20ffa6554c 100644 --- a/benchmarks/auto_tune/auto_tune.sh +++ b/benchmarks/auto_tune/auto_tune.sh @@ -49,6 +49,7 @@ best_throughput=0 best_max_num_seqs=0 best_num_batched_tokens=0 best_goodput=0 +best_request_rate=0 start_server() { local gpu_memory_utilization=$1 @@ -57,19 +58,35 @@ start_server() { local vllm_log=$4 local profile_dir=$5 - pkill -f vllm + pkill -if vllm - VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir vllm serve $MODEL \ - --disable-log-requests \ - --port 8004 \ - --gpu-memory-utilization $gpu_memory_utilization \ - --max-num-seqs $max_num_seqs \ - --max-num-batched-tokens $max_num_batched_tokens \ - --tensor-parallel-size $TP \ - --enable-prefix-caching \ - --load-format dummy \ - --download-dir "$DOWNLOAD_DIR" \ - --max-model-len $MAX_MODEL_LEN > "$vllm_log" 2>&1 & + # Define the common arguments as a bash array. + # Each argument and its value are separate elements. 
+ local common_args_array=( + "$MODEL" + "--disable-log-requests" + "--port" "8004" + "--gpu-memory-utilization" "$gpu_memory_utilization" + "--max-num-seqs" "$max_num_seqs" + "--max-num-batched-tokens" "$max_num_batched_tokens" + "--tensor-parallel-size" "$TP" + "--enable-prefix-caching" + "--load-format" "dummy" + "--download-dir" "$DOWNLOAD_DIR" + "--max-model-len" "$MAX_MODEL_LEN" + ) + + # Use the array expansion "${common_args_array[@]}" + # This correctly passes each element as a separate argument. + if [[ -n "$profile_dir" ]]; then + # Start server with profiling enabled + VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 VLLM_TORCH_PROFILER_DIR=$profile_dir \ + vllm serve "${common_args_array[@]}" > "$vllm_log" 2>&1 & + else + # Start server without profiling + VLLM_USE_V1=1 VLLM_SERVER_DEV_MODE=1 \ + vllm serve "${common_args_array[@]}" > "$vllm_log" 2>&1 & + fi # wait for 10 minutes... server_started=0 @@ -83,6 +100,7 @@ start_server() { sleep 10 fi done + if (( ! server_started )); then echo "server did not start within 10 minutes. Please check server log at $vllm_log". 
return 1 @@ -91,37 +109,20 @@ start_server() { fi } -update_best_profile() { - local profile_dir=$1 - local profile_index=$2 - sorted_paths=($(find "$profile_dir" -maxdepth 1 -not -path "$profile_dir" | sort)) - selected_profile_file= - if [[ "$SYSTEM" == "TPU" ]]; then - selected_profile_file="${sorted_paths[$profile_index]}/*.xplane.pb" - fi - if [[ "$SYSTEM" == "GPU" ]]; then - selected_profile_file="${sorted_paths[$profile_index]}" - fi - rm -f $PROFILE_PATH/* - cp $selected_profile_file $PROFILE_PATH -} - run_benchmark() { local max_num_seqs=$1 local max_num_batched_tokens=$2 local gpu_memory_utilization=$3 echo "max_num_seq: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens" local vllm_log="$LOG_FOLDER/vllm_log_${max_num_seqs}_${max_num_batched_tokens}.txt" - local profile_dir="$LOG_FOLDER/profile_${max_num_seqs}_${max_num_batched_tokens}" echo "vllm_log: $vllm_log" echo rm -f $vllm_log - mkdir -p $profile_dir - pkill -f vllm - local profile_index=0 + pkill -if vllm echo "starting server..." - start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log $profile_dir + # Call start_server without a profile_dir to avoid profiling overhead + start_server $gpu_memory_utilization $max_num_seqs $max_num_batched_tokens $vllm_log "" result=$? if [[ "$result" -eq 1 ]]; then echo "server failed to start. 
gpu_memory_utilization:$gpu_memory_utilization, max_num_seqs:$max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens" @@ -135,7 +136,8 @@ run_benchmark() { # get a basic qps by using request-rate inf bm_log="$LOG_FOLDER/bm_log_${max_num_seqs}_${max_num_batched_tokens}_requestrate_inf.txt" prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 )) -adjusted_input_len=$(( INPUT_LEN - prefix_len )) + adjusted_input_len=$(( INPUT_LEN - prefix_len )) + # --profile flag is removed from this call vllm bench serve \ --backend vllm \ --model $MODEL \ @@ -149,8 +151,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len )) --goodput e2el:$MAX_LATENCY_ALLOWED_MS \ --num-prompts 1000 \ --random-prefix-len $prefix_len \ - --port 8004 \ - --profile &> "$bm_log" + --port 8004 &> "$bm_log" throughput=$(grep "Request throughput (req/s):" "$bm_log" | sed 's/[^0-9.]//g') e2el=$(grep "P99 E2EL (ms):" "$bm_log" | awk '{print $NF}') goodput=$(grep "Request goodput (req/s):" "$bm_log" | sed 's/[^0-9.]//g') @@ -164,7 +165,6 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len )) # start from request-rate as int(throughput) + 1 request_rate=$((${throughput%.*} + 1)) while ((request_rate > 0)); do - profile_index=$((profile_index+1)) # clear prefix cache curl -X POST http://0.0.0.0:8004/reset_prefix_cache sleep 5 @@ -202,12 +202,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len )) best_max_num_seqs=$max_num_seqs best_num_batched_tokens=$max_num_batched_tokens best_goodput=$goodput - if [[ "$SYSTEM" == "TPU" ]]; then - update_best_profile "$profile_dir/plugins/profile" $profile_index - fi - if [[ "$SYSTEM" == "GPU" ]]; then - update_best_profile "$profile_dir" $profile_index - fi + best_request_rate=$request_rate fi else echo "max_num_seqs: $max_num_seqs, max_num_batched_tokens: $max_num_batched_tokens does not meet latency requirement ${MAX_LATENCY_ALLOWED_MS}" @@ -216,7 +211,7 @@ adjusted_input_len=$(( INPUT_LEN - prefix_len )) echo "best_max_num_seqs: $best_max_num_seqs, 
best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput" - pkill vllm + pkill -if vllm sleep 10 printf '=%.0s' $(seq 1 20) return 0 @@ -229,7 +224,8 @@ read -r -a num_batched_tokens_list <<< "$NUM_BATCHED_TOKENS_LIST" gpu_memory_utilization=0.98 find_gpu_memory_utilization=0 while (( $(echo "$gpu_memory_utilization >= 0.9" | bc -l) )); do - start_server $gpu_memory_utilization "${num_seqs_list[-1]}" "${num_batched_tokens_list[-1]}" "$LOG_FOLDER/vllm_log_gpu_memory_utilization_$gpu_memory_utilization.log" + # Pass empty string for profile_dir argument + start_server $gpu_memory_utilization "${num_seqs_list[-1]}" "${num_batched_tokens_list[-1]}" "$LOG_FOLDER/vllm_log_gpu_memory_utilization_$gpu_memory_utilization.log" "" result=$? if [[ "$result" -eq 0 ]]; then find_gpu_memory_utilization=1 @@ -252,5 +248,45 @@ for num_seqs in "${num_seqs_list[@]}"; do done done echo "finish permutations" + +# ================================================================================= +# FINAL PROFILING RUN FOR THE BEST CONFIGURATION +# ================================================================================= +if (( $(echo "$best_throughput > 0" | bc -l) )); then + echo + echo "Benchmark tuning finished. Now running profiling on the best configuration found..." + echo "Best config: max_num_seqs: $best_max_num_seqs, max_num_batched_tokens: $best_num_batched_tokens, throughput: $best_throughput" + echo + + vllm_log="$LOG_FOLDER/vllm_log_BEST_PROFILE.txt" + bm_log="$LOG_FOLDER/bm_log_BEST_PROFILE.txt" + + # Start server with the best params and profiling ENABLED + echo "Starting server for profiling..." + start_server $gpu_memory_utilization $best_max_num_seqs $best_num_batched_tokens "$vllm_log" "$PROFILE_PATH" + + # Run benchmark with the best params and the --profile flag + echo "Running benchmark with profiling..." 
+ prefix_len=$(( INPUT_LEN * MIN_CACHE_HIT_PCT / 100 )) + adjusted_input_len=$(( INPUT_LEN - prefix_len )) + vllm bench serve \ + --backend vllm \ + --model $MODEL \ + --dataset-name random \ + --random-input-len $adjusted_input_len \ + --random-output-len $OUTPUT_LEN \ + --ignore-eos \ + --disable-tqdm \ + --request-rate $best_request_rate \ + --percentile-metrics ttft,tpot,itl,e2el \ + --goodput e2el:$MAX_LATENCY_ALLOWED_MS \ + --num-prompts 100 \ + --random-prefix-len $prefix_len \ + --port 8004 \ + --profile &> "$bm_log" +else + echo "No configuration met the latency requirements. Skipping final profiling run." +fi +pkill -if vllm echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" echo "best_max_num_seqs: $best_max_num_seqs, best_num_batched_tokens: $best_num_batched_tokens, best_throughput: $best_throughput, profile saved in: $PROFILE_PATH" >> "$RESULT" diff --git a/benchmarks/benchmark_serving.py b/benchmarks/benchmark_serving.py index 53bd3247afbb6..93b72211eb332 100644 --- a/benchmarks/benchmark_serving.py +++ b/benchmarks/benchmark_serving.py @@ -5,8 +5,7 @@ r"""Benchmark online serving throughput. 
On the server side, run one of the following commands: vLLM OpenAI API server vllm serve \ - --swap-space 16 \ - --disable-log-requests + --swap-space 16 On the client side, run: python benchmarks/benchmark_serving.py \ @@ -413,6 +412,10 @@ async def benchmark( print("{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="=")) print("{:<40} {:<10}".format("Successful requests:", metrics.completed)) + if max_concurrency is not None: + print("{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency)) + if request_rate != float("inf"): + print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", request_rate)) print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration)) print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input)) print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output)) diff --git a/benchmarks/benchmark_serving_structured_output.py b/benchmarks/benchmark_serving_structured_output.py index d535cd5d7e1a6..ca6843a72aa36 100644 --- a/benchmarks/benchmark_serving_structured_output.py +++ b/benchmarks/benchmark_serving_structured_output.py @@ -4,7 +4,7 @@ r"""Benchmark online serving throughput with structured outputs. 
On the server side, run one of the following commands: (vLLM OpenAI API server) - vllm serve --disable-log-requests + vllm serve On the client side, run: python benchmarks/benchmark_serving_structured_output.py \ @@ -555,6 +555,10 @@ async def benchmark( print("{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="=")) print("{:<40} {:<10}".format("Successful requests:", metrics.completed)) + if max_concurrency is not None: + print("{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency)) + if request_rate != float("inf"): + print("{:<40} {:<10.2f}".format("Request rate configured (RPS):", request_rate)) print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration)) print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input)) print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output)) diff --git a/benchmarks/kernels/benchmark_moe.py b/benchmarks/kernels/benchmark_moe.py index c350aaf5d3ad2..72250e2fb6d2b 100644 --- a/benchmarks/kernels/benchmark_moe.py +++ b/benchmarks/kernels/benchmark_moe.py @@ -22,6 +22,13 @@ from vllm.utils import FlexibleArgumentParser FP8_DTYPE = current_platform.fp8_dtype() +def ensure_divisibility(numerator, denominator): + """Ensure that numerator is divisible by the denominator.""" + assert numerator % denominator == 0, ( + "intermediate_size {} is not divisible by tp {}.".format(numerator, denominator) + ) + + class BenchmarkConfig(TypedDict): BLOCK_SIZE_M: int BLOCK_SIZE_N: int @@ -603,7 +610,7 @@ def main(args: argparse.Namespace): topk = config.num_experts_per_tok intermediate_size = config.intermediate_size shard_intermediate_size = 2 * intermediate_size // args.tp_size - + ensure_divisibility(intermediate_size, args.tp_size) hidden_size = config.hidden_size dtype = torch.float16 if current_platform.is_rocm() else config.torch_dtype use_fp8_w8a8 = args.dtype == "fp8_w8a8" diff --git a/benchmarks/kernels/benchmark_per_token_group_quant.py 
b/benchmarks/kernels/benchmark_per_token_group_quant.py new file mode 100644 index 0000000000000..1ccb5e08b3d57 --- /dev/null +++ b/benchmarks/kernels/benchmark_per_token_group_quant.py @@ -0,0 +1,159 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import argparse +import math +from contextlib import contextmanager +from typing import Callable +from unittest.mock import patch + +import torch + +from vllm.model_executor.layers.quantization.utils import fp8_utils, int8_utils +from vllm.platforms import current_platform + + +@contextmanager +def _triton_mode(): + """Temporarily force the Triton fallback path""" + with patch("vllm.platforms.current_platform.is_cuda", return_value=False): + yield + + +def _time_cuda( + fn: Callable[[], tuple[torch.Tensor, torch.Tensor]], + warmup_iters: int, + bench_iters: int, +) -> float: + # warmup + for _ in range(warmup_iters): + fn() + torch.cuda.synchronize() + + start = torch.cuda.Event(enable_timing=True) + end = torch.cuda.Event(enable_timing=True) + + start.record() + for _ in range(bench_iters): + fn() + end.record() + torch.cuda.synchronize() + + return start.elapsed_time(end) / bench_iters # ms/iter + + +def _run_single( + shape: tuple[int, int], + group_size: int, + dtype: str, + *, + column_major: bool = False, + scale_ue8m0: bool = False, + warmup_iters: int, + bench_iters: int, +) -> None: + num_tokens, hidden_dim = shape + + device = torch.device("cuda") + torch.manual_seed(42) + x = torch.randn(num_tokens, hidden_dim, device=device, dtype=torch.bfloat16) * 8 + + if dtype == "fp8": + + def cuda_impl(): + return fp8_utils.per_token_group_quant_fp8( + x, + group_size, + column_major_scales=column_major, + use_ue8m0=scale_ue8m0, + ) + + def triton_impl(): + with _triton_mode(): + return fp8_utils.per_token_group_quant_fp8( + x, + group_size, + column_major_scales=column_major, + use_ue8m0=scale_ue8m0, + ) + elif dtype == "int8": + + def cuda_impl(): + 
return int8_utils.per_token_group_quant_int8(x, group_size) + + def triton_impl(): + with _triton_mode(): + return int8_utils.per_token_group_quant_int8(x, group_size) + else: + raise ValueError("dtype must be 'fp8' or 'int8'") + + cuda_ms = _time_cuda(cuda_impl, warmup_iters, bench_iters) + triton_ms = _time_cuda(triton_impl, warmup_iters, bench_iters) + + speedup = triton_ms / cuda_ms if cuda_ms else math.inf + + cfg_desc = ( + f"shape={shape} gs={group_size:<3} col_major={column_major:<5} " + f"ue8m0={scale_ue8m0:<5} dtype={dtype}" + ) + print( + f"{cfg_desc:55} | CUDA {cuda_ms:7.3f} ms | Triton {triton_ms:7.3f} ms | " + f"speed-up ×{speedup:5.2f}" + ) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument("--warmup-iters", type=int, default=10) + parser.add_argument("--bench-iters", type=int, default=100) + parser.add_argument("--dtype", choices=["fp8", "int8", "both"], default="both") + return parser.parse_args() + + +if __name__ == "__main__": + if not current_platform.is_cuda(): + raise RuntimeError("CUDA device is required to run this benchmark.") + + args = parse_args() + warmup_iters, bench_iters = args.warmup_iters, args.bench_iters + + shapes = [(32, 128), (64, 256), (16, 512)] + group_sizes = [64, 128] + + dtypes = ["fp8", "int8"] if args.dtype == "both" else [args.dtype] + + header = ( + "Configuration".ljust(55) + + " | " + + "CUDA (ms)".center(12) + + " | " + + "Triton (ms)".center(13) + + " | " + + "Speed-up" + ) + print(header) + print("-" * len(header)) + + for dtype in dtypes: + for shape in shapes: + for gs in group_sizes: + if dtype == "fp8": + for col_major in (False, True): + for ue8m0 in (False, True): + _run_single( + shape, + gs, + dtype, + column_major=col_major, + scale_ue8m0=ue8m0, + warmup_iters=warmup_iters, + bench_iters=bench_iters, + ) + else: # INT8 has no col-major / ue8m0 switches + _run_single( + shape, + gs, + dtype, + warmup_iters=warmup_iters, + bench_iters=bench_iters, + ) diff --git 
a/benchmarks/kernels/benchmark_reshape_and_cache_flash.py b/benchmarks/kernels/benchmark_reshape_and_cache_flash.py new file mode 100644 index 0000000000000..d4648c18f31d5 --- /dev/null +++ b/benchmarks/kernels/benchmark_reshape_and_cache_flash.py @@ -0,0 +1,156 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project +from __future__ import annotations + +import random +import time + +import torch +from tabulate import tabulate + +from vllm import _custom_ops as ops +from vllm.logger import init_logger +from vllm.platforms import current_platform +from vllm.utils import ( + STR_DTYPE_TO_TORCH_DTYPE, + FlexibleArgumentParser, + create_kv_caches_with_random_flash, +) + +logger = init_logger(__name__) + + +@torch.inference_mode() +def run_benchmark( + num_tokens: int, + num_heads: int, + head_size: int, + block_size: int, + num_blocks: int, + dtype: torch.dtype, + kv_cache_dtype: str, + kv_cache_layout: str, + num_iters: int, + device: str = "cuda", +) -> float: + """Return latency (seconds) for given num_tokens.""" + + if kv_cache_dtype == "fp8" and head_size % 16: + raise ValueError("fp8 kv-cache requires head_size to be a multiple of 16.") + + current_platform.seed_everything(42) + torch.set_default_device(device) + + # create random key / value tensors [T, H, D]. + key = torch.randn(num_tokens, num_heads, head_size, dtype=dtype, device=device) + value = torch.randn_like(key) + + # prepare the slot mapping. + # each token is assigned a unique slot in the KV-cache. 
+ num_slots = block_size * num_blocks + if num_tokens > num_slots: + raise ValueError("num_tokens cannot exceed the total number of cache slots") + slot_mapping_lst = random.sample(range(num_slots), num_tokens) + slot_mapping = torch.tensor(slot_mapping_lst, dtype=torch.long, device=device) + + key_caches, value_caches = create_kv_caches_with_random_flash( + num_blocks, + block_size, + 1, # num_layers + num_heads, + head_size, + kv_cache_dtype, + dtype, + device=device, + cache_layout=kv_cache_layout, + ) + key_cache, value_cache = key_caches[0], value_caches[0] + + # compute per-kernel scaling factors for fp8 conversion (if used). + k_scale = (key.amax() / 64.0).to(torch.float32) + v_scale = (value.amax() / 64.0).to(torch.float32) + + def run_cuda_benchmark(n_iters: int) -> float: + nonlocal key, value, key_cache, value_cache, slot_mapping + torch.cuda.synchronize() + start = time.perf_counter() + for _ in range(n_iters): + ops.reshape_and_cache_flash( + key, + value, + key_cache, + value_cache, + slot_mapping, + kv_cache_dtype, + k_scale, + v_scale, + ) + torch.cuda.synchronize() + end = time.perf_counter() + return (end - start) / n_iters + + # warm-up + run_cuda_benchmark(3) + + lat = run_cuda_benchmark(num_iters) + + # free tensors to mitigate OOM when sweeping + del key, value, key_cache, value_cache, slot_mapping + torch.cuda.empty_cache() + + return lat + + +def main(args): + rows = [] + for layout in ["NHD", "HND"]: + for exp in range(1, 17): + n_tok = 2**exp + lat = run_benchmark( + num_tokens=n_tok, + num_heads=args.num_heads, + head_size=args.head_size, + block_size=args.block_size, + num_blocks=args.num_blocks, + dtype=STR_DTYPE_TO_TORCH_DTYPE[args.dtype], + kv_cache_dtype=args.kv_cache_dtype, + kv_cache_layout=layout, + num_iters=args.iters, + device="cuda", + ) + rows.append([n_tok, layout, f"{lat * 1e6:.3f}"]) + + print(tabulate(rows, headers=["num_tokens", "layout", "latency (µs)"])) + + +if __name__ == "__main__": + parser = 
FlexibleArgumentParser() + + parser.add_argument("--num-heads", type=int, default=128) + parser.add_argument( + "--head-size", + type=int, + choices=[64, 80, 96, 112, 120, 128, 192, 256], + default=128, + ) + parser.add_argument("--block-size", type=int, choices=[16, 32], default=16) + parser.add_argument("--num-blocks", type=int, default=128 * 512) + + parser.add_argument( + "--dtype", + type=str, + choices=["half", "bfloat16", "float"], + default="bfloat16", + ) + + parser.add_argument( + "--kv-cache-dtype", + type=str, + choices=["auto", "fp8"], + default="auto", + ) + + parser.add_argument("--iters", type=int, default=100) + args = parser.parse_args() + + main(args) diff --git a/benchmarks/kernels/benchmark_trtllm_attention.py b/benchmarks/kernels/benchmark_trtllm_decode_attention.py similarity index 88% rename from benchmarks/kernels/benchmark_trtllm_attention.py rename to benchmarks/kernels/benchmark_trtllm_decode_attention.py index 8c980f930366c..77136edca45b5 100644 --- a/benchmarks/kernels/benchmark_trtllm_attention.py +++ b/benchmarks/kernels/benchmark_trtllm_decode_attention.py @@ -41,7 +41,6 @@ def benchmark_decode( device = "cuda" torch.manual_seed(0) - # Currently only HEAD_GRP_SIZE == 8 is supported HEAD_GRP_SIZE = 8 MAX_SEQ_LEN = max_seq_len @@ -71,22 +70,20 @@ def benchmark_decode( if kv_cache_dtype.startswith("fp8"): kv_cache, _ = to_float8(kv_cache) + output_trtllm = torch.empty(q.shape, dtype=dtype) + # Benchmark TRT decode def trt_decode(): return flashinfer.decode.trtllm_batch_decode_with_kv_cache( q, kv_cache, workspace_buffer, - num_qo_heads, - num_kv_heads, - sm_scale, block_tables, kv_lens_tensor, - page_size, max_kv_len, - kv_cache_dtype, - k_scale, - v_scale, + bmm1_scale=k_scale * sm_scale, + bmm2_scale=v_scale, + out=output_trtllm, ) def time_fn(fn, warmup=10, trials=20): @@ -125,6 +122,8 @@ def benchmark_decode( kv_indices = torch.tensor(kv_indices, dtype=torch.int32) kv_last_page_lens = torch.tensor(kv_last_page_lens, 
dtype=torch.int32) + output_baseline = torch.empty(q.shape, dtype=dtype) + wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper( workspace_buffer, kv_layout, @@ -145,7 +144,7 @@ def benchmark_decode( ) def baseline_decode(): - return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale) + return wrapper.run(q, kv_cache, sm_scale, k_scale, v_scale, output_baseline) baseline_mean, baseline_std = time_fn(baseline_decode) @@ -214,25 +213,39 @@ if __name__ == "__main__": max_seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072] all_results = [] - print("Running benchmark for kv_cache_dtype: bfloat16") print( - "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent" + "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: bfloat16, " + "output_dtype: bfloat16" + ) + print( + "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t" + "baseline_std\tspeedup_percent" ) for max_seq_len in max_seq_lens: for bs in num_seqs: result = benchmark_decode( - bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="auto" + bs, + max_seq_len, + dtype=torch.bfloat16, + kv_cache_dtype="auto", ) all_results.append(result) - print("Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8") print( - "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\tbaseline_std\tspeedup_percent" + "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: fp8, " + "output_dtype: bfloat16" + ) + print( + "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t" + "baseline_std\tspeedup_percent" ) for max_seq_len in max_seq_lens: for bs in num_seqs: result = benchmark_decode( - bs, max_seq_len, dtype=torch.bfloat16, kv_cache_dtype="fp8" + bs, + max_seq_len, + dtype=torch.bfloat16, + kv_cache_dtype="fp8", ) all_results.append(result) diff --git a/benchmarks/kernels/benchmark_trtllm_prefill_attention.py b/benchmarks/kernels/benchmark_trtllm_prefill_attention.py new file mode 100644 index 0000000000000..67bd9aebbcca9 --- /dev/null +++ 
b/benchmarks/kernels/benchmark_trtllm_prefill_attention.py @@ -0,0 +1,250 @@ +# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright contributors to the vLLM project + +import csv +import os +import random +from datetime import datetime + +import flashinfer +import torch + +FLOAT32_BYTES = torch.finfo(torch.float).bits // 8 + +# KV Cache Layout for TRT-LLM +# kv_cache_shape = (num_blocks, 2, num_kv_heads, page_size, head_dim) + + +def to_float8(x, dtype=torch.float8_e4m3fn): + finfo = torch.finfo(dtype) + min_val, max_val = x.aminmax() + amax = torch.maximum(min_val.abs(), max_val.abs()).clamp(min=1e-12) + scale = finfo.max / amax * 0.1 + x_scl_sat = (x * scale).clamp(min=finfo.min, max=finfo.max) + return x_scl_sat.to(dtype), scale.float().reciprocal() + + +@torch.no_grad() +def benchmark_prefill( + num_seqs, + max_seq_len, + page_size=16, + dtype=torch.bfloat16, + kv_layout="HND", + num_kv_heads=8, + kv_cache_dtype="auto", + head_dim=128, + warmup=10, + trials=20, +): + torch.set_default_device("cuda") + torch.manual_seed(0) + + HEAD_GRP_SIZE = 8 + MAX_SEQ_LEN = max_seq_len + + # large number to reduce kv_cache reuse + NUM_BLOCKS = int(256000 / page_size) + + workspace_buffer = torch.empty(1024 * 1024 * 1024, dtype=torch.int8) + + num_qo_heads = num_kv_heads * HEAD_GRP_SIZE + sm_scale = float(1.0 / (head_dim**0.5)) + + q_lens = [random.randint(1, MAX_SEQ_LEN) for _ in range(num_seqs)] + q_lens[-1] = MAX_SEQ_LEN + max_q_len = max(q_lens) + q_indptr = torch.cat( + [ + torch.tensor([0], dtype=torch.int32), + torch.cumsum( + torch.tensor(q_lens, dtype=torch.int32), dim=0, dtype=torch.int32 + ), + ] + ) + q = torch.randn(sum(q_lens), num_qo_heads, head_dim, dtype=dtype) + + kv_lens = [random.randint(0, MAX_SEQ_LEN) for _ in range(num_seqs)] + kv_lens[-1] = MAX_SEQ_LEN + + seq_lens = [q_len + kv_len for q_len, kv_len in zip(q_lens, kv_lens)] + max_seq_len = max(seq_lens) + seq_lens_tensor = torch.tensor(seq_lens, dtype=torch.int32) + + 
max_num_blocks_per_seq = (max_seq_len + page_size - 1) // page_size + block_tables = torch.randint( + 0, NUM_BLOCKS, (num_seqs, max_num_blocks_per_seq), dtype=torch.int32 + ) + + kv_cache_shape = (NUM_BLOCKS, 2, num_kv_heads, page_size, head_dim) + kv_cache = torch.randn(size=kv_cache_shape, dtype=dtype) + k_scale = v_scale = 1.0 + + if kv_cache_dtype.startswith("fp8"): + kv_cache, _ = to_float8(kv_cache) + + output_trtllm = torch.empty(q.shape, dtype=dtype) + + kv_indptr = [0] + kv_indices = [] + kv_last_page_lens = [] + for i in range(num_seqs): + seq_len = seq_lens[i] + assert seq_len > 0 + num_blocks = (seq_len + page_size - 1) // page_size + kv_indices.extend(block_tables[i, :num_blocks]) + kv_indptr.append(kv_indptr[-1] + num_blocks) + kv_last_page_len = seq_len % page_size + if kv_last_page_len == 0: + kv_last_page_len = page_size + kv_last_page_lens.append(kv_last_page_len) + + kv_indptr = torch.tensor(kv_indptr, dtype=torch.int32) + kv_indices = torch.tensor(kv_indices, dtype=torch.int32) + kv_last_page_lens = torch.tensor(kv_last_page_lens, dtype=torch.int32) + + output_baseline = torch.empty(q.shape, dtype=dtype) + + wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper( + workspace_buffer, kv_layout + ) + wrapper.plan( + q_indptr, + kv_indptr, + kv_indices, + kv_last_page_lens, + num_qo_heads, + num_kv_heads, + head_dim, + page_size, + causal=True, + sm_scale=sm_scale, + q_data_type=dtype, + kv_data_type=kv_cache.dtype, + ) + + def time_fn(fn, warmup=10, trials=20): + torch.cuda.synchronize() + start = torch.cuda.Event(enable_timing=True) + end = torch.cuda.Event(enable_timing=True) + times = [] + for i in range(warmup): + fn() + for i in range(trials): + start.record() + fn() + end.record() + torch.cuda.synchronize() + times.append(start.elapsed_time(end)) # ms + return sum(times) / len(times), torch.std(torch.tensor(times)) + + def baseline_prefill(): + return wrapper.run( + q, kv_cache, k_scale=k_scale, v_scale=v_scale, out=output_baseline + ) + + 
def trt_prefill(): + return flashinfer.prefill.trtllm_batch_context_with_kv_cache( + query=q, + kv_cache=kv_cache, + workspace_buffer=workspace_buffer, + block_tables=block_tables, + seq_lens=seq_lens_tensor, + max_q_len=max_q_len, + max_kv_len=max_seq_len, + bmm1_scale=k_scale * sm_scale, + bmm2_scale=v_scale, + batch_size=num_seqs, + cum_seq_lens_q=q_indptr, + cum_seq_lens_kv=kv_indptr, + out=output_trtllm, + ) + + trt_mean, trt_std = time_fn(trt_prefill) + baseline_mean, baseline_std = time_fn(baseline_prefill) + + # Calculate percentage speedup (positive means TRT is faster) + speedup_percent = (baseline_mean - trt_mean) / baseline_mean + + print( + f"\t{num_seqs}\t{max_seq_len}\t{trt_mean:.5f}\t{trt_std.item():.5f}" + f"\t{baseline_mean:.5f}\t{baseline_std.item():.5f}\t{speedup_percent:.5f}" + ) + + # Return results for CSV writing + return { + "num_seqs": num_seqs, + "trt_mean": trt_mean, + "trt_std": trt_std.item(), + "baseline_mean": baseline_mean, + "baseline_std": baseline_std.item(), + "speedup_percent": speedup_percent, + "q_dtype": str(dtype), + "kv_cache_dtype": kv_cache_dtype, + "page_size": page_size, + "num_kv_heads": num_kv_heads, + "head_dim": head_dim, + "max_seq_len": max_seq_len, + } + + +def write_results_to_csv(results, filename=None): + """Write benchmark results to CSV file.""" + if filename is None: + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + filename = f"flashinfer_trtllm_benchmark_{timestamp}.csv" + + fieldnames = [ + "num_seqs", + "trt_mean", + "trt_std", + "baseline_mean", + "baseline_std", + "speedup_percent", + "q_dtype", + "kv_cache_dtype", + "page_size", + "num_kv_heads", + "head_dim", + "max_seq_len", + ] + + file_exists = os.path.exists(filename) + + with open(filename, "a", newline="") as csvfile: + writer = csv.DictWriter(csvfile, fieldnames=fieldnames) + + if not file_exists: + writer.writeheader() + + for result in results: + writer.writerow(result) + + print(f"Results written to {filename}") + + +if __name__ ==
"__main__": + num_seqs = [1, 4, 8, 16, 32, 64, 128, 256] + max_seq_lens = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072] + all_results = [] + + print( + "Running benchmark for q_dtype = bfloat16, kv_cache_dtype: bfloat16, " + "output_dtype: bfloat16" + ) + print( + "\tnum_seqs\tmax_seq_len\ttrt_mean\ttrt_std\tbaseline_mean\t" + "baseline_std\tspeedup_percent" + ) + for max_seq_len in max_seq_lens: + for bs in num_seqs: + result = benchmark_prefill( + bs, + max_seq_len, + dtype=torch.bfloat16, + kv_cache_dtype="auto", + ) + all_results.append(result) + + # Write all results to CSV + write_results_to_csv(all_results) diff --git a/benchmarks/kernels/deepgemm/README.md b/benchmarks/kernels/deepgemm/README.md index 917e814010f89..41e68e047be82 100644 --- a/benchmarks/kernels/deepgemm/README.md +++ b/benchmarks/kernels/deepgemm/README.md @@ -8,7 +8,7 @@ Currently this just includes dense GEMMs and only works on Hopper GPUs. You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory: -``` +```bash git clone --recursive https://github.com/deepseek-ai/DeepGEMM cd DeepGEMM python setup.py install @@ -17,7 +17,7 @@ uv pip install -e . ## Usage -``` +```console python benchmark_fp8_block_dense_gemm.py INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda. 
===== STARTING FP8 GEMM BENCHMARK ===== diff --git a/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py b/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py index 43c54d56ca8c1..b99c2099f2c38 100644 --- a/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py +++ b/benchmarks/kernels/deepgemm/benchmark_fp8_block_dense_gemm.py @@ -4,49 +4,16 @@ # ruff: noqa: E501 import time -# Import DeepGEMM functions -import deep_gemm import torch -from deep_gemm import calc_diff, ceil_div, get_col_major_tma_aligned_tensor -# Import vLLM functions from vllm import _custom_ops as ops from vllm.model_executor.layers.quantization.utils.fp8_utils import ( + get_col_major_tma_aligned_tensor, per_token_group_quant_fp8, w8a8_block_fp8_matmul, ) from vllm.triton_utils import triton - - -# Copied from -# https://github.com/deepseek-ai/DeepGEMM/blob/78cacf70d41d15d688bd493ebc85845f7f2a3d5d/tests/test_core.py#L9 -def per_token_cast_to_fp8( - x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: - """Convert tensor to FP8 format with per-token scaling.""" - assert x.dim() == 2 and x.size(1) % 128 == 0 - m, n = x.shape - x_view = x.view(m, -1, 128) - x_amax = x_view.abs().float().amax(dim=2).view(m, -1).clamp(1e-4) - return (x_view * (448.0 / x_amax.unsqueeze(2))).to( - torch.float8_e4m3fn).view(m, n), (x_amax / 448.0).view(m, -1) - - -# Copied from -# https://github.com/deepseek-ai/DeepGEMM/blob/78cacf70d41d15d688bd493ebc85845f7f2a3d5d/tests/test_core.py#L17 -def per_block_cast_to_fp8( - x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]: - """Convert tensor to FP8 format with per-block scaling.""" - assert x.dim() == 2 - m, n = x.shape - x_padded = torch.zeros((ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), - dtype=x.dtype, - device=x.device) - x_padded[:m, :n] = x - x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128) - x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4) - x_scaled = (x_view * (448.0 / 
x_amax)).to(torch.float8_e4m3fn) - return x_scaled.view_as(x_padded)[:m, :n].contiguous(), ( - x_amax / 448.0).view(x_view.size(0), x_view.size(2)) +from vllm.utils.deep_gemm import calc_diff, fp8_gemm_nt, per_block_cast_to_fp8 def benchmark_shape(m: int, @@ -69,14 +36,14 @@ def benchmark_shape(m: int, # Pre-quantize B for all implementations # (weights can be pre-quantized offline) - B_deepgemm, B_scale_deepgemm = per_block_cast_to_fp8(B) - B_vllm, B_scale_vllm = per_block_cast_to_fp8(B) + B_deepgemm, B_scale_deepgemm = per_block_cast_to_fp8(B, [128, 128], use_ue8m0=True) + B_vllm, B_scale_vllm = per_block_cast_to_fp8(B, [128, 128], use_ue8m0=True) # Block size configuration block_size = [128, 128] # Pre-quantize A for all implementations - A_deepgemm, A_scale_deepgemm = per_token_cast_to_fp8(A) + A_deepgemm, A_scale_deepgemm = per_token_group_quant_fp8(A, block_size[1]) A_scale_deepgemm = get_col_major_tma_aligned_tensor(A_scale_deepgemm) C_deepgemm = torch.empty((m, n), device='cuda', dtype=torch.bfloat16) A_vllm, A_scale_vllm = per_token_group_quant_fp8(A, block_size[1]) @@ -85,7 +52,7 @@ def benchmark_shape(m: int, # === DeepGEMM Implementation === def deepgemm_gemm(): - deep_gemm.gemm_fp8_fp8_bf16_nt((A_deepgemm, A_scale_deepgemm), + fp8_gemm_nt((A_deepgemm, A_scale_deepgemm), (B_deepgemm, B_scale_deepgemm), C_deepgemm) return C_deepgemm diff --git a/csrc/cache_kernels.cu b/csrc/cache_kernels.cu index 88559c8fe7183..131dcb15cd7e9 100644 --- a/csrc/cache_kernels.cu +++ b/csrc/cache_kernels.cu @@ -5,6 +5,7 @@ #include "cuda_utils.h" #include "cuda_compat.h" #include "dispatch_utils.h" +#include "quantization/vectorization_utils.cuh" #ifdef USE_ROCM #include "quantization/fp8/amd/quant_utils.cuh" @@ -261,14 +262,26 @@ __global__ void reshape_and_cache_kernel( } } +// Used by vectorization_utils to copy/convert one element +template <Fp8KVCacheDataType kv_dt, typename OutT, typename InT> +struct CopyWithScaleOp { + float scale; + + __device__ __forceinline__ void operator()(OutT& dst, const InT src) const { + if
constexpr (kv_dt == Fp8KVCacheDataType::kAuto) { + dst = static_cast<OutT>(src); + } else { + dst = fp8::scaled_convert<OutT, InT, kv_dt>(src, scale); + } + } +}; + template <typename scalar_t, typename cache_t, Fp8KVCacheDataType kv_dt> __global__ void reshape_and_cache_flash_kernel( const scalar_t* __restrict__ key, // [num_tokens, num_heads, head_size] const scalar_t* __restrict__ value, // [num_tokens, num_heads, head_size] - cache_t* __restrict__ key_cache, // [num_blocks, block_size, num_heads, - // head_size] - cache_t* __restrict__ value_cache, // [num_blocks, block_size, num_heads, - // head_size] + cache_t* __restrict__ key_cache, // NHD or HND, see layout comments below + cache_t* __restrict__ value_cache, // same layout as key_cache const int64_t* __restrict__ slot_mapping, // [num_tokens] const int64_t block_stride, const int64_t page_stride, const int64_t head_stride, const int64_t key_stride, @@ -282,25 +295,58 @@ __global__ void reshape_and_cache_flash_kernel( } const int64_t block_idx = slot_idx / block_size; const int64_t block_offset = slot_idx % block_size; - const int n = num_heads * head_size; - for (int i = threadIdx.x; i < n; i += blockDim.x) { - const int64_t src_key_idx = token_idx * key_stride + i; - const int64_t src_value_idx = token_idx * value_stride + i; - const int head_idx = i / head_size; - const int head_offset = i % head_size; - const int64_t tgt_key_value_idx = block_idx * block_stride + - block_offset * page_stride + - head_idx * head_stride + head_offset; - scalar_t tgt_key = key[src_key_idx]; - scalar_t tgt_value = value[src_value_idx]; - if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) { - key_cache[tgt_key_value_idx] = tgt_key; - value_cache[tgt_key_value_idx] = tgt_value; - } else { - key_cache[tgt_key_value_idx] = - fp8::scaled_convert<cache_t, scalar_t, kv_dt>(tgt_key, *k_scale); - value_cache[tgt_key_value_idx] = - fp8::scaled_convert<cache_t, scalar_t, kv_dt>(tgt_value, *v_scale); + const int n_elems = num_heads * head_size; + + // pointers to the beginning of the source row for this token.
+ const scalar_t* __restrict__ key_src = key + token_idx * key_stride; + const scalar_t* __restrict__ value_src = value + token_idx * value_stride; + + // find the start position inside the kv-cache for this token. + cache_t* __restrict__ key_dst = + key_cache + block_idx * block_stride + block_offset * page_stride; + cache_t* __restrict__ value_dst = + value_cache + block_idx * block_stride + block_offset * page_stride; + + // this is true for the NHD layout where `head_stride == head_size` + const bool is_contiguous_heads = (head_stride == head_size); + + float k_scale_val = (kv_dt == Fp8KVCacheDataType::kAuto) ? 0.f : *k_scale; + float v_scale_val = (kv_dt == Fp8KVCacheDataType::kAuto) ? 0.f : *v_scale; + constexpr int VEC_SIZE = (sizeof(scalar_t) == 2) ? 8 : 4; + CopyWithScaleOp<kv_dt, cache_t, scalar_t> k_op{k_scale_val}; + CopyWithScaleOp<kv_dt, cache_t, scalar_t> v_op{v_scale_val}; + if (is_contiguous_heads) { + // NHD layout + // kv cache: [num_blocks, block_size, num_heads, head_size] + vectorize_with_alignment<VEC_SIZE>(key_src, key_dst, n_elems, threadIdx.x, + blockDim.x, k_op); + + vectorize_with_alignment<VEC_SIZE>(value_src, value_dst, n_elems, + threadIdx.x, blockDim.x, v_op); + + } else { + // HND layout: heads are strided, but each head_size segment is contiguous + // kv cache: [num_blocks, num_heads, block_size, head_size] + const int lane = threadIdx.x & 31; // 0..31 within warp + const int warp_id = threadIdx.x >> 5; // warp index within block + const int warps_per_block = blockDim.x >> 5; + + for (int head = warp_id; head < num_heads; head += warps_per_block) { + const scalar_t* __restrict__ k_src_h = key_src + head * head_size; + const scalar_t* __restrict__ v_src_h = value_src + head * head_size; + + cache_t* __restrict__ k_dst_h = + key_dst + static_cast<int64_t>(head) * head_stride; + cache_t* __restrict__ v_dst_h = + value_dst + static_cast<int64_t>(head) * head_stride; + + // within each head, let the 32 threads of the warp perform the vector + // copy + vectorize_with_alignment<VEC_SIZE>(k_src_h, k_dst_h, head_size, lane, 32, + k_op);
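As a cross-check outside the patch, the destination indexing shared by the two cache layouts above (NHD and HND) can be reproduced with a small NumPy sketch. The tensor sizes here are hypothetical, and strides are expressed in elements, mirroring `block_stride`, `page_stride`, and `head_stride`; the point is that one flat-index formula serves both layouts, only the stride values differ:

```python
import numpy as np

def kv_flat_index(block_stride, page_stride, head_stride,
                  block_idx, block_offset, head, offset):
    # Same formula the kernel uses for both layouts:
    # block_idx * block_stride + block_offset * page_stride
    #   + head * head_stride + offset
    return (block_idx * block_stride + block_offset * page_stride
            + head * head_stride + offset)

# Hypothetical cache geometry.
num_blocks, block_size, num_heads, head_size = 4, 16, 8, 64

# NHD: [num_blocks, block_size, num_heads, head_size] -> contiguous heads,
# so head_stride == head_size (the is_contiguous_heads case).
nhd = np.arange(num_blocks * block_size * num_heads * head_size,
                dtype=np.int64).reshape(num_blocks, block_size,
                                        num_heads, head_size)
nhd_strides = [s // nhd.itemsize for s in nhd.strides]  # elements, not bytes

# HND: [num_blocks, num_heads, block_size, head_size] -> heads are strided.
hnd = np.ascontiguousarray(nhd.transpose(0, 2, 1, 3))
hnd_strides = [s // hnd.itemsize for s in hnd.strides]

b, o, h, e = 2, 5, 3, 17  # block, slot-in-block, head, element-in-head
i_nhd = kv_flat_index(nhd_strides[0], nhd_strides[1], nhd_strides[2],
                      b, o, h, e)
i_hnd = kv_flat_index(hnd_strides[0], hnd_strides[2], hnd_strides[1],
                      b, o, h, e)
```

For HND, the stride of the `block_size` axis plays the role of `page_stride` and the stride of the `num_heads` axis the role of `head_stride`, which is why the kernel can dispatch on `head_stride == head_size` alone.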
+ + vectorize_with_alignment<VEC_SIZE>(v_src_h, v_dst_h, head_size, lane, 32, + v_op); } } } diff --git a/csrc/cpu/quant.cpp b/csrc/cpu/quant.cpp index c1f7c64ea2f49..6e120b8d20a7e 100644 --- a/csrc/cpu/quant.cpp +++ b/csrc/cpu/quant.cpp @@ -16,12 +16,14 @@ struct KernelVecType { using cvt_vec_type = vec_op::FP32Vec16; }; +#if !defined(__aarch64__) || defined(ARM_BF16_SUPPORT) template <> struct KernelVecType<c10::BFloat16> { using load_vec_type = vec_op::BF16Vec16; using azp_adj_load_vec_type = vec_op::INT32Vec16; using cvt_vec_type = vec_op::FP32Vec16; }; +#endif template <> struct KernelVecType<c10::Half> { diff --git a/csrc/moe/topk_softmax_kernels.cu b/csrc/moe/topk_softmax_kernels.cu index 0b505d2e04a21..7a7865b901de1 100644 --- a/csrc/moe/topk_softmax_kernels.cu +++ b/csrc/moe/topk_softmax_kernels.cu @@ -24,9 +24,12 @@ #ifndef USE_ROCM #include <cub/util_type.cuh> #include <cub/cub.cuh> + #include <cuda/std/functional> + using AddOp = cuda::std::plus<float>; #else #include <hipcub/util_type.hpp> #include <hipcub/hipcub.hpp> + using AddOp = cub::Sum; #endif #define MAX(a, b) ((a) > (b) ? (a) : (b)) @@ -62,7 +65,6 @@ __launch_bounds__(TPB) __global__ const int thread_row_offset = blockIdx.x * num_cols; - cub::Sum sum; float threadData(-FLT_MAX); // Don't touch finished rows.
@@ -92,7 +94,7 @@ __launch_bounds__(TPB) __global__ threadData += exp((static_cast<float>(input[idx]) - float_max)); } - const auto Z = BlockReduce(tmpStorage).Reduce(threadData, sum); + const auto Z = BlockReduce(tmpStorage).Reduce(threadData, AddOp()); if (threadIdx.x == 0) { diff --git a/csrc/quantization/compressed_tensors/int8_quant_kernels.cu b/csrc/quantization/compressed_tensors/int8_quant_kernels.cu index 6a81f159f46ae..d8369108d0bd3 100644 --- a/csrc/quantization/compressed_tensors/int8_quant_kernels.cu +++ b/csrc/quantization/compressed_tensors/int8_quant_kernels.cu @@ -1,7 +1,9 @@ #include <ATen/cuda/CUDAContext.h> #include <torch/all.h> -#include "../per_token_group_quant_8bit.h" +#ifndef USE_ROCM + #include "../per_token_group_quant_8bit.h" +#endif #include <cmath> @@ -339,10 +341,12 @@ void dynamic_scaled_int8_quant( }); } +#ifndef USE_ROCM void per_token_group_quant_int8(const torch::Tensor& input, torch::Tensor& output_q, torch::Tensor& output_s, int64_t group_size, double eps, double int8_min, double int8_max) { per_token_group_quant_8bit(input, output_q, output_s, group_size, eps, int8_min, int8_max); -} \ No newline at end of file +} +#endif diff --git a/csrc/quantization/cutlass_w8a8/Epilogues.md b/csrc/quantization/cutlass_w8a8/Epilogues.md index a30e1fdf3ac77..15a66913e97a3 100644 --- a/csrc/quantization/cutlass_w8a8/Epilogues.md +++ b/csrc/quantization/cutlass_w8a8/Epilogues.md @@ -86,6 +86,7 @@ D = s_a s_b \widehat A \widehat B ``` Epilogue parameters: + - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). @@ -135,7 +136,7 @@ That is precomputed and stored in `azp_with_adj` as a row-vector. Epilogue parameters: - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - - Generally this will be per-tensor as the zero-points are per-tensor. + - Generally this will be per-tensor as the zero-points are per-tensor.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `azp_with_adj` is the precomputed zero-point term ($` z_a J_a \widehat B `$), is per-channel (row-vector). - `bias` is the bias, is always per-channel (row-vector). @@ -152,7 +153,7 @@ That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product Epilogue parameters: - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - - Generally this will be per-token as the zero-points are per-token. + - Generally this will be per-token as the zero-points are per-token. - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `azp_adj` is the precomputed zero-point adjustment term ($` \mathbf 1 \widehat B `$), is per-channel (row-vector). - `azp` is the zero-point (`z_a`), is per-token (column-vector). diff --git a/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu b/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu index a21ee55b65862..03db5cc196d59 100644 --- a/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu +++ b/csrc/quantization/fp4/nvfp4_blockwise_moe_kernel.cu @@ -335,7 +335,7 @@ void run_fp4_blockwise_scaled_group_mm( TORCH_CHECK(status == cutlass::Status::kSuccess, "Failed to run GEMM"); } -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 +#if defined ENABLE_NVFP4_SM100 && ENABLE_NVFP4_SM100 constexpr auto FLOAT4_E2M1X2 = at::ScalarType::Byte; constexpr auto SF_DTYPE = at::ScalarType::Float8_e4m3fn; #endif @@ -356,7 +356,7 @@ void cutlass_fp4_group_mm( const torch::Tensor& a_blockscale, const torch::Tensor& b_blockscales, const torch::Tensor& alphas, const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets, const torch::Tensor& sf_offsets) { -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 +#if defined ENABLE_NVFP4_SM100 && ENABLE_NVFP4_SM100 // Input validation CHECK_INPUT(a, FLOAT4_E2M1X2, "a"); CHECK_INPUT(b, FLOAT4_E2M1X2, "b"); @@ -398,7 +398,7 @@ void 
cutlass_fp4_group_mm( TORCH_CHECK_NOT_IMPLEMENTED( false, "No compiled cutlass_fp4_group_mm kernel, vLLM must " - "be compiled with ENABLE_NVFP4 for SM100+ and CUDA " + "be compiled with ENABLE_NVFP4_SM100 for SM100+ and CUDA " "12.8 or above."); #endif } diff --git a/csrc/quantization/fp4/nvfp4_quant_entry.cu b/csrc/quantization/fp4/nvfp4_quant_entry.cu index badbb7e310df0..1b61bd4519fc3 100644 --- a/csrc/quantization/fp4/nvfp4_quant_entry.cu +++ b/csrc/quantization/fp4/nvfp4_quant_entry.cu @@ -16,14 +16,15 @@ #include -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 -void scaled_fp4_quant_sm100a(torch::Tensor const& output, +#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \ + (defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120) +void scaled_fp4_quant_sm1xxa(torch::Tensor const& output, torch::Tensor const& input, torch::Tensor const& output_sf, torch::Tensor const& input_sf); #endif -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 +#if defined ENABLE_NVFP4_SM100 && ENABLE_NVFP4_SM100 void scaled_fp4_experts_quant_sm100a( torch::Tensor& output, torch::Tensor& output_scale, torch::Tensor const& input, torch::Tensor const& input_global_scale, @@ -33,8 +34,9 @@ void scaled_fp4_experts_quant_sm100a( void scaled_fp4_quant(torch::Tensor& output, torch::Tensor const& input, torch::Tensor& output_sf, torch::Tensor const& input_sf) { -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 - return scaled_fp4_quant_sm100a(output, input, output_sf, input_sf); +#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \ + (defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120) + return scaled_fp4_quant_sm1xxa(output, input, output_sf, input_sf); #endif TORCH_CHECK_NOT_IMPLEMENTED(false, "No compiled nvfp4 quantization kernel"); } @@ -44,7 +46,7 @@ void scaled_fp4_experts_quant( torch::Tensor const& input, torch::Tensor const& input_global_scale, torch::Tensor const& input_offset_by_experts, torch::Tensor const& output_scale_offset_by_experts) { -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 +#if 
defined ENABLE_NVFP4_SM100 && ENABLE_NVFP4_SM100 return scaled_fp4_experts_quant_sm100a( output, output_scale, input, input_global_scale, input_offset_by_experts, output_scale_offset_by_experts); diff --git a/csrc/quantization/fp4/nvfp4_quant_kernels.cu b/csrc/quantization/fp4/nvfp4_quant_kernels.cu index d32911357a953..4e080de151648 100644 --- a/csrc/quantization/fp4/nvfp4_quant_kernels.cu +++ b/csrc/quantization/fp4/nvfp4_quant_kernels.cu @@ -332,7 +332,7 @@ template void invokeFP4Quantization(int m, int n, __nv_bfloat16 const* input, int multiProcessorCount, cudaStream_t stream); -void scaled_fp4_quant_sm100a(torch::Tensor const& output, +void scaled_fp4_quant_sm1xxa(torch::Tensor const& output, torch::Tensor const& input, torch::Tensor const& output_sf, torch::Tensor const& input_sf) { diff --git a/csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu b/csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu index 61b75e92dfaa0..9cba2828aac2e 100644 --- a/csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu +++ b/csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu @@ -16,7 +16,7 @@ #include -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 +#if defined ENABLE_NVFP4_SM100 && ENABLE_NVFP4_SM100 void cutlass_scaled_fp4_mm_sm100a(torch::Tensor& D, torch::Tensor const& A, torch::Tensor const& B, torch::Tensor const& A_sf, @@ -24,12 +24,22 @@ void cutlass_scaled_fp4_mm_sm100a(torch::Tensor& D, torch::Tensor const& A, torch::Tensor const& alpha); #endif +#if defined ENABLE_NVFP4_SM120 && ENABLE_NVFP4_SM120 +void cutlass_scaled_fp4_mm_sm120a(torch::Tensor& D, torch::Tensor const& A, + torch::Tensor const& B, + torch::Tensor const& A_sf, + torch::Tensor const& B_sf, + torch::Tensor const& alpha); +#endif + void cutlass_scaled_fp4_mm(torch::Tensor& D, torch::Tensor const& A, torch::Tensor const& B, torch::Tensor const& A_sf, torch::Tensor const& B_sf, torch::Tensor const& alpha) { -#if defined ENABLE_NVFP4 && ENABLE_NVFP4 +#if defined ENABLE_NVFP4_SM100 && ENABLE_NVFP4_SM100 return 
cutlass_scaled_fp4_mm_sm100a(D, A, B, A_sf, B_sf, alpha); +#elif defined ENABLE_NVFP4_SM120 && ENABLE_NVFP4_SM120 + return cutlass_scaled_fp4_mm_sm120a(D, A, B, A_sf, B_sf, alpha); #endif TORCH_CHECK_NOT_IMPLEMENTED(false, "No compiled nvfp4 mm kernel, vLLM should " diff --git a/csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu b/csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu new file mode 100644 index 0000000000000..89de23b76e65d --- /dev/null +++ b/csrc/quantization/fp4/nvfp4_scaled_mm_sm120_kernels.cu @@ -0,0 +1,285 @@ +/* + * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include <torch/all.h> + +#include <ATen/cuda/CUDAContext.h> +#include <c10/cuda/CUDAGuard.h> + +#include "cutlass_extensions/common.hpp" + +#include "cutlass/cutlass.h" + +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" + +#include "cutlass/util/packed_stride.hpp" + +#include "core/math.hpp" + +using namespace cute; + +#define CHECK_TYPE(x, st, m) \ + TORCH_CHECK(x.scalar_type() == st, ": Inconsistency of Tensor type:", m) +#define CHECK_TH_CUDA(x, m) \ + TORCH_CHECK(x.is_cuda(), m, ": must be a CUDA tensor") +#define CHECK_CONTIGUOUS(x, m) \ + TORCH_CHECK(x.is_contiguous(), m, ": must be contiguous") +#define CHECK_INPUT(x, st, m) \ + CHECK_TH_CUDA(x, m); \ + CHECK_CONTIGUOUS(x, m); \ + CHECK_TYPE(x, st, m) + +constexpr auto FLOAT4_E2M1X2 = at::ScalarType::Byte; +constexpr auto SF_DTYPE = at::ScalarType::Float8_e4m3fn; + +struct sm120_fp4_config_M256 { + using ClusterShape = Shape<_1, _1, _1>; + using MmaTileShape = Shape<_128, _128, _128>; + using PerSmTileShape_MNK = Shape<_128, _128, _128>; +}; + +struct sm120_fp4_config_default { + using ClusterShape = Shape<_1, _1, _1>; + using MmaTileShape = Shape<_256, _128, _128>; + using PerSmTileShape_MNK = Shape<_256, _128, _128>; +}; + +template <typename OutType, typename Config> +struct Fp4GemmSm120 { + using ElementA = cutlass::nv_float4_t<cutlass::float_e2m1_t>; + using LayoutATag = cutlass::layout::RowMajor; + static constexpr int AlignmentA = 32; + + using ElementB = cutlass::nv_float4_t<cutlass::float_e2m1_t>; + using LayoutBTag = cutlass::layout::ColumnMajor; + static constexpr int AlignmentB = 32; + + using ElementD = OutType; + using ElementC = OutType; + using LayoutCTag = cutlass::layout::RowMajor; + using LayoutDTag = cutlass::layout::RowMajor; + static constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value; + static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; + + using ElementAccumulator = float; + using ArchTag = cutlass::arch::Sm120; +
using OperatorClass = cutlass::arch::OpClassBlockScaledTensorOp; + + using MmaTileShape = typename Config::MmaTileShape; + using ClusterShape = typename Config::ClusterShape; + using PerSmTileShape_MNK = typename Config::PerSmTileShape_MNK; + + using CollectiveEpilogue = + typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, PerSmTileShape_MNK, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator, + ElementAccumulator, ElementC, LayoutCTag, AlignmentC, ElementD, + LayoutDTag, AlignmentD, + cutlass::epilogue::collective::EpilogueScheduleAuto>::CollectiveOp; + + using CollectiveMainloop = + typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, ElementA, LayoutATag, AlignmentA, ElementB, + LayoutBTag, AlignmentB, ElementAccumulator, MmaTileShape, + ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>( + sizeof(typename CollectiveEpilogue::SharedStorage))>, + cutlass::gemm::collective::KernelScheduleAuto>::CollectiveOp; + + using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>; + + using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>; +}; + +template <typename Gemm> +typename Gemm::Arguments args_from_options(at::Tensor& D, at::Tensor const& A, + at::Tensor const& B, + at::Tensor const& A_sf, + at::Tensor const& B_sf, + torch::Tensor const& alpha, int M, + int N, int K) { + using ElementA = typename Gemm::ElementA; + using ElementB = typename Gemm::ElementB; + using ElementD = typename Gemm::ElementD; + using ElementSFA = cutlass::float_ue4m3_t; + using ElementSFB = cutlass::float_ue4m3_t; + using ElementCompute = float; + + using StrideA = typename Gemm::GemmKernel::StrideA; + using StrideB = typename Gemm::GemmKernel::StrideB; + using StrideC = typename Gemm::GemmKernel::StrideC; + using StrideD = typename Gemm::GemmKernel::StrideD; + + using Sm1xxBlkScaledConfig = + typename
Gemm::GemmKernel::CollectiveMainloop::Sm1xxBlkScaledConfig; + + auto stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}); + auto stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}); + auto stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1}); + + auto layout_SFA = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFA( + cute::make_shape(M, N, K, 1)); + auto layout_SFB = Sm1xxBlkScaledConfig::tile_atom_to_shape_SFB( + cute::make_shape(M, N, K, 1)); + + typename Gemm::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, + {M, N, K, 1}, + {static_cast<ElementA const*>(A.data_ptr()), stride_A, + static_cast<ElementB const*>(B.data_ptr()), stride_B, + static_cast<ElementSFA const*>(A_sf.data_ptr()), layout_SFA, + static_cast<ElementSFB const*>(B_sf.data_ptr()), layout_SFB}, + {{}, + static_cast<ElementD const*>(D.data_ptr()), + stride_D, + static_cast<ElementD*>(D.data_ptr()), + stride_D}}; + auto& fusion_args = arguments.epilogue.thread; + fusion_args.alpha_ptr = static_cast<ElementCompute const*>(alpha.data_ptr()); + + return arguments; +} + +template <typename Gemm> +void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B, + at::Tensor const& A_sf, at::Tensor const& B_sf, + torch::Tensor const& alpha, int M, int N, int K, + cudaStream_t stream) { + Gemm gemm; + + auto arguments = args_from_options<Gemm>(D, A, B, A_sf, B_sf, alpha, M, N, K); + + size_t workspace_size = Gemm::get_workspace_size(arguments); + auto const workspace_options = + torch::TensorOptions().dtype(torch::kUInt8).device(A.device()); + auto workspace = torch::empty(workspace_size, workspace_options); + + CUTLASS_CHECK(gemm.can_implement(arguments)); + + CUTLASS_CHECK(gemm.initialize(arguments, workspace.data_ptr(), stream)); + + CUTLASS_CHECK(gemm.run(arguments, workspace.data_ptr(), stream)); +} + +void cutlass_fp4_bf16_gemm_dispatch(torch::Tensor& D, torch::Tensor const& A, + torch::Tensor const& B, + torch::Tensor const& A_sf, + torch::Tensor const& B_sf, + torch::Tensor const& alpha, int m, int n, + int k, cudaStream_t stream) { + uint32_t const mp2 = std::max(static_cast<uint32_t>(16),
next_pow_2(m)); + if (mp2 <= 256) { + runGemm<typename Fp4GemmSm120<cutlass::bfloat16_t, + sm120_fp4_config_M256>::Gemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } else { + runGemm<typename Fp4GemmSm120<cutlass::bfloat16_t, + sm120_fp4_config_default>::Gemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } +} + +void cutlass_fp4_f16_gemm_dispatch(torch::Tensor& D, torch::Tensor const& A, + torch::Tensor const& B, + torch::Tensor const& A_sf, + torch::Tensor const& B_sf, + torch::Tensor const& alpha, int m, int n, + int k, cudaStream_t stream) { + uint32_t const mp2 = std::max(static_cast<uint32_t>(16), next_pow_2(m)); + if (mp2 <= 256) { + runGemm<typename Fp4GemmSm120<cutlass::half_t, + sm120_fp4_config_M256>::Gemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } else { + runGemm<typename Fp4GemmSm120<cutlass::half_t, + sm120_fp4_config_default>::Gemm>( + D, A, B, A_sf, B_sf, alpha, m, n, k, stream); + } +} + +void cutlass_scaled_fp4_mm_sm120a(torch::Tensor& D, torch::Tensor const& A, + torch::Tensor const& B, + torch::Tensor const& A_sf, + torch::Tensor const& B_sf, + torch::Tensor const& alpha) { +#if defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) + CHECK_INPUT(A, FLOAT4_E2M1X2, "a"); + CHECK_INPUT(B, FLOAT4_E2M1X2, "b"); + + CHECK_INPUT(A_sf, SF_DTYPE, "scale_a"); + CHECK_INPUT(B_sf, SF_DTYPE, "scale_b"); + + CHECK_INPUT(alpha, at::ScalarType::Float, "alpha"); + + TORCH_CHECK(A.dim() == 2, "a must be a matrix"); + TORCH_CHECK(B.dim() == 2, "b must be a matrix"); + TORCH_CHECK(A.sizes()[1] == B.sizes()[1], + "a and b shapes cannot be multiplied (", A.sizes()[0], "x", + A.sizes()[1], " and ", B.sizes()[0], "x", B.sizes()[1], ")"); + + auto const m = A.sizes()[0]; + auto const n = B.sizes()[0]; + auto const k = A.sizes()[1] * 2; + + constexpr int alignment = 32; + TORCH_CHECK(k % alignment == 0, "Expected k to be divisible by ", alignment, + ", but got a shape: (", A.sizes()[0], "x", A.sizes()[1], + "), k: ", k, "."); + TORCH_CHECK(n % alignment == 0, "Expected n to be divisible by ", alignment, + ", but got b shape: (", B.sizes()[0], "x", B.sizes()[1], ")."); + + auto round_up = [](int x, int y) { return (x + y - 1) / y * y; }; + int rounded_m = round_up(m, 128); + int rounded_n = round_up(n, 128); + // Since k is divisible
by 32 (alignment), k / 16 is guaranteed to be an + // integer. + int rounded_k = round_up(k / 16, 4); + + TORCH_CHECK(A_sf.dim() == 2, "scale_a must be a matrix"); + TORCH_CHECK(B_sf.dim() == 2, "scale_b must be a matrix"); + TORCH_CHECK(A_sf.sizes()[1] == B_sf.sizes()[1], + "scale_a and scale_b shapes cannot be multiplied (", + A_sf.sizes()[0], "x", A_sf.sizes()[1], " and ", B_sf.sizes()[0], + "x", B_sf.sizes()[1], ")"); + TORCH_CHECK(A_sf.sizes()[0] == rounded_m && A_sf.sizes()[1] == rounded_k, + "scale_a must be padded and swizzled to a shape (", rounded_m, + "x", rounded_k, "), but got a shape (", A_sf.sizes()[0], "x", + A_sf.sizes()[1], ")"); + TORCH_CHECK(B_sf.sizes()[0] == rounded_n && B_sf.sizes()[1] == rounded_k, + "scale_b must be padded and swizzled to a shape (", rounded_n, + "x", rounded_k, "), but got a shape (", B_sf.sizes()[0], "x", + B_sf.sizes()[1], ")"); + + auto out_dtype = D.dtype(); + const at::cuda::OptionalCUDAGuard device_guard(device_of(A)); + const cudaStream_t stream = at::cuda::getCurrentCUDAStream(A.get_device()); + + if (out_dtype == at::ScalarType::BFloat16) { + return cutlass_fp4_bf16_gemm_dispatch(D, A, B, A_sf, B_sf, alpha, m, n, k, + stream); + } else if (out_dtype == at::ScalarType::Half) { + return cutlass_fp4_f16_gemm_dispatch(D, A, B, A_sf, B_sf, alpha, m, n, k, + stream); + } else { + TORCH_CHECK(false, "Unsupported output data type of nvfp4 mm sm120 (", + out_dtype, ")"); + } +#else + TORCH_CHECK(false, + "Unsupported CUTLASS version. 
Set VLLM_CUTLASS_SRC_DIR to " + "a CUTLASS 3.8 source directory to enable support."); +#endif // defined(CUTLASS_ARCH_MMA_SM120_SUPPORTED) +} \ No newline at end of file diff --git a/csrc/quantization/fp8/common.cu b/csrc/quantization/fp8/common.cu index 0e1eab66f0b98..5fe5dd04bd891 100644 --- a/csrc/quantization/fp8/common.cu +++ b/csrc/quantization/fp8/common.cu @@ -1,7 +1,8 @@ #include "common.cuh" #include "dispatch_utils.h" - +#include "../vectorization_utils.cuh" #include <c10/cuda/CUDAGuard.h> +#include #ifndef USE_ROCM #include <cub/cub.cuh> @@ -12,74 +13,127 @@ namespace vllm { template <typename scalar_t, typename fp8_type> -__global__ void scaled_fp8_quant_kernel(fp8_type* __restrict__ out, - const scalar_t* __restrict__ input, - const float* __restrict__ scale, - int64_t num_elems) { - int tid = blockDim.x * blockIdx.x + threadIdx.x; +__global__ void scaled_fp8_quant_kernel_strided( + fp8_type* __restrict__ out, const scalar_t* __restrict__ input, + const float* __restrict__ scale, int hidden_size, int64_t in_row_stride, + int64_t out_row_stride) { + const int64_t token_idx = blockIdx.x; // one token per block + const int tid = threadIdx.x; - // Invert the scale so that we can use multiplications to avoid expensive - // division.
- const float inverted_scale = 1.0f / (*scale); - scaled_fp8_conversion_vec<scalar_t, true>( - out, input, inverted_scale, num_elems, tid, blockDim.x * gridDim.x); + const scalar_t* token_in = input + token_idx * in_row_stride; + fp8_type* token_out = out + token_idx * out_row_stride; + + const float inv_scale = 1.0f / (*scale); + + vectorize_with_alignment<16>( + token_in, token_out, hidden_size, tid, blockDim.x, + [=] __device__(fp8_type & dst, const scalar_t& src) { + dst = scaled_fp8_conversion<true, fp8_type>(static_cast<float>(src), + inv_scale); + }); } template <typename scalar_t, typename fp8_type> -__global__ void dynamic_per_token_scaled_fp8_quant_kernel( - fp8_type* __restrict__ out, float* __restrict__ scale, - scalar_t const* __restrict__ input, float const* __restrict__ scale_ub, - const int hidden_size) { - int const tid = threadIdx.x; - int const token_idx = blockIdx.x; +__global__ void segmented_max_reduction_strided( + float* __restrict__ scale, const scalar_t* __restrict__ input, + int hidden_size, int64_t in_row_stride, int64_t num_tokens) { + __shared__ float cache[256]; + const int tid = threadIdx.x; + int64_t token_idx = blockIdx.x; - // Use int64 to avoid overflowing an int32 when calculating this offset - int64_t offset = static_cast<int64_t>(token_idx) * hidden_size; - scalar_t const* __restrict__ token_input = &input[offset]; - fp8_type* __restrict__ token_output = &out[offset]; - - // For vectorization, token_input and token_output pointers need to be - // aligned at 32-byte and 16-byte addresses respectively. - bool const can_vectorize = hidden_size % 16 == 0; - - float absmax_val = 0.0f; - if (can_vectorize) { - absmax_val = thread_max_vec(token_input, hidden_size, tid, blockDim.x); - } else { - for (int i = tid; i < hidden_size; i += blockDim.x) { - float const x = static_cast<float>(token_input[i]); - absmax_val = fmaxf(absmax_val, fabsf(x)); - } + // one block per token. Guard in case gridDim.x > num_tokens.
+ if (token_idx >= num_tokens) { + return; + } + const scalar_t* row_ptr = input + token_idx * in_row_stride; + + // each thread scans elements of the row in a strided fashion. + float thread_max = 0.0f; + for (int e = tid; e < hidden_size; e += blockDim.x) { + float v = fabsf(static_cast<float>(row_ptr[e])); + thread_max = fmaxf(thread_max, v); + } + + cache[tid] = thread_max; + __syncthreads(); + + // parallel reduction to find row max. + for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) { + if (tid < offset) { + cache[tid] = fmaxf(cache[tid], cache[tid + offset]); + } + __syncthreads(); + } + + // thread 0 updates global scale (per-tensor) atomically. + if (tid == 0) { + atomicMaxFloat(scale, cache[0] / quant_type_max_v<fp8_type>); + } +} + +template <typename scalar_t, typename fp8_type> +__global__ void scaled_fp8_quant_kernel_strided_dynamic( + fp8_type* __restrict__ out, const scalar_t* __restrict__ input, + const float* __restrict__ scale, int hidden_size, int64_t in_row_stride, + int64_t out_row_stride) { + const int64_t token_idx = blockIdx.x; + const int tid = threadIdx.x; + + const scalar_t* token_in = input + token_idx * in_row_stride; + fp8_type* token_out = out + token_idx * out_row_stride; + + const float reciprocal_scale = 1.0f / (*scale); + vectorize_with_alignment<16>( + token_in, token_out, hidden_size, tid, blockDim.x, + [=] __device__(fp8_type & dst, const scalar_t& src) { + dst = scaled_fp8_conversion<true, fp8_type>(static_cast<float>(src), + reciprocal_scale); + }); +} + +template <typename scalar_t, typename fp8_type> +__global__ void dynamic_per_token_scaled_fp8_quant_kernel_strided( + fp8_type* __restrict__ out, float* __restrict__ scale, + const scalar_t* __restrict__ input, const float* __restrict__ scale_ub, + int hidden_size, int64_t in_row_stride, int64_t out_row_stride) { + const int64_t token_idx = blockIdx.x; + const int tid = threadIdx.x; + + // Use int64 to avoid overflowing an int32 when calculating this offset + int64_t in_offset = static_cast<int64_t>(token_idx) * in_row_stride; + int64_t out_offset = static_cast<int64_t>(token_idx) * out_row_stride;
+ const scalar_t* token_in = input + in_offset; + fp8_type* token_out = out + out_offset; + + // 1) per-token absmax + float absmax_val = 0.f; + vectorize_read_with_alignment<16>( + token_in, hidden_size, tid, blockDim.x, [&] __device__(scalar_t v) { + absmax_val = fmaxf(absmax_val, fabsf(static_cast<float>(v))); + }); + using BlockReduce = cub::BlockReduce<float, 256>; - __shared__ typename BlockReduce::TempStorage reduceStorage; - float const block_absmax_val_maybe = - BlockReduce(reduceStorage).Reduce(absmax_val, cub::Max{}, blockDim.x); + __shared__ typename BlockReduce::TempStorage tmp; + const float block_max = + BlockReduce(tmp).Reduce(absmax_val, cub::Max{}, blockDim.x); + __shared__ float token_scale; if (tid == 0) { - if (scale_ub) { - token_scale = fminf(block_absmax_val_maybe, *scale_ub); - } else { - token_scale = block_absmax_val_maybe; - } - // token scale computation + token_scale = scale_ub ? fminf(block_max, *scale_ub) : block_max; token_scale = fmaxf(token_scale / quant_type_max_v<fp8_type>, min_scaling_factor<fp8_type>::val()); scale[token_idx] = token_scale; } __syncthreads(); - // Note that we don't use inverted scales so we can match FBGemm impl.
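For intuition, the per-token scale logic in the kernel above (absmax per row, optional upper bound, clamp to a minimum, then quantize by division rather than by an inverted scale) can be mimicked in a few lines of NumPy. This is only a sketch: `FP8_MAX = 448.0` (the e4m3 maximum) and the `MIN_SCALE` clamp value are assumptions standing in for `quant_type_max_v` and `min_scaling_factor`:

```python
import numpy as np

FP8_MAX = 448.0                       # assumed max for float8_e4m3fn
MIN_SCALE = 1.0 / (FP8_MAX * 512.0)   # assumed stand-in for min_scaling_factor

def dynamic_per_token_scaled_quant(x, scale_ub=None):
    """One scale per token (row): absmax -> optional upper bound -> clamp."""
    absmax = np.abs(x).max(axis=-1)                  # per-token absmax
    if scale_ub is not None:
        absmax = np.minimum(absmax, scale_ub)
    scale = np.maximum(absmax / FP8_MAX, MIN_SCALE)  # never below MIN_SCALE
    # Scales are not inverted here, matching the kernel: quantize by division.
    q = np.clip(x / scale[:, None], -FP8_MAX, FP8_MAX)
    return q, scale

x = np.array([[1.0, -2.0, 4.0],
              [0.0,  0.0, 0.0]])
q, s = dynamic_per_token_scaled_quant(x)
```

The minimum-scale clamp is what keeps an all-zero row (second row above) from producing a divide-by-zero; its quantized values simply come out as zeros.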
- if (can_vectorize) { - scaled_fp8_conversion_vec<scalar_t, false>( - token_output, token_input, token_scale, hidden_size, tid, blockDim.x); - } else { - for (int i = tid; i < hidden_size; i += blockDim.x) { - token_output[i] = scaled_fp8_conversion<false, fp8_type>( - static_cast<float>(token_input[i]), token_scale); - } - } + // 2) quantize + vectorize_with_alignment<16>( + token_in, token_out, hidden_size, tid, blockDim.x, + [=] __device__(fp8_type & dst, const scalar_t& src) { + dst = scaled_fp8_conversion<false, fp8_type>(static_cast<float>(src), + token_scale); + }); } } // namespace vllm @@ -88,23 +142,31 @@ void static_scaled_fp8_quant(torch::Tensor& out, // [..., d] torch::Tensor const& input, // [..., d] torch::Tensor const& scale) // [1] { - TORCH_CHECK(input.is_contiguous()); - TORCH_CHECK(out.is_contiguous()); - int const block_size = 256; - int const num_tokens = input.numel() / input.size(-1); - int const num_elems = input.numel(); - dim3 const grid(num_tokens); - dim3 const block(block_size); + TORCH_CHECK(input.stride(-1) == 1, + "last dimension of input must be contiguous"); + TORCH_CHECK(out.stride(-1) == 1, + "last dimension of output must be contiguous"); + + const int hidden_size = input.size(-1); + const int num_tokens = input.numel() / hidden_size; + const int block_size = 256; + dim3 grid(num_tokens); + dim3 block(block_size); + + const int64_t in_row_stride = input.stride(-2); + const int64_t out_row_stride = out.stride(-2); + const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); VLLM_DISPATCH_FLOATING_TYPES( input.scalar_type(), "scaled_fp8_quant_kernel_scalar_type", [&] { VLLM_DISPATCH_FP8_TYPES( out.scalar_type(), "scaled_fp8_quant_kernel_fp8_type", [&] { - vllm::scaled_fp8_quant_kernel<scalar_t, fp8_t> + vllm::scaled_fp8_quant_kernel_strided<scalar_t, fp8_t> <<<grid, block, 0, stream>>>( out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(), - scale.data_ptr<float>(), num_elems); + scale.data_ptr<float>(), hidden_size, in_row_stride, + out_row_stride); }); }); } @@ -113,27 +175,42 @@ void
dynamic_scaled_fp8_quant(torch::Tensor& out, // [..., d] torch::Tensor const& input, // [..., d] torch::Tensor& scale) // [1] { - TORCH_CHECK(input.is_contiguous()); - TORCH_CHECK(out.is_contiguous()); - int const block_size = 256; - int const num_tokens = input.numel() / input.size(-1); - int const num_elems = input.numel(); - dim3 const grid(num_tokens); - dim3 const block(block_size); + TORCH_CHECK(input.stride(-1) == 1, + "last dimension of input must be contiguous"); + TORCH_CHECK(out.stride(-1) == 1, + "last dimension of output must be contiguous"); + + const int hidden_size = input.size(-1); + const int num_tokens = input.numel() / hidden_size; + const int block_size = 256; + dim3 grid(num_tokens); + dim3 block(block_size); + + const int64_t in_row_stride = input.stride(-2); + const int64_t out_row_stride = out.stride(-2); + const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); + + // scale tensor should be initialised to <=0 before reduction + AT_CUDA_CHECK( + cudaMemsetAsync(scale.data_ptr<float>(), 0, sizeof(float), stream)); + VLLM_DISPATCH_FLOATING_TYPES( input.scalar_type(), "scaled_fp8_quant_kernel_scalar_type", [&] { VLLM_DISPATCH_FP8_TYPES( out.scalar_type(), "scaled_fp8_quant_kernel_fp8_type", [&] { - vllm::segmented_max_reduction<scalar_t> - <<<grid, block, 0, stream>>>(scale.data_ptr<float>(), - input.data_ptr<scalar_t>(), - num_elems); - vllm::scaled_fp8_quant_kernel<scalar_t, fp8_t> + vllm::segmented_max_reduction_strided<scalar_t> + <<<grid, block, 0, stream>>>( + scale.data_ptr<float>(), input.data_ptr<scalar_t>(), + hidden_size, in_row_stride, + static_cast<int64_t>(num_tokens)); + + vllm::scaled_fp8_quant_kernel_strided_dynamic<scalar_t, fp8_t> <<<grid, block, 0, stream>>>( out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(), - scale.data_ptr<float>(), num_elems); + scale.data_ptr<float>(), hidden_size, in_row_stride, + out_row_stride); }); }); } @@ -142,14 +219,19 @@ void dynamic_per_token_scaled_fp8_quant( torch::Tensor& out, // [..., d] torch::Tensor const& input, // [..., d] torch::Tensor& scales, std::optional<torch::Tensor> const& scale_ub) { - TORCH_CHECK(input.is_contiguous()); -
TORCH_CHECK(out.is_contiguous()); + TORCH_CHECK(input.stride(-1) == 1, + "last dimension of input must be contiguous"); + TORCH_CHECK(out.stride(-1) == 1, + "last dimension of output must be contiguous"); - int const hidden_size = input.size(-1); - int const num_tokens = input.numel() / hidden_size; - int const block_size = 256; - dim3 const grid(num_tokens); - dim3 const block(std::min(hidden_size, block_size)); + const int hidden_size = input.size(-1); + const int num_tokens = input.numel() / hidden_size; + const int block_size = 256; + dim3 grid(num_tokens); + dim3 block(std::min(hidden_size, block_size)); + + const int64_t in_row_stride = input.stride(-2); + const int64_t out_row_stride = out.stride(-2); const at::cuda::OptionalCUDAGuard device_guard(device_of(input)); const cudaStream_t stream = at::cuda::getCurrentCUDAStream(); @@ -159,13 +241,12 @@ void dynamic_per_token_scaled_fp8_quant( VLLM_DISPATCH_FP8_TYPES( out.scalar_type(), "dynamic_per_token_scaled_fp8_quant_kernel_fp8_type", [&] { - vllm::dynamic_per_token_scaled_fp8_quant_kernel<scalar_t, fp8_t> - <<<grid, block, 0, stream>>>( - out.data_ptr<fp8_t>(), scales.data_ptr<float>(), - input.data_ptr<scalar_t>(), - scale_ub.has_value() ? scale_ub->data_ptr<float>() - : nullptr, - hidden_size); + vllm::dynamic_per_token_scaled_fp8_quant_kernel_strided< + scalar_t, fp8_t><<<grid, block, 0, stream>>>( + out.data_ptr<fp8_t>(), scales.data_ptr<float>(), + input.data_ptr<scalar_t>(), + scale_ub.has_value() ? scale_ub->data_ptr<float>() : nullptr, + hidden_size, in_row_stride, out_row_stride); }); }); } diff --git a/csrc/quantization/fp8/common.cuh b/csrc/quantization/fp8/common.cuh index d36f94a8f10d6..1aad6330c44b8 100644 --- a/csrc/quantization/fp8/common.cuh +++ b/csrc/quantization/fp8/common.cuh @@ -55,111 +55,4 @@ __device__ __forceinline__ fp8_type scaled_fp8_conversion(float const val, #endif } -// Compute the absolute maximum m of the input tensor and store -// m / float8_e4m3::max() in *scale. Each thread block performs a -// reduction tree and the memory in scale is atomically updated.
-// So to get the right answer, *scale needs to be initialized to -// a value <= 0.0 and we need to wait for all thread blocks to -// finish before consuming *scale. -template <typename scalar_t> -__global__ void segmented_max_reduction(float* __restrict__ scale, - const scalar_t* __restrict__ input, - int64_t num_elems) { - __shared__ float cache[256]; - int64_t i = blockDim.x * blockIdx.x + threadIdx.x; - - // First store maximum for all values processes by - // the current thread in cache[threadIdx.x] - scalar_t tmp = 0.0; - while (i < num_elems) { - float x = static_cast<float>(input[i]); - tmp = fmaxf(tmp, fabsf(x)); - i += blockDim.x * gridDim.x; - } - cache[threadIdx.x] = tmp; - - __syncthreads(); - - // Now perform parallel reduction within the thread block - int ib = blockDim.x / 2; - while (ib != 0) { - if (threadIdx.x < ib && cache[threadIdx.x + ib] > cache[threadIdx.x]) { - cache[threadIdx.x] = cache[threadIdx.x + ib]; - } - __syncthreads(); - ib /= 2; - } - // Finally, since cache[0] contains the maximum for this thread block, - // atomically write the max to the target location - if (threadIdx.x == 0) { - atomicMaxFloat(scale, cache[0] / quant_type_max_v<scalar_t>); - } -} - -template <typename scalar_t> -__device__ float thread_max_vec(scalar_t const* __restrict__ input, - int64_t const num_elems, int const tid, - int const step) { - constexpr size_t VEC_SIZE = 16; - using scalarxN_t = vec_n_t<scalar_t, VEC_SIZE>; - // Vectorized input/output to better utilize memory bandwidth.
- auto const* vectorized_in = reinterpret_cast<scalarxN_t const*>(input); - - // num_elems / VEC_SIZE (which is 16) - int64_t const num_vec_elems = num_elems >> 4; - float absmax_val = 0.0f; - -#pragma unroll - for (int64_t i = tid; i < num_vec_elems; i += step) { - scalarxN_t in_vec = vectorized_in[i]; -#pragma unroll - for (int j = 0; j < VEC_SIZE; ++j) { - absmax_val = fmaxf(absmax_val, fabsf(in_vec.val[j])); - } - } - - // Handle the remaining elements if num_elems is not divisible by VEC_SIZE - for (int64_t i = num_vec_elems * VEC_SIZE + tid; i < num_elems; i += step) { - absmax_val = fmaxf(absmax_val, fabsf(input[i])); - } - - return absmax_val; -} - -template <typename scalar_t, typename fp8_type> -__device__ void scaled_fp8_conversion_vec(fp8_type* __restrict__ out, - scalar_t const* __restrict__ input, - float const scale, - int64_t const num_elems, - int const tid, int const step) { - constexpr size_t VEC_SIZE = 16; - using scalarxN_t = vec_n_t<scalar_t, VEC_SIZE>; - using float8xN_t = q8_n_t<fp8_type, VEC_SIZE>; - // Vectorized input/output to better utilize memory bandwidth.
- auto const* vectorized_in = reinterpret_cast<scalarxN_t const*>(input); - auto* vectorized_out = reinterpret_cast<float8xN_t*>(out); - - // num_elems / VEC_SIZE (which is 16) - int64_t const num_vec_elems = num_elems >> 4; - -#pragma unroll - for (int64_t i = tid; i < num_vec_elems; i += step) { - scalarxN_t in_vec = vectorized_in[i]; - float8xN_t out_vec; - -#pragma unroll - for (int j = 0; j < VEC_SIZE; ++j) { - out_vec.val[j] = scaled_fp8_conversion<false, fp8_type>( - static_cast<float>(in_vec.val[j]), scale); - } - vectorized_out[i] = out_vec; - } - - // Handle the remaining elements if num_elems is not divisible by VEC_SIZE - for (int64_t i = num_vec_elems * VEC_SIZE + tid; i < num_elems; i += step) { - out[i] = scaled_fp8_conversion<false, fp8_type>( - static_cast<float>(input[i]), scale); - } -} - } // namespace vllm diff --git a/csrc/quantization/fp8/per_token_group_quant.cu b/csrc/quantization/fp8/per_token_group_quant.cu index 2609054f2072b..f5b40e35b6e5a 100644 --- a/csrc/quantization/fp8/per_token_group_quant.cu +++ b/csrc/quantization/fp8/per_token_group_quant.cu @@ -1,12 +1,10 @@ #include -#include #include "../per_token_group_quant_8bit.h" #include -#include -#include +#include #include @@ -199,7 +197,7 @@ void per_token_group_quant_8bit(const torch::Tensor& input, VLLM_DISPATCH_FLOATING_TYPES( input.scalar_type(), "per_token_group_quant_8bit", ([&] { if (dst_type == at::ScalarType::Float8_e4m3fn) { - LAUNCH_KERNEL(scalar_t, c10::Float8_e4m3fn); + LAUNCH_KERNEL(scalar_t, __nv_fp8_e4m3); } else if (dst_type == at::ScalarType::Char) { LAUNCH_KERNEL(scalar_t, int8_t); } diff --git a/docker/Dockerfile b/docker/Dockerfile index b87401c593572..d444087a3eff7 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -1,4 +1,3 @@ - # The vLLM Dockerfile is used to construct vLLM image that can be directly used # to run the OpenAI compatible server.
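The per-token dynamic quantization implemented by the FP8 kernels above can be sketched in plain Python. This is a simplified model of the numerics only: `FP8_E4M3_MAX` is the largest finite e4m3 value, while `MIN_SCALING_FACTOR` is an assumed stand-in for `min_scaling_factor<fp8_type>::val()`, and rounding to actual fp8 values is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3 value
MIN_SCALING_FACTOR = 1.0 / (FP8_E4M3_MAX * 512.0)  # assumed lower bound on the scale

def dynamic_per_token_quant(row, scale_ub=None):
    """Simplified model of the per-token kernel: absmax -> scale -> quantize."""
    absmax = max(abs(v) for v in row)  # 1) per-token absmax
    if scale_ub is not None:           # optional upper bound on the absmax
        absmax = min(absmax, scale_ub)
    scale = max(absmax / FP8_E4M3_MAX, MIN_SCALING_FACTOR)
    # 2) quantize: divide by the (non-inverted) scale, clamp to the fp8 range
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
    return q, scale
```

As in the kernel, the scale is computed per row rather than per tensor, which is why the strided variants only need the row's base pointer and stride.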
@@ -16,6 +15,7 @@ ARG PYTHON_VERSION=3.12 # Example: # docker build --build-arg BUILD_BASE_IMAGE=registry.acme.org/mirror/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 ARG BUILD_BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 +# TODO: Restore to base image after FlashInfer AOT wheel fixed ARG FINAL_BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 # By parameterizing the Deadsnakes repository URL, we allow third-party to use @@ -119,6 +119,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \ # Reference: https://github.com/astral-sh/uv/pull/1694 ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" +# Use copy mode to avoid hardlink failures with Docker cache mounts +ENV UV_LINK_MODE=copy # Upgrade to GCC 10 to avoid https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92519 # as it was causing spam when compiling the CUTLASS kernels @@ -164,9 +166,6 @@ RUN --mount=type=cache,target=/root/.cache/uv \ # see https://github.com/pytorch/pytorch/pull/123243 ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list} -# Override the arch list for flash-attn to reduce the binary size -ARG vllm_fa_cmake_gpu_arches='80-real;90-real' -ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches} #################### BASE BUILD IMAGE #################### #################### WHEEL BUILD IMAGE #################### @@ -184,6 +183,8 @@ COPY requirements/build.txt requirements/build.txt # Reference: https://github.com/astral-sh/uv/pull/1694 ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" +# Use copy mode to avoid hardlink failures with Docker cache mounts +ENV UV_LINK_MODE=copy RUN --mount=type=cache,target=/root/.cache/uv \ uv pip install --system -r requirements/build.txt \ @@ -275,6 +276,8 @@ ARG PYTORCH_CUDA_INDEX_BASE_URL # Reference: https://github.com/astral-sh/uv/pull/1694 ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" +# Use copy mode to avoid hardlink failures with Docker 
cache mounts +ENV UV_LINK_MODE=copy COPY requirements/lint.txt requirements/lint.txt COPY requirements/test.txt requirements/test.txt @@ -286,7 +289,6 @@ RUN --mount=type=cache,target=/root/.cache/uv \ #################### vLLM installation IMAGE #################### # image with vLLM installed -# TODO: Restore to base image after FlashInfer AOT wheel fixed FROM ${FINAL_BASE_IMAGE} AS vllm-base ARG CUDA_VERSION ARG PYTHON_VERSION @@ -345,6 +347,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \ # Reference: https://github.com/astral-sh/uv/pull/1694 ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" +# Use copy mode to avoid hardlink failures with Docker cache mounts +ENV UV_LINK_MODE=copy # Workaround for https://github.com/openai/triton/issues/2507 and # https://github.com/pytorch/pytorch/issues/107960 -- hopefully @@ -386,7 +390,9 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist # Install FlashInfer from source ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git" -ARG FLASHINFER_GIT_REF="v0.2.9rc2" +# Keep this in sync with https://github.com/vllm-project/vllm/blob/main/requirements/cuda.txt +# We use `--force-reinstall --no-deps` to avoid issues with the existing FlashInfer wheel. +ARG FLASHINFER_GIT_REF="v0.2.9" RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' . /etc/environment git clone --depth 1 --recursive --shallow-submodules \ @@ -408,7 +414,7 @@ RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ python3 -m flashinfer.aot TORCH_CUDA_ARCH_LIST="${FI_TORCH_CUDA_ARCH_LIST}" \ - uv pip install --system --no-build-isolation . + uv pip install --system --no-build-isolation --force-reinstall --no-deps . 
popd rm -rf flashinfer BASH @@ -429,6 +435,33 @@ RUN --mount=type=cache,target=/root/.cache/uv \ uv pip install --system -r requirements/build.txt \ --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') +# Install DeepGEMM from source +ARG DEEPGEMM_GIT_REPO="https://github.com/deepseek-ai/DeepGEMM.git" +ARG DEEPGEMM_GIT_REF="187656694f7f69e3e7975617a68bc3387680a7e1" +RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH' + . /etc/environment + CUDA_MAJOR="${CUDA_VERSION%%.*}" + CUDA_MINOR="${CUDA_VERSION#${CUDA_MAJOR}.}" + CUDA_MINOR="${CUDA_MINOR%%.*}" + if [ "$CUDA_MAJOR" -ge 12 ] && [ "$CUDA_MINOR" -ge 8 ]; then + git clone --recursive --shallow-submodules \ + ${DEEPGEMM_GIT_REPO} deepgemm + echo "🏗️ Building DeepGEMM" + pushd deepgemm + git checkout ${DEEPGEMM_GIT_REF} + # Build DeepGEMM + # (Based on https://github.com/deepseek-ai/DeepGEMM/blob/main/install.sh) + rm -rf build dist + rm -rf *.egg-info + python3 setup.py bdist_wheel + uv pip install --system dist/*.whl + popd + rm -rf deepgemm + else + echo "Skipping DeepGEMM installation (requires CUDA 12.8+ but got ${CUDA_VERSION})" + fi +BASH + #################### vLLM installation IMAGE #################### #################### TEST IMAGE #################### @@ -447,6 +480,8 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL # Reference: https://github.com/astral-sh/uv/pull/1694 ENV UV_HTTP_TIMEOUT=500 ENV UV_INDEX_STRATEGY="unsafe-best-match" +# Use copy mode to avoid hardlink failures with Docker cache mounts +ENV UV_LINK_MODE=copy # install development dependencies (for testing) RUN --mount=type=cache,target=/root/.cache/uv \ diff --git a/docker/Dockerfile.nightly_torch b/docker/Dockerfile.nightly_torch index 8d43de77aad59..e147b97f0e056 100644 --- a/docker/Dockerfile.nightly_torch +++ b/docker/Dockerfile.nightly_torch @@ -114,9 +114,6 @@ RUN cat torch_build_versions.txt # explicitly set the list to avoid issues with torch 2.2 # see 
https://github.com/pytorch/pytorch/pull/123243 -# Override the arch list for flash-attn to reduce the binary size -ARG vllm_fa_cmake_gpu_arches='80-real;90-real' -ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches} #################### BASE BUILD IMAGE #################### #################### WHEEL BUILD IMAGE #################### diff --git a/docker/Dockerfile.tpu b/docker/Dockerfile.tpu index b9fc9def88190..2190151369761 100644 --- a/docker/Dockerfile.tpu +++ b/docker/Dockerfile.tpu @@ -1,4 +1,4 @@ -ARG NIGHTLY_DATE="20250724" +ARG NIGHTLY_DATE="20250730" ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.12_tpuvm_$NIGHTLY_DATE" FROM $BASE_IMAGE diff --git a/docs/.nav.yml b/docs/.nav.yml index ab54dc3e535bd..ad742be3d6947 100644 --- a/docs/.nav.yml +++ b/docs/.nav.yml @@ -56,9 +56,7 @@ nav: - contributing/model/tests.md - contributing/model/multimodal.md - CI: contributing/ci - - Design Documents: - - V0: design - - V1: design/v1 + - Design Documents: design - API Reference: - Summary: api/README.md - Contents: diff --git a/docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png b/docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png new file mode 100644 index 0000000000000..5721d5582c7f1 Binary files /dev/null and b/docs/assets/design/fused_moe_modular_kernel/fused_experts_blocks.png differ diff --git a/docs/assets/design/fused_moe_modular_kernel/fused_moe_batched.png b/docs/assets/design/fused_moe_modular_kernel/fused_moe_batched.png new file mode 100644 index 0000000000000..8168155b9dbaf Binary files /dev/null and b/docs/assets/design/fused_moe_modular_kernel/fused_moe_batched.png differ diff --git a/docs/assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png b/docs/assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png new file mode 100644 index 0000000000000..bc6cc0aaaf47b Binary files /dev/null and b/docs/assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png 
differ diff --git a/docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png b/docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png new file mode 100644 index 0000000000000..94364e593fe68 Binary files /dev/null and b/docs/assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png differ diff --git a/docs/assets/design/v1/metrics/intervals-1.png b/docs/assets/design/metrics/intervals-1.png similarity index 100% rename from docs/assets/design/v1/metrics/intervals-1.png rename to docs/assets/design/metrics/intervals-1.png diff --git a/docs/assets/design/v1/metrics/intervals-2.png b/docs/assets/design/metrics/intervals-2.png similarity index 100% rename from docs/assets/design/v1/metrics/intervals-2.png rename to docs/assets/design/metrics/intervals-2.png diff --git a/docs/assets/design/v1/metrics/intervals-3.png b/docs/assets/design/metrics/intervals-3.png similarity index 100% rename from docs/assets/design/v1/metrics/intervals-3.png rename to docs/assets/design/metrics/intervals-3.png diff --git a/docs/assets/kernel/k_vecs.png b/docs/assets/design/paged_attention/k_vecs.png similarity index 100% rename from docs/assets/kernel/k_vecs.png rename to docs/assets/design/paged_attention/k_vecs.png diff --git a/docs/assets/kernel/key.png b/docs/assets/design/paged_attention/key.png similarity index 100% rename from docs/assets/kernel/key.png rename to docs/assets/design/paged_attention/key.png diff --git a/docs/assets/kernel/logits_vec.png b/docs/assets/design/paged_attention/logits_vec.png similarity index 100% rename from docs/assets/kernel/logits_vec.png rename to docs/assets/design/paged_attention/logits_vec.png diff --git a/docs/assets/kernel/q_vecs.png b/docs/assets/design/paged_attention/q_vecs.png similarity index 100% rename from docs/assets/kernel/q_vecs.png rename to docs/assets/design/paged_attention/q_vecs.png diff --git a/docs/assets/kernel/query.png b/docs/assets/design/paged_attention/query.png 
similarity index 100% rename from docs/assets/kernel/query.png rename to docs/assets/design/paged_attention/query.png diff --git a/docs/assets/kernel/v_vec.png b/docs/assets/design/paged_attention/v_vec.png similarity index 100% rename from docs/assets/kernel/v_vec.png rename to docs/assets/design/paged_attention/v_vec.png diff --git a/docs/assets/kernel/value.png b/docs/assets/design/paged_attention/value.png similarity index 100% rename from docs/assets/kernel/value.png rename to docs/assets/design/paged_attention/value.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-1.png b/docs/assets/design/prefix_caching/example-time-1.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-1.png rename to docs/assets/design/prefix_caching/example-time-1.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-3.png b/docs/assets/design/prefix_caching/example-time-3.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-3.png rename to docs/assets/design/prefix_caching/example-time-3.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-4.png b/docs/assets/design/prefix_caching/example-time-4.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-4.png rename to docs/assets/design/prefix_caching/example-time-4.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-5.png b/docs/assets/design/prefix_caching/example-time-5.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-5.png rename to docs/assets/design/prefix_caching/example-time-5.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-6.png b/docs/assets/design/prefix_caching/example-time-6.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-6.png rename to docs/assets/design/prefix_caching/example-time-6.png diff --git a/docs/assets/design/v1/prefix_caching/example-time-7.png 
b/docs/assets/design/prefix_caching/example-time-7.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/example-time-7.png rename to docs/assets/design/prefix_caching/example-time-7.png diff --git a/docs/assets/design/v1/prefix_caching/free.png b/docs/assets/design/prefix_caching/free.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/free.png rename to docs/assets/design/prefix_caching/free.png diff --git a/docs/assets/design/v1/prefix_caching/overview.png b/docs/assets/design/prefix_caching/overview.png similarity index 100% rename from docs/assets/design/v1/prefix_caching/overview.png rename to docs/assets/design/prefix_caching/overview.png diff --git a/docs/assets/design/tpu/most_model_len.png b/docs/assets/design/tpu/most_model_len.png new file mode 100644 index 0000000000000..344a81ed90801 Binary files /dev/null and b/docs/assets/design/tpu/most_model_len.png differ diff --git a/docs/cli/README.md b/docs/cli/README.md index dfb6051a8c8a6..b1371c82a4c4d 100644 --- a/docs/cli/README.md +++ b/docs/cli/README.md @@ -6,13 +6,13 @@ toc_depth: 4 The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: -``` +```bash vllm --help ``` Available Commands: -``` +```bash vllm {chat,complete,serve,bench,collect-env,run-batch} ``` diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md new file mode 100644 index 0000000000000..a2941c80bd27c --- /dev/null +++ b/docs/configuration/tpu.md @@ -0,0 +1,111 @@ +# TPU Optimization Tips + +This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload. + +## Get started + +Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md). + +### TPU workload sizing + +When selecting the ideal number of chips for a single serving instance, it's important to account for both the model size and the average request context length. 
Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed. + +The following Colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you: + +- KV cache size requirement per token and per request +- TPU/GPU memory consumed by the model weights +- TPU/GPU memory allocated for the KV cache +- Maximum number of requests you can approximately set (`--max-num-seqs`) + +This approach serves as a general rule of thumb. + +#### Latency-throughput tradeoff + +As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency. + +`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request. + +Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload. + +In a similar way, `--max-num-batched-tokens` can be adjusted down to improve latency, or adjusted up to improve throughput. + +#### Compilation and Caching + +Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process.
+ +To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves these compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can take anywhere from a few minutes to an hour, depending on the size of the model and the context length used. + +Although the first compilation can take some time, for all subsequent server launches, vLLM can load these graphs directly from the cache, eliminating the compilation time for future runs. + +Use the `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (for example, when using autoscaling). + +#### Reducing compilation time + +This initial compilation time varies significantly and is impacted by many of the arguments discussed in this optimization doc. Compilation time is driven mainly by model size and arguments such as `--max-num-batched-tokens`; environment variables such as `VLLM_TPU_MOST_MODEL_LEN` also have an effect. + +### Optimize based on your data + +#### max model len vs. most model len + +![most_model_len](../assets/design/tpu/most_model_len.png) + +If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable. + +For example, if 1% of requests are 32k tokens long and 99% are 2k tokens long, you can pass `--max-model-len 32768` and set `VLLM_TPU_MOST_MODEL_LEN=2048`. + +Requests are then subdivided into max-model-len and most-model-len categories; for the latter, performance improves because the server can process more requests at a time.
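The workload-sizing rule of thumb from the Colab calculator mentioned earlier can be sketched as a quick back-of-envelope calculation. The 8B-class model dimensions in the example are illustrative assumptions, not tied to any specific model:

```python
def kv_cache_plan(num_layers, num_kv_heads, head_dim, dtype_bytes,
                  avg_context_len, hbm_free_bytes):
    """Rule-of-thumb sizing: KV-cache bytes per token and per request, plus
    a rough upper bound for --max-num-seqs given HBM left after weights."""
    # each token stores one key and one value vector per layer per KV head
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    bytes_per_request = bytes_per_token * avg_context_len
    max_num_seqs = hbm_free_bytes // bytes_per_request
    return bytes_per_token, bytes_per_request, max_num_seqs

# illustrative 8B-class config: 32 layers, 8 KV heads, head_dim 128, bf16,
# 2k average context, 8 GiB of HBM free for the KV cache
per_tok, per_req, seqs = kv_cache_plan(32, 8, 128, 2, 2048, 8 * 1024**3)
```

Treat the result as a starting point for `--max-num-seqs`, then tune from there as described above.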
+ +#### Padding + +For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128: 128, 256, etc. + +The server pads the requests into fixed lengths before sending them to the model to avoid recompilation. To read more about TPU padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are two ways to pad the requests: + +1) the default exponential padding (pad to the nearest power of 2) +2) bucket padding (pad to the nearest linearly increasing bucket). + +When using bucket padding, the buckets start from 16, end at max_model_len, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`. + +For example, max_model_len=512, padding_gap=64, the buckets will be [16, 32, 64, 128, 192, 256, 320, 384, 448, 512]. + +The fewer padding tokens added, the less unnecessary computation the TPU does and the better the performance. For example, with num_tokens=300, exponential padding pads to 512, while the bucket padding above pads to 320. + +However, choose the padding gap carefully. If the gap is too small, the number of buckets is large, which increases warmup (precompile) time and the memory needed to store the compiled graphs; too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. + +#### Quantization + +If possible, use the precision that matches the chip’s hardware acceleration: + +- v5e has int4/int8 hardware acceleration in the MXU +- v6e has int4/int8 hardware acceleration in the MXU + +Supported quantized formats and features in vLLM on TPU [Jul '25]: + +- INT8 W8A8 +- INT8 W8A16 +- FP8 KV cache +- [WIP] FP8 W8A8 +- [WIP] AWQ +- [WIP] FP4 W4A8 + +#### Parallelization + +Don't set TP to be less than the number of chips on a single-host deployment.
+ +Although it’s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). + +### Tune your workloads + +Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. + +### Future Topics We'll Cover + +#### Profiling + +The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance. + +#### SPMD + +More details to come. + +**Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** diff --git a/docs/contributing/README.md b/docs/contributing/README.md index e3ae5055b9988..5a2a70d57e85f 100644 --- a/docs/contributing/README.md +++ b/docs/contributing/README.md @@ -26,6 +26,8 @@ See . ## Developing +--8<-- "docs/getting_started/installation/python_env_setup.inc.md" + Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source][build-from-source] documentation for details. 
@@ -42,7 +44,7 @@ For an optimized workflow when iterating on C++/CUDA kernels, see the [Increment Install MkDocs along with the [plugins](https://github.com/vllm-project/vllm/blob/main/mkdocs.yaml) used in the vLLM documentation, as well as required dependencies: ```bash -pip install -r requirements/docs.txt +uv pip install -r requirements/docs.txt ``` !!! note @@ -98,13 +100,14 @@ For additional features and advanced configurations, refer to the official [MkDo ??? console "Commands" ```bash - pip install -r requirements/common.txt -r requirements/dev.txt + # These commands are only for Nvidia CUDA platforms. + uv pip install -r requirements/common.txt -r requirements/dev.txt --torch-backend=auto # Linting, formatting and static type checking - pre-commit install --hook-type pre-commit --hook-type commit-msg + pre-commit install # You can manually run pre-commit with - pre-commit run --all-files + pre-commit run --all-files --show-diff-on-failure # To manually run something from CI that does not run # locally by default, you can run: @@ -122,6 +125,10 @@ For additional features and advanced configurations, refer to the official [MkDo Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment. +!!! note "Install python3-dev if Python.h is missing" + If any of the above commands fails with `Python.h: No such file or directory`, install + `python3-dev` with `sudo apt install python3-dev`. + !!! note Currently, the repository is not fully checked by `mypy`. @@ -153,7 +160,7 @@ Using `-s` with `git commit` will automatically add this header. !!! tip You can enable automatic sign-off via your IDE: - + - **PyCharm**: Click on the `Show Commit Options` icon to the right of the `Commit and Push...` button in the `Commit` window. It will bring up a `git` window where you can modify the `Author` and enable `Sign-off commit`. 
- **VSCode**: Open the [Settings editor](https://code.visualstudio.com/docs/configure/settings) diff --git a/docs/contributing/ci/failures.md b/docs/contributing/ci/failures.md index 573efb3b05f6e..d7e2dfbca8760 100644 --- a/docs/contributing/ci/failures.md +++ b/docs/contributing/ci/failures.md @@ -20,19 +20,19 @@ the failure? - **Use this title format:** - ``` + ```text [CI Failure]: failing-test-job - regex/matching/failing:test ``` - **For the environment field:** - ``` - Still failing on main as of commit abcdef123 + ```text + Still failing on main as of commit abcdef123 ``` - **In the description, include failing tests:** - ``` + ```text FAILED failing/test.py:failing_test1 - Failure description FAILED failing/test.py:failing_test2 - Failure description https://github.com/orgs/vllm-project/projects/20 diff --git a/docs/contributing/ci/update_pytorch_version.md b/docs/contributing/ci/update_pytorch_version.md index 5046db11a4715..3a6026d450a67 100644 --- a/docs/contributing/ci/update_pytorch_version.md +++ b/docs/contributing/ci/update_pytorch_version.md @@ -57,8 +57,7 @@ cc the PyTorch release team to initiate discussion on how to address them. ## Update CUDA version -The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example, -`torch2.7.0+cu12.6`) is uploaded to PyPI. However, vLLM may require a different CUDA version, +The PyTorch release matrix includes both stable and experimental [CUDA versions](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix). Due to limitations, only the latest stable CUDA version (for example, torch `2.7.1+cu126`) is uploaded to PyPI. However, vLLM may require a different CUDA version, such as 12.8 for Blackwell support. 
This complicates the process as we cannot use the out-of-the-box `pip install torch torchvision torchaudio` command. The solution is to use
@@ -107,6 +106,7 @@ releases (which would take too much time), they can be built from
 source to unblock the update process.
 
 ### FlashInfer
+
 Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
 
 ```bash
@@ -122,6 +122,7 @@ public location for immediate installation, such as [this FlashInfer wheel link]
 team if you want to get the package published there.
 
 ### xFormers
+
 Similar to FlashInfer, here is how to build and install xFormers from source:
 
 ```bash
@@ -139,7 +140,7 @@ uv pip install --system \
 
 ### causal-conv1d
 
-```
+```bash
 uv pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
 ```
 
diff --git a/docs/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md
index ff69cbae08b23..904ef4ca058c0 100644
--- a/docs/contributing/deprecation_policy.md
+++ b/docs/contributing/deprecation_policy.md
@@ -31,7 +31,7 @@ Features that fall under this policy include (at a minimum) the following:
 The deprecation process consists of several clearly defined stages that span
 multiple Y releases:
 
-**1. Deprecated (Still On By Default)**
+### 1. Deprecated (Still On By Default)
 
 - **Action**: Feature is marked as deprecated.
 - **Timeline**: A removal version is explicitly stated in the deprecation
@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0").
   - GitHub Issue (RFC) for feedback
   - Documentation and use of the `@typing_extensions.deprecated` decorator
     for Python APIs
 
-**2.Deprecated (Off By Default)**
+### 2. Deprecated (Off By Default)
 
 - **Action**: Feature is disabled by default, but can still be re-enabled via a
   CLI flag or environment variable. Feature throws an error when used without
@@ -55,7 +55,7 @@ re-enabling. 
while signaling imminent removal. Ensures any remaining usage is clearly surfaced and blocks silent breakage before full removal. -**3. Removed** +### 3. Removed - **Action**: Feature is completely removed from the codebase. - **Note**: Only features that have passed through the previous deprecation diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index 13c3bc2c7e031..74627e9062167 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -5,7 +5,12 @@ ## Profile with PyTorch Profiler -We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/` +We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`. Additionally, you can control the profiling content by specifying the following environment variables: + +- `VLLM_TORCH_PROFILER_RECORD_SHAPES=1` to enable recording Tensor Shapes, off by default +- `VLLM_TORCH_PROFILER_WITH_PROFILE_MEMORY=1` to record memory, off by default +- `VLLM_TORCH_PROFILER_WITH_STACK=1` to enable recording stack information, on by default +- `VLLM_TORCH_PROFILER_WITH_FLOPS=1` to enable recording FLOPs, off by default The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set. @@ -112,13 +117,13 @@ vllm bench serve \ In practice, you should set the `--duration` argument to a large value. 
Whenever you want the server to stop profiling, run: -``` +```bash nsys sessions list ``` to get the session id in the form of `profile-XXXXX`, then run: -``` +```bash nsys stop --session=profile-XXXXX ``` diff --git a/docs/contributing/vulnerability_management.md b/docs/contributing/vulnerability_management.md index e20b10f8f7b32..847883f742974 100644 --- a/docs/contributing/vulnerability_management.md +++ b/docs/contributing/vulnerability_management.md @@ -32,9 +32,9 @@ We prefer to keep all vulnerability-related communication on the security report on GitHub. However, if you need to contact the VMT directly for an urgent issue, you may contact the following individuals: -- Simon Mo - simon.mo@hey.com -- Russell Bryant - rbryant@redhat.com -- Huzaifa Sidhpurwala - huzaifas@redhat.com +- Simon Mo - +- Russell Bryant - +- Huzaifa Sidhpurwala - ## Slack Discussion diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md index e500751896b34..1f19f2fecfab1 100644 --- a/docs/deployment/docker.md +++ b/docs/deployment/docker.md @@ -10,23 +10,23 @@ The image can be used to run OpenAI compatible server and is available on Docker ```bash docker run --runtime nvidia --gpus all \ -v ~/.cache/huggingface:/root/.cache/huggingface \ - --env "HUGGING_FACE_HUB_TOKEN=" \ + --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ -p 8000:8000 \ --ipc=host \ vllm/vllm-openai:latest \ - --model mistralai/Mistral-7B-v0.1 + --model Qwen/Qwen3-0.6B ``` This image can also be used with other container engines such as [Podman](https://podman.io/). 
```bash -podman run --gpus all \ +podman run --device nvidia.com/gpu=all \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ -p 8000:8000 \ --ipc=host \ - vllm/vllm-openai:latest \ - --model mistralai/Mistral-7B-v0.1 + docker.io/vllm/vllm-openai:latest \ + --model Qwen/Qwen3-0.6B ``` You can add any other [engine-args](../configuration/engine_args.md) you need after the image tag (`vllm/vllm-openai:latest`). @@ -106,8 +106,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `-- -t vllm/vllm-gh200-openai:latest \ --build-arg max_jobs=66 \ --build-arg nvcc_threads=2 \ - --build-arg torch_cuda_arch_list="9.0 10.0+PTX" \ - --build-arg vllm_fa_cmake_gpu_arches="90-real" + --build-arg torch_cuda_arch_list="9.0 10.0+PTX" ``` !!! note diff --git a/docs/deployment/frameworks/anything-llm.md b/docs/deployment/frameworks/anything-llm.md index d6b28a358cc3d..e62a33b2085ca 100644 --- a/docs/deployment/frameworks/anything-llm.md +++ b/docs/deployment/frameworks/anything-llm.md @@ -19,9 +19,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 - Download and install [Anything LLM desktop](https://anythingllm.com/desktop). 
- On the bottom left of open settings, AI Prooviders --> LLM: - - LLM Provider: Generic OpenAI - - Base URL: http://{vllm server host}:{vllm server port}/v1 - - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` + - LLM Provider: Generic OpenAI + - Base URL: http://{vllm server host}:{vllm server port}/v1 + - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` ![](../../assets/deployment/anything-llm-provider.png) @@ -30,9 +30,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 ![](../../assets/deployment/anything-llm-chat-without-doc.png) - Click the upload button: - - upload the doc - - select the doc and move to the workspace - - save and embed + - upload the doc + - select the doc and move to the workspace + - save and embed ![](../../assets/deployment/anything-llm-upload-doc.png) diff --git a/docs/deployment/frameworks/chatbox.md b/docs/deployment/frameworks/chatbox.md index 15f92ed1e34df..cbca6e6282fc6 100644 --- a/docs/deployment/frameworks/chatbox.md +++ b/docs/deployment/frameworks/chatbox.md @@ -19,11 +19,11 @@ vllm serve qwen/Qwen1.5-0.5B-Chat - Download and install [Chatbox desktop](https://chatboxai.app/en#download). 
- On the bottom left of settings, Add Custom Provider - - API Mode: `OpenAI API Compatible` - - Name: vllm - - API Host: `http://{vllm server host}:{vllm server port}/v1` - - API Path: `/chat/completions` - - Model: `qwen/Qwen1.5-0.5B-Chat` + - API Mode: `OpenAI API Compatible` + - Name: vllm + - API Host: `http://{vllm server host}:{vllm server port}/v1` + - API Path: `/chat/completions` + - Model: `qwen/Qwen1.5-0.5B-Chat` ![](../../assets/deployment/chatbox-settings.png) diff --git a/docs/deployment/frameworks/dify.md b/docs/deployment/frameworks/dify.md index a3063194fb513..35f02c33cb02b 100644 --- a/docs/deployment/frameworks/dify.md +++ b/docs/deployment/frameworks/dify.md @@ -34,11 +34,11 @@ docker compose up -d - In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it. - Fill in the model provider details as follows: - - **Model Type**: `LLM` - - **Model Name**: `Qwen/Qwen1.5-7B-Chat` - - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` - - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` - - **Completion Mode**: `Completion` + - **Model Type**: `LLM` + - **Model Name**: `Qwen/Qwen1.5-7B-Chat` + - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` + - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` + - **Completion Mode**: `Completion` ![](../../assets/deployment/dify-settings.png) diff --git a/docs/deployment/frameworks/haystack.md b/docs/deployment/frameworks/haystack.md index a18d68142cabb..70b4b48d4543e 100644 --- a/docs/deployment/frameworks/haystack.md +++ b/docs/deployment/frameworks/haystack.md @@ -1,7 +1,5 @@ # Haystack -# Haystack - [Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. 
Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case. It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. diff --git a/docs/deployment/frameworks/retrieval_augmented_generation.md b/docs/deployment/frameworks/retrieval_augmented_generation.md index 96dd99e7118b6..d5f2ec302b6cd 100644 --- a/docs/deployment/frameworks/retrieval_augmented_generation.md +++ b/docs/deployment/frameworks/retrieval_augmented_generation.md @@ -3,6 +3,7 @@ [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources. Here are the integrations: + - vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus) - vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus) diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md index 497f9f1a92a5d..fae392589c060 100644 --- a/docs/deployment/integrations/production-stack.md +++ b/docs/deployment/integrations/production-stack.md @@ -140,11 +140,12 @@ The core vLLM production stack configuration is managed with YAML. 
Here is the e ``` In this YAML configuration: + * **`modelSpec`** includes: - * `name`: A nickname that you prefer to call the model. - * `repository`: Docker repository of vLLM. - * `tag`: Docker image tag. - * `modelURL`: The LLM model that you want to use. + * `name`: A nickname that you prefer to call the model. + * `repository`: Docker repository of vLLM. + * `tag`: Docker image tag. + * `modelURL`: The LLM model that you want to use. * **`replicaCount`**: Number of replicas. * **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod. * **`requestGPU`**: Specifies the number of GPUs required. diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index f244b0858eb6e..cad801a4312cc 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -5,7 +5,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le - [Deployment with CPUs](#deployment-with-cpus) - [Deployment with GPUs](#deployment-with-gpus) - [Troubleshooting](#troubleshooting) - - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) + - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) - [Conclusion](#conclusion) Alternatively, you can deploy vLLM to Kubernetes using any of the following: diff --git a/docs/design/automatic_prefix_caching.md b/docs/design/automatic_prefix_caching.md deleted file mode 100644 index 60e21f6ad0fcb..0000000000000 --- a/docs/design/automatic_prefix_caching.md +++ /dev/null @@ -1,40 +0,0 @@ -# Automatic Prefix Caching - -The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. 
Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. - -To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block. - -```text - Block 1 Block 2 Block 3 - [A gentle breeze stirred] [the leaves as children] [laughed in the distance] -Block 1: |<--- block tokens ---->| -Block 2: |<------- prefix ------>| |<--- block tokens --->| -Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->| -``` - -In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping: - -```text -hash(prefix tokens + block tokens) <--> KV Block -``` - -With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space. - -This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. 
More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system. - -## Generalized Caching Policy - -Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full. - -Managing KV cache with a hash table allows us to implement flexible caching policies. As an example, in current vLLM, we implement the following eviction policy: - -* When there are no free blocks left, we will evict a KV block with reference count (i.e., number of current requests using the block) equals 0. -* If there are multiple blocks with reference count equals to 0, we prioritize to evict the least recently used block (LRU). -* If there are multiple blocks whose last access time are the same, we prioritize the eviction of the block that is at the end of the longest prefix (i.e., has the maximum number of blocks before it). - -Note that this eviction policy effectively implements the exact policy as in [RadixAttention](https://lmsys.org/blog/2024-01-17-sglang/) when applied to models with full attention, which prioritizes to evict reference count zero and least recent used leaf nodes in the prefix tree. - -However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above: - -* Multi-LoRA serving. 
When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency. -* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images. diff --git a/docs/design/fused_moe_modular_kernel.md b/docs/design/fused_moe_modular_kernel.md new file mode 100644 index 0000000000000..3ef1232051b07 --- /dev/null +++ b/docs/design/fused_moe_modular_kernel.md @@ -0,0 +1,259 @@ +# Fused MoE Modular Kernel + +## Introduction + +FusedMoEModularKernel is implemented [here](gh-file:/vllm/model_executor/layers/fused_moe/modular_kernel.py) + +Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types. + +* Contiguous / Standard / Non-Batched, and +* Batched + +!!! note + The terms Contiguous, Standard, and Non-Batched are used interchangeably throughout the document. + +The input activation format completely depends on the All2All Dispatch being used. + +* In the Contiguous variant, the All2All Dispatch returns the activations as a contiguous tensor of shape (M, K) along with TopK Ids and TopK weights of shape (M, num_topk). Look at `DeepEPHTPrepareAndFinalize` for an example. +* In the Batched variant, the All2All Dispatch returns the activations as a tensor of shape (num_experts, max_tokens, K). Here, the activations/tokens that subscribe to the same expert are batched together. Note that not all entries of the tensor are valid. 
The activations tensor is typically accompanied by an `expert_num_tokens` tensor of size `num_experts`, where `expert_num_tokens[i]` indicates the number of valid tokens that subscribe to the ith expert. Look at `PplxPrepareAndFinalize` or `DeepEPLLPrepareAndFinalize` for an example.
+
+The FusedMoE operation is generally made up of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below.
+
+![](../assets/design/fused_moe_modular_kernel/fused_moe_non_batched.png "FusedMoE Non-Batched")
+
+![](../assets/design/fused_moe_modular_kernel/fused_moe_batched.png "FusedMoE Batched")
+
+!!! note
+    The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain.
+
+## Motivation
+
+As can be seen from the diagrams, there are many operations, and each can have a variety of implementations. The set of ways the operations can be put together to make a valid FusedMoE implementation quickly becomes intractable. The Modular Kernel framework addresses this issue by grouping the operations into logical components. This broad categorization makes the combinations manageable and prevents code duplication. It also decouples the All2All Dispatch & Combine implementations from the FusedMoE implementations, allowing them to be developed and tested independently. Furthermore, the Modular Kernel framework introduces abstract classes for the different components, thus providing a well-defined skeleton for future implementations.
+
+The rest of the document will focus on the Contiguous / Non-Batched case. Extrapolating to the Batched case should be straightforward.
+
+## ModularKernel Components
+
+FusedMoEModularKernel splits the FusedMoE operation into 3 parts:
+
+1. TopKWeightAndReduce
+2. FusedMoEPrepareAndFinalize
+3. 
FusedMoEPermuteExpertsUnpermute
+
+### TopKWeightAndReduce
+
+The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`, but some implementations choose to do it in `FusedMoEPrepareAndFinalize`. To enable this flexibility, we have a TopKWeightAndReduce abstract class.
+
+Please find the implementations of TopKWeightAndReduce [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
+
+The `FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
+The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.
+
+* The `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceNoOp` if the `FusedMoEPermuteExpertsUnpermute` implementation does the weight application and reduction itself.
+* The `FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` method returns `TopKWeightAndReduceContiguous` / `TopKWeightAndReduceNaiveBatched` / `TopKWeightAndReduceDelegate` if the `FusedMoEPermuteExpertsUnpermute` implementation needs the `FusedMoEPrepareAndFinalize::finalize()` to do the weight application and reduction.
+
+### FusedMoEPrepareAndFinalize
+
+The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare` and `finalize` functions.
+The `prepare` function is responsible for input activation Quantization and All2All Dispatch. The `finalize` function is responsible for invoking the All2All Combine.
Additionally, the `finalize` function may or may not do the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).
+
+![](../assets/design/fused_moe_modular_kernel/prepare_and_finalize_blocks.png "FusedMoEPrepareAndFinalize Blocks")
+
+### FusedMoEPermuteExpertsUnpermute
+
+The `FusedMoEPermuteExpertsUnpermute` class is where the crux of the MoE operations happens. The `FusedMoEPermuteExpertsUnpermute` abstract class exposes a few important functions:
+
+* apply()
+* workspace_shapes()
+* finalize_weight_and_reduce_impl()
+
+#### apply()
+
+The `apply` method is where the implementations perform:
+
+* Permute
+* Matmul with weight W1
+* Act + Mul
+* Quantization
+* Matmul with weight W2
+* Unpermute
+* Maybe TopK Weight Application + Reduction
+
+#### workspace_shapes()
+
+The core FusedMoE implementation performs a series of operations. It would be inefficient to create output memory for each of these operations separately. To that end, implementations are required to declare 2 workspace shapes, the workspace datatype and the FusedMoE output shape as outputs of the workspace_shapes() method. This information is used to allocate the workspace tensors and the output tensor in `FusedMoEModularKernel::forward()` and is passed on to the `FusedMoEPermuteExpertsUnpermute::apply()` method. The workspaces can then be used as intermediate buffers in the FusedMoE implementation.
+
+#### finalize_weight_and_reduce_impl()
+
+It is sometimes efficient to perform TopK weight application and reduction inside `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
+`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use. 
+
+![](../assets/design/fused_moe_modular_kernel/fused_experts_blocks.png "FusedMoEPermuteExpertsUnpermute Blocks")
+
+### FusedMoEModularKernel
+
+`FusedMoEModularKernel` is composed of the `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` objects.
+`FusedMoEModularKernel` pseudocode/sketch:
+
+```py
+class FusedMoEModularKernel:
+    def __init__(self,
+                 prepare_finalize: FusedMoEPrepareAndFinalize,
+                 fused_experts: FusedMoEPermuteExpertsUnpermute):
+
+        self.prepare_finalize = prepare_finalize
+        self.fused_experts = fused_experts
+
+    def forward(self, DP_A):
+
+        Aq, A_scale, _, _, _ = self.prepare_finalize.prepare(DP_A, ...)
+
+        workspace13_shape, workspace2_shape, _, _ = self.fused_experts.workspace_shapes(...)
+
+        # allocate workspaces
+        workspace_13 = torch.empty(workspace13_shape, ...)
+        workspace_2 = torch.empty(workspace2_shape, ...)
+
+        # execute fused_experts
+        fe_out = self.fused_experts.apply(Aq, A_scale, workspace_13, workspace_2, ...)
+
+        # war_impl is an object of type TopKWeightAndReduceNoOp if the fused_experts
+        # implementation performs the TopK Weight Application and Reduction itself.
+        war_impl = self.fused_experts.finalize_weight_and_reduce_impl()
+
+        output = self.prepare_finalize.finalize(fe_out, war_impl, ...)
+
+        return output
+```
+
+## How-To
+
+### How To Add a FusedMoEPrepareAndFinalize Type
+
+Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & Combine implementation / kernel. For example,
+
+* PplxPrepareAndFinalize type is backed by Pplx All2All kernels,
+* DeepEPHTPrepareAndFinalize type is backed by DeepEP High-Throughput All2All kernels, and
+* DeepEPLLPrepareAndFinalize type is backed by DeepEP Low-Latency All2All kernels.
+
+#### Step 1: Add an All2All manager
+
+The purpose of the All2All Manager is to set up the All2All kernel implementations. 
The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py).
+
+#### Step 2: Add a FusedMoEPrepareAndFinalize Type
+
+This section describes the significance of the various functions exposed by the `FusedMoEPrepareAndFinalize` abstract class.
+
+`FusedMoEPrepareAndFinalize::prepare()`: The prepare method implements the Quantization and All2All Dispatch. Typically the Dispatch function from the relevant All2All Manager is invoked.
+
+`FusedMoEPrepareAndFinalize::finalize()`: Optionally performs the TopK Weight Application and Reduction, then performs the All2All Combine. Typically the Combine function from the relevant All2All Manager is invoked.
+
+`FusedMoEPrepareAndFinalize::activation_format()`: Return `FusedMoEActivationFormat.BatchedExperts` if the output of the prepare method (i.e. the All2All dispatch) is Batched. Return `FusedMoEActivationFormat.Standard` otherwise.
+
+`FusedMoEPrepareAndFinalize::topk_indices_dtype()`: Data type of the TopK ids. Some All2All kernels have strict requirements pertaining to the data type of the TopK ids. This requirement is passed on to the `FusedMoe::select_experts` function so that it can be respected. If there are no strict requirements, return None.
+
+`FusedMoEPrepareAndFinalize::max_num_tokens_per_rank()`: This is the maximum number of tokens that would be submitted to the All2All Dispatch at once.
+
+`FusedMoEPrepareAndFinalize::num_dispatchers()`: Total number of dispatching units. This value determines the size of the Dispatch output. The Dispatch output is of shape (num_local_experts, max_num_tokens, K), where max_num_tokens = num_dispatchers() * max_num_tokens_per_rank(). 
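To make the prepare/finalize contract above concrete, here is a minimal, framework-free sketch. The class and method names are illustrative only (they are not vLLM's actual API), and the "dispatch" and "combine" steps are stand-ins that run in-process rather than across ranks:

```python
# Hypothetical toy illustration of the prepare/finalize contract.
# Real implementations derive from vLLM's FusedMoEPrepareAndFinalize abstract class.
from dataclasses import dataclass


@dataclass
class ToyPrepareAndFinalize:
    num_dispatchers: int
    max_num_tokens_per_rank: int

    def prepare(self, activations):
        # Stand-in for quantization + All2All Dispatch: pass tokens through unchanged.
        return list(activations)

    def finalize(self, expert_outputs, topk_weights):
        # Stand-in for TopK weight application + reduction + All2All Combine:
        # a weighted sum over each token's top-k expert outputs.
        return [sum(w * x for w, x in zip(ws, xs))
                for ws, xs in zip(topk_weights, expert_outputs)]

    def max_num_tokens(self):
        # Size of the batched dispatch buffer, as described above:
        # num_dispatchers * max_num_tokens_per_rank.
        return self.num_dispatchers * self.max_num_tokens_per_rank


pf = ToyPrepareAndFinalize(num_dispatchers=4, max_num_tokens_per_rank=128)
```

The sketch only mirrors the shape of the contract — where quantization, weight application, and reduction slot in — not the communication semantics of a real All2All kernel.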
We suggest picking an already existing `FusedMoEPrepareAndFinalize` implementation that closely matches your All2All implementation and using it as a reference.
+
+### How To Add a FusedMoEPermuteExpertsUnpermute Type
+
+FusedMoEPermuteExpertsUnpermute performs the core of the FusedMoE operations. The various functions exposed by the abstract class and their significance are as follows:
+
+`FusedMoEPermuteExpertsUnpermute::activation_formats()`: Return the supported Input and Output activation formats, i.e. Contiguous / Batched format.
+
+`FusedMoEPermuteExpertsUnpermute::supports_chunking()`: Return True if the implementation supports chunking. Typically
+implementations that input `FusedMoEActivationFormat.Standard` support chunking and `FusedMoEActivationFormat.BatchedExperts` do not.
+
+`FusedMoEPermuteExpertsUnpermute::supports_expert_map()`: Return True if the implementation supports expert maps.
+
+`FusedMoEPermuteExpertsUnpermute::workspace_shapes()` /
+`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl` /
+`FusedMoEPermuteExpertsUnpermute::apply`: Refer to the `FusedMoEPermuteExpertsUnpermute` section above.
+
+### FusedMoEModularKernel Initialization
+
+The `FusedMoEMethodBase` class has 2 methods that are collectively responsible for creating the `FusedMoEModularKernel` object. They are:
+
+* select_gemm_impl, and
+* init_prepare_finalize
+
+#### select_gemm_impl
+
+The `select_gemm_impl` method is undefined in the base class. It is the responsibility of the derived class to implement a method that constructs a valid/appropriate `FusedMoEPermuteExpertsUnpermute` object.
+Please refer to the implementations in the following derived classes:
+
+* `UnquantizedFusedMoEMethod`
+* `CompressedTensorsW8A8Fp8MoEMethod`
+* `CompressedTensorsW8A8Fp8MoECutlassMethod`
+* `Fp8MoEMethod`
+* `ModelOptNvFp4FusedMoE`
+
+#### init_prepare_finalize
+
+Based on the input and env settings, the `init_prepare_finalize` method creates the appropriate `FusedMoEPrepareAndFinalize` object. The method then queries `select_gemm_impl` for the appropriate `FusedMoEPermuteExpertsUnpermute` object and builds the `FusedMoEModularKernel` object.
+
+Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vllm/blob/1cbf951ba272c230823b947631065b826409fa62/vllm/model_executor/layers/fused_moe/layer.py#L188).
+**Important**: The `FusedMoEMethodBase` derived classes use the `FusedMoEMethodBase::fused_experts` object in their `apply` methods. When settings permit the construction of a valid `FusedMoEModularKernel` object, we override `FusedMoEMethodBase::fused_experts` with it. This essentially makes the derived classes agnostic to which FusedMoE implementation is used.
+
+### How To Unit Test
+
+We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py).
+
+The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and, if they are
+compatible, runs some correctness tests.
+If you are adding new `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementations:
+
+1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](gh-file:tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
+2. Update the `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
+`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`,
+`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](gh-file:tests/kernels/moe/modular_kernel_tools/common.py).
+
+Doing this will add the new implementation to the test suite. 
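The compatibility-driven iteration the unit test performs can be pictured with a small, self-contained sketch. The registries below are illustrative stand-ins for the real tables in `mk_objects.py`, and the format strings stand in for `FusedMoEActivationFormat` values:

```python
from itertools import product

# Illustrative registries mapping implementation name -> activation format.
# The real registries live in tests/kernels/moe/modular_kernel_tools/mk_objects.py.
PREPARE_FINALIZE_TYPES = {
    "PplxPrepareAndFinalize": "batched",
    "DeepEPHTPrepareAndFinalize": "standard",
}
FUSED_EXPERT_TYPES = {
    "BatchedTritonExperts": "batched",
    "TritonExperts": "standard",
}


def compatible_pairs():
    # A (prepare_finalize, fused_experts) pair is testable only when both
    # sides agree on the activation format (batched vs. standard).
    return [
        (pf, fe)
        for (pf, pf_fmt), (fe, fe_fmt) in product(
            PREPARE_FINALIZE_TYPES.items(), FUSED_EXPERT_TYPES.items())
        if pf_fmt == fe_fmt
    ]


pairs = compatible_pairs()
```

Incompatible combinations (e.g. a batched dispatcher with a standard-format experts implementation) are simply skipped, which is the same filtering the standalone script surfaces as an error when invoked with an incompatible pair.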
+
+### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility
+
+The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
+Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
+As a side effect, this script can be used to check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked
+with incompatible types, the script exits with an error.
+
+### How To Profile
+
+Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py).
+The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
+`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.
+Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
+
+## FusedMoEPrepareAndFinalize Implementations
+
+The following table lists the `FusedMoEPrepareAndFinalize` implementations at the time of writing:
+
+| Implementation | Type | Comments |
+| :--- | :--- | :--- |
+| DeepEPHTPrepareAndFinalize | Contiguous / Non-Batched | Uses the DeepEP High-Throughput all2all kernels. |
+| DeepEPLLPrepareAndFinalize | Batched | Uses the DeepEP Low-Latency all2all kernels. |
+| PplxPrepareAndFinalize | Batched | Uses the Perplexity all2all kernels. |
+| FlashInferCutlassMoEPrepareAndFinalize | Contiguous | |
+| MoEPrepareAndFinalizeNoEP | Contiguous | This implementation is used when there is no EP, i.e. no all2all kernels are invoked. |
+| BatchedPrepareAndFinalize | Batched | A reference prepare/finalize class that reorganizes the tokens into expert batched format, i.e. E x max_num_tokens x K. It does not use any all2all kernels and is primarily used in unit testing. |
+
+## FusedMoEPermuteExpertsUnpermute
+
+The following table lists the `FusedMoEPermuteExpertsUnpermute` implementations at the time of writing:
+
+| Implementation | Type | Comments |
+| :--- | :--- | :--- |
+| BatchedDeepGemmExperts | Batched | Uses DeepGemm's Masked Grouped Gemm kernels for the fused_moe operation. |
+| BatchedTritonExperts | Batched | Uses a Triton kernel for the batched matmuls. |
+| BatchedTritonOrDeepGemmExperts | Batched | Chooses either the `BatchedDeepGemmExperts` or `BatchedTritonExperts` based on environment settings. |
+| DeepGemmExperts | Contiguous / Non-Batched | Uses DeepGemm's Grouped Gemm kernels for the fused_moe operation. |
+| TritonExperts | Contiguous / Non-Batched | Uses a Triton kernel for the fused_moe matmuls. |
+| TritonOrDeepGemmExperts | Contiguous / Non-Batched | Chooses either the `DeepGemmExperts` or `TritonExperts` based on fused_moe inputs. |
+| CutlassExpertsFP8 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp8 matmuls. |
+| CutlassExpertsFP4 | Supports both Batched and Contiguous formats | Uses Cutlass Grouped Gemm implementations for the fp4 matmuls. |
+| FlashInferExperts | Contiguous | Uses the fused_moe operation from FlashInfer. |
+| NaiveBatchedExperts | Batched | Reference Batched Experts implementation. Primarily used in unit tests. |
diff --git a/docs/design/huggingface_integration.md b/docs/design/huggingface_integration.md
index 7b01313ddb00a..5a7582c86d49f 100644
--- a/docs/design/huggingface_integration.md
+++ b/docs/design/huggingface_integration.md
@@ -1,4 +1,4 @@
-# Integration with HuggingFace
+# Integration with Hugging Face

This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
diff --git a/docs/design/v1/metrics.md b/docs/design/metrics.md similarity index 99% rename from docs/design/v1/metrics.md rename to docs/design/metrics.md index 52cd320dd4e11..1f65331d3c0a9 100644 --- a/docs/design/v1/metrics.md +++ b/docs/design/metrics.md @@ -223,7 +223,7 @@ And the calculated intervals are: Put another way: -![Interval calculations - common case](../../assets/design/v1/metrics/intervals-1.png) +![Interval calculations - common case](../assets/design/metrics/intervals-1.png) We explored the possibility of having the frontend calculate these intervals using the timing of events visible by the frontend. However, @@ -238,13 +238,13 @@ When a preemption occurs during decode, since any already generated tokens are reused, we consider the preemption as affecting the inter-token, decode, and inference intervals. -![Interval calculations - preempted decode](../../assets/design/v1/metrics/intervals-2.png) +![Interval calculations - preempted decode](../assets/design/metrics/intervals-2.png) When a preemption occurs during prefill (assuming such an event is possible), we consider the preemption as affecting the time-to-first-token and prefill intervals. -![Interval calculations - preempted prefill](../../assets/design/v1/metrics/intervals-3.png) +![Interval calculations - preempted prefill](../assets/design/metrics/intervals-3.png) ### Frontend Stats Collection @@ -361,7 +361,7 @@ instances in Prometheus. 
We use this concept for the `vllm:cache_config_info` metric:

-```
+```text
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0

@@ -686,7 +686,7 @@ documentation for this option states:

The metrics were added by and show up in an OpenTelemetry trace as:

-```
+```text
-> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117)
-> gen_ai.latency.time_in_model_forward: Double(3.151565277099609)
-> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676)
diff --git a/docs/design/v1/multiprocessing.md b/docs/design/multiprocessing.md
similarity index 100%
rename from docs/design/v1/multiprocessing.md
rename to docs/design/multiprocessing.md
diff --git a/docs/design/v1/p2p_nccl_connector.md b/docs/design/p2p_nccl_connector.md
similarity index 94%
rename from docs/design/v1/p2p_nccl_connector.md
rename to docs/design/p2p_nccl_connector.md
index 9d334f8873d97..adf838306bc77 100644
--- a/docs/design/v1/p2p_nccl_connector.md
+++ b/docs/design/p2p_nccl_connector.md
@@ -1,8 +1,11 @@
+# P2P NCCL Connector
+
An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.

-# Detailed Design
+## Detailed Design
+
+### Overall Process

-## Overall Process
As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:

1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
@@ -15,13 +18,13 @@ As shown in Figure 1, the overall process of this **PD disaggregation** solution ![image1](https://github.com/user-attachments/assets/fb01bde6-755b-49f7-ad45-48a94b1e10a7) -## Proxy/Router (Demo) +### Proxy/Router (Demo) A simple HTTP service acts as the entry point for client requests and starts a background thread to listen for P/D instances reporting their HTTP IP and PORT, as well as ZMQ IP and PORT. It maintains a dictionary of `http_addr -> zmq_addr`. The `http_addr` is the IP:PORT for the vLLM instance's request, while the `zmq_addr` is the address for KV cache handshake and metadata reception. The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example: -``` +```text cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0 ``` @@ -29,13 +32,13 @@ Currently, to quickly verify whether xPyD can work, a round-robin selection of 1 Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (currently every 3 seconds) to register (i.e., report `http_addr -> zmq_addr`) and keep the connection alive. If an instance crashes and fails to send a ping for a certain period of time, the Proxy/Router will remove the timed-out instance (this feature has not yet been developed). -## KV Cache Transfer Methods +### KV Cache Transfer Methods There are three methods for KVCache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending KVCache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is an asynchronous transfer method. 
PUT_ASYNC uses a dedicated thread for sending KVCache, which means it does not block the main process. In contrast, the GET method involves the P instance saving the KVCache to the memory buffer after computing the prefill. The D instance then actively retrieves the computed KVCache from the P instance once it has allocated space for the KVCache. Experimental results have shown that the performance of these methods, from highest to lowest, is as follows: PUT_ASYNC → GET → PUT. -## P2P Communication via ZMQ & NCCL +### P2P Communication via ZMQ & NCCL As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank and world size. To support dynamic scaling (expansion and contraction) of instances with PD disaggregation. This means that adding or removing P/D instances does not require a full system restart. @@ -43,7 +46,7 @@ Each P/D instance only needs to create a single `P2pNcclEngine` instance. This i When a P instance and a D instance transmit KVCache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KVCache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is equal to 2. This design is intended to support dynamic scaling, which means that adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KVCache transmission can be performed, without being restricted by rank or world size. -## NCCL Group Topology +### NCCL Group Topology Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP (Tensor Parallelism) degree of 2. 
There are a total of 7 NCCL groups: three vLLM instances each have one NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance. Similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance. @@ -51,7 +54,7 @@ Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCa Each NCCL group occupies a certain amount of GPU memory buffer for communication, the size of which is primarily influenced by the `NCCL_MAX_NCHANNELS` environment variable. When `NCCL_MAX_NCHANNELS=16`, an NCCL group typically occupies 100MB, while when `NCCL_MAX_NCHANNELS=8`, it usually takes up 52MB. For large-scale xPyD configurations—such as DeepSeek's 96P144D—this implementation is currently not feasible. Moving forward, we are considering using RDMA for point-to-point communication and are also keeping an eye on UCCL. -## GPU Memory Buffer and Tensor Memory Pool +### GPU Memory Buffer and Tensor Memory Pool The trade-off in the size of the memory buffer is as follows: For P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large. Similarly, for P instances in GET mode, the memory buffer should also not be too large. The memory buffer of D instances is used to temporarily store KVCache sent by P instances. If it is too large, it will reduce the KVCache space available for normal inference by D instances, thereby decreasing the inference batch size and ultimately leading to a reduction in output throughput. The size of the memory buffer is configured by the parameter `kv_buffer_size`, measured in bytes, and is typically set to 5%~10% of the memory size. 
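The sizing guidance above can be turned into a back-of-the-envelope sketch. The helper functions below are illustrative, not part of vLLM; the group-count formula generalizes the 1P2D description (one TP group per instance, plus one two-rank group per P/D GPU pair at the same TP rank):

```python
def nccl_group_count(num_p: int, num_d: int, tp: int) -> int:
    # One TP group per instance, plus one two-rank group per (P GPU, D GPU)
    # pair at the same TP rank.
    return (num_p + num_d) + tp * num_p * num_d


def suggested_kv_buffer_size(gpu_mem_bytes: int, fraction: float = 0.10) -> int:
    # Empirical guidance from the text: 5%~10% of GPU memory, in bytes.
    return int(gpu_mem_bytes * fraction)


# The 1P2D, TP=2 figure above: 3 instance TP groups + 4 cross-instance groups.
assert nccl_group_count(1, 2, tp=2) == 7

# 10% of an 80 GB GPU gives 8e9 bytes, matching the decode-instance examples.
assert suggested_kv_buffer_size(80 * 10**9) == 8 * 10**9
```

The cross-instance term grows with `num_p * num_d`, which is why very large xPyD configurations quickly become infeasible with per-pair NCCL groups.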
@@ -59,15 +62,16 @@ If the `--max-num-seqs` parameter for P instances is set to a large value, due t To address the above issues, I have designed and developed a local Tensor memory pool for storing KVCache, inspired by the buddy system used in Linux memory modules. Since the memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or using block-based designs to reuse memory, thereby saving space. When the memory buffer is insufficient, KVCache can be directly stored in the Tensor memory pool, and D instances can subsequently retrieve KVCache from it. The read and write speed is that of PCIe, with PCIe 4.0 having a speed of approximately 21 GB/s, which is usually faster than the Prefill speed. Otherwise, solutions like Mooncake and lmcache would not be necessary. The Tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a Cache store. -# Install vLLM +## Install vLLM ```shell pip install "vllm>=0.9.2" ``` -# Run xPyD +## Run xPyD + +### Instructions -## Instructions - The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model. - Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput. - For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. 
However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance. @@ -79,16 +83,16 @@ pip install "vllm>=0.9.2" - Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`. - In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**. -## Run 1P3D +### Run 1P3D -### Proxy (e.g. 10.0.1.1) +#### Proxy (e.g. 10.0.1.1) ```shell cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/ python3 disagg_proxy_p2p_nccl_xpyd.py & ``` -### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) +#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) ??? console "Command" @@ -105,12 +109,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 & ``` -### Decode1 (e.g. 10.0.1.3 or 10.0.1.1) +#### Decode1 (e.g. 10.0.1.3 or 10.0.1.1) ??? console "Command" @@ -127,12 +130,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.7 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 & ``` -### Decode2 (e.g. 10.0.1.4 or 10.0.1.1) +#### Decode2 (e.g. 10.0.1.4 or 10.0.1.1) ??? 
console "Command" @@ -149,12 +151,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.7 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 & ``` -### Decode3 (e.g. 10.0.1.5 or 10.0.1.1) +#### Decode3 (e.g. 10.0.1.5 or 10.0.1.1) ??? console "Command" @@ -171,21 +172,20 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.7 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 & ``` -## Run 3P1D +### Run 3P1D -### Proxy (e.g. 10.0.1.1) +#### Proxy (e.g. 10.0.1.1) ```shell cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/ python3 disagg_proxy_p2p_nccl_xpyd.py & ``` -### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) +#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1) ??? console "Command" @@ -202,12 +202,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 & ``` -### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1) +#### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1) ??? 
console "Command" @@ -224,12 +223,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 & ``` -### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1) +#### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1) ??? console "Command" @@ -246,12 +244,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.9 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 & ``` -### Decode1 (e.g. 10.0.1.5 or 10.0.1.1) +#### Decode1 (e.g. 10.0.1.5 or 10.0.1.1) ??? console "Command" @@ -268,12 +265,11 @@ python3 disagg_proxy_p2p_nccl_xpyd.py & --max-num-seqs 256 \ --trust-remote-code \ --gpu-memory-utilization 0.7 \ - --disable-log-request \ --kv-transfer-config \ '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 & ``` -# Single request +## Single request ```shell curl -X POST -s http://10.0.1.1:10001/v1/completions \ @@ -286,7 +282,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \ }' ``` -# Benchmark +## Benchmark ??? 
console "Command" @@ -310,14 +306,14 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \ --num-prompts 1000 ``` -# Shut down +## Shut down ```shell pgrep python | xargs kill -9 && pkill -f python ``` -# Test data +## Test data -## **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s +### **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s ![testdata](https://github.com/user-attachments/assets/cef0953b-4567-4bf9-b940-405b92a28eb1) diff --git a/docs/design/kernel/paged_attention.md b/docs/design/paged_attention.md similarity index 94% rename from docs/design/kernel/paged_attention.md rename to docs/design/paged_attention.md index 94bfa97ee2217..fb991a35caf30 100644 --- a/docs/design/kernel/paged_attention.md +++ b/docs/design/paged_attention.md @@ -1,4 +1,8 @@ -# vLLM Paged Attention +# Paged Attention + +!!! warning + This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180). + It no longer describes the code used in vLLM today. Currently, vLLM utilizes its own implementation of a multi-head query attention kernel (`csrc/attention/attention_kernels.cu`). @@ -136,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE; ```
- ![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" } + ![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
Each thread defines its own `q_ptr` which points to the assigned @@ -145,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains total of 128 elements divided into 128 / 4 = 32 vecs.
- ![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" } + ![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
```cpp @@ -184,7 +188,7 @@ points to key token data based on `k_cache` at assigned block, assigned head and assigned token.
- ![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" } + ![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
The diagram above illustrates the memory layout for key data. It @@ -199,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one thread group) separately.
- ![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" } + ![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
```cpp @@ -358,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of ## Value
- ![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" } + ![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
- ![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" } + ![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
- ![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" } + ![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
Now we need to retrieve the value data and perform dot multiplication @@ -495,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) { Finally, we need to iterate over different assigned head positions and write out the corresponding accumulated result based on the `out_ptr`. + +## Citation + +```bibtex +@inproceedings{kwon2023efficient, + title={Efficient Memory Management for Large Language Model Serving with PagedAttention}, + author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica}, + booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles}, + year={2023} +} +``` diff --git a/docs/design/plugin_system.md b/docs/design/plugin_system.md index 23a05ac719ce2..ca1c2c2305d91 100644 --- a/docs/design/plugin_system.md +++ b/docs/design/plugin_system.md @@ -1,4 +1,4 @@ -# vLLM's Plugin System +# Plugin System The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM. diff --git a/docs/design/v1/prefix_caching.md b/docs/design/prefix_caching.md similarity index 90% rename from docs/design/v1/prefix_caching.md rename to docs/design/prefix_caching.md index 2d3c8412894a6..9941837bf1652 100644 --- a/docs/design/v1/prefix_caching.md +++ b/docs/design/prefix_caching.md @@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified * Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision. * Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments. 
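The chained hash key described above can be sketched as a small, self-contained illustration. This mirrors the idea (a hash over parent hash, block tokens, and extra keys), not vLLM's actual implementation:

```python
import hashlib
import pickle


def block_hash(parent_hash, block_tokens, extra=None):
    # The hash of a block covers its parent's hash (and therefore all
    # prefix tokens), the exact tokens in this block, and any extra keys
    # such as a LoRA ID, multi-modal input hash, or cache salt.
    payload = pickle.dumps((parent_hash, tuple(block_tokens), extra))
    return hashlib.sha256(payload).digest()


h0 = block_hash(None, (1, 2, 3, 4))                        # first full block
h1 = block_hash(h0, (5, 6, 7, 8))                          # chained to h0
salted = block_hash(None, (1, 2, 3, 4), extra="tenant-a")  # isolated cache

assert h1 != block_hash(None, (5, 6, 7, 8))  # same tokens, different prefix
assert salted != h0                          # salt isolates tenants
```

Because the parent hash is folded in, a single block hash uniquely identifies (up to collisions) all tokens up to and including that block.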
-> **Note 1:** We only cache full blocks.
+!!! note "Note 1"
+    We only cache full blocks.

-> **Note 2:** The above hash key structure is not 100% collision free. Theoretically it’s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash.
-SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).
+!!! note "Note 2"
+    The above hash key structure is not 100% collision free. Theoretically it’s still possible for different prefix tokens to have the same hash value. To avoid hash collisions **in a multi-tenant setup, we advise using SHA256** as the hash function instead of the default builtin hash.
+    SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context).

**A hashing example with multi-modality inputs**

In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:

@@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache

With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others.

-> **Note:** Cache isolation is not supported in engine V0.
+!!! note
+    Cache isolation is not supported in engine V0.
## Data Structure

@@ -122,7 +125,7 @@ There are two design points to highlight:

As a result, we will have the following components when the KV cache manager is initialized:

-![Component Overview](../../assets/design/v1/prefix_caching/overview.png)
+![Component Overview](../assets/design/prefix_caching/overview.png)

* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Only store the pointers of head and tail blocks for manipulations.

@@ -192,7 +195,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund

When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and block 2, 3, 4, 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.

-![Free queue after a request us freed](../../assets/design/v1/prefix_caching/free.png)
+![Free queue after a request is freed](../assets/design/prefix_caching/free.png)

### Eviction (LRU)

@@ -208,24 +211,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),

**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.

-![Example Time 1](../../assets/design/v1/prefix_caching/example-time-1.png)
+![Example Time 1](../assets/design/prefix_caching/example-time-1.png)

**Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
-![Example Time 3](../../assets/design/v1/prefix_caching/example-time-3.png) +![Example Time 3](../assets/design/prefix_caching/example-time-3.png) **Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens. -![Example Time 4](../../assets/design/v1/prefix_caching/example-time-4.png) +![Example Time 4](../assets/design/prefix_caching/example-time-4.png) **Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1. -![Example Time 5](../../assets/design/v1/prefix_caching/example-time-5.png) +![Example Time 5](../assets/design/prefix_caching/example-time-5.png) **Time 6: Request 1 is finished and free.** -![Example Time 6](../../assets/design/v1/prefix_caching/example-time-6.png) +![Example Time 6](../assets/design/prefix_caching/example-time-6.png) **Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted). 
-![Example Time 7](../../assets/design/v1/prefix_caching/example-time-7.png) +![Example Time 7](../assets/design/prefix_caching/example-time-7.png) diff --git a/docs/design/v1/torch_compile.md b/docs/design/torch_compile.md similarity index 99% rename from docs/design/v1/torch_compile.md rename to docs/design/torch_compile.md index ea5d8ac212f7a..47ac4958dbf7f 100644 --- a/docs/design/v1/torch_compile.md +++ b/docs/design/torch_compile.md @@ -1,4 +1,4 @@ -# vLLM's `torch.compile` integration +# `torch.compile` integration In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage. @@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d In the very verbose logs, we can see: -``` +```console INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile ``` @@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual In the very verbose logs, we can also see: -``` +```console DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py') ... @@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa If we already have the cache directory (e.g. 
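The Time 7 bookkeeping above can be reproduced with a short sketch that models the free queue as an ordered set. This illustrates the touch-and-evict order only, not vLLM's actual data structures:

```python
from collections import OrderedDict

# Free queue after Time 6: 7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0
# (front of the queue is evicted first).
free_queue = OrderedDict.fromkeys([7, 8, 9, 4, 3, 2, 6, 5, 1, 0])

# Cache-hit blocks (0, 1, 2) are touched: removed from the free queue
# before allocation.
for blk in (0, 1, 2):
    free_queue.pop(blk)
assert list(free_queue) == [7, 8, 9, 4, 3, 6, 5]

# Request 2 needs 8 blocks (29 tokens / block size 4): 3 cached hits plus
# 5 blocks popped (evicted) from the front of the free queue.
allocated = [0, 1, 2] + [free_queue.popitem(last=False)[0] for _ in range(5)]
assert allocated == [0, 1, 2, 7, 8, 9, 4, 3]
```

The assertions match the allocation order stated in the text: 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).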
run the same code for the second time), we will see the following logs: -``` +```console DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') ``` diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md index 8be1585f8e76b..5b08b3810776c 100644 --- a/docs/features/compatibility_matrix.md +++ b/docs/features/compatibility_matrix.md @@ -34,23 +34,26 @@ th:not(:first-child) { } -| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | pooling | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | +| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| -| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | | -| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | | -| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | +| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | +| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | +| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | -| pooling | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | | +| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | enc-dec | ❌ | [❌](gh-issue:7366) | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | | logP | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ 
| ❌ | ✅ | ✅ | ✅ | | | | | | | async output | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | | multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | -| mm | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | +| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](gh-pr:4194)^ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | | best-of | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ✅ | ✅ | | | beam-search | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ❔ | ✅ | ✅ | +\* Chunked prefill and prefix caching are only applicable to last-token pooling. +^ LoRA is only applicable to the language backbone of multimodal models. + [](){ #feature-x-hardware } ## Feature x Hardware @@ -62,9 +65,9 @@ th:not(:first-child) { | [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | -| pooling | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ❌ | +| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | | enc-dec | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | -| mm | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | +| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | | logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | | prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | | async output | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | diff --git a/docs/features/lora.md b/docs/features/lora.md index ea1b495138c1b..a4e05dae11c2e 100644 --- a/docs/features/lora.md +++ b/docs/features/lora.md @@ -119,6 +119,7 @@ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True ``` ### Using API Endpoints + Loading a LoRA Adapter: To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary @@ -156,6 +157,7 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \ ``` ### Using Plugins + Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. 
LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter.

You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds.

diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md
index d4c8852206bba..cdd32924b5668 100644
--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -172,6 +172,36 @@ Multi-image input can be extended to perform video captioning. We show this with
     print(generated_text)
 ```
 
+#### Custom RGBA Background Color
+
+When loading RGBA images (images with transparency), vLLM converts them to RGB format. By default, transparent pixels are replaced with a white background. You can customize this background color using the `rgba_background_color` parameter in `media_io_kwargs`.
+
+??? code
+
+    ```python
+    from vllm import LLM
+
+    # Default white background (no configuration needed)
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
+
+    # Custom black background for dark theme
+    llm = LLM(
+        model="llava-hf/llava-1.5-7b-hf",
+        media_io_kwargs={"image": {"rgba_background_color": [0, 0, 0]}}
+    )
+
+    # Custom brand color background (e.g., blue)
+    llm = LLM(
+        model="llava-hf/llava-1.5-7b-hf",
+        media_io_kwargs={"image": {"rgba_background_color": [0, 0, 255]}}
+    )
+    ```
+
+!!! 
note + - The `rgba_background_color` accepts RGB values as a list `[R, G, B]` or tuple `(R, G, B)` where each value is 0-255 + - This setting only affects RGBA images with transparency; RGB images are unchanged + - If not specified, the default white background `(255, 255, 255)` is used for backward compatibility + ### Video Inputs You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary @@ -478,6 +508,20 @@ Full example: ``` +#### Custom RGBA Background Color + +To use a custom background color for RGBA images, pass the `rgba_background_color` parameter via `--media-io-kwargs`: + +```bash +# Example: Black background for dark theme +vllm serve llava-hf/llava-1.5-7b-hf \ + --media-io-kwargs '{"image": {"rgba_background_color": [0, 0, 0]}}' + +# Example: Custom gray background +vllm serve llava-hf/llava-1.5-7b-hf \ + --media-io-kwargs '{"image": {"rgba_background_color": [128, 128, 128]}}' +``` + ### Audio Inputs Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in). 
@@ -588,7 +632,9 @@ Full example: +th { + white-space: nowrap; + min-width: 0 !important; +} + + | Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | AWS Neuron | Google TPU | |-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|--------------|--------------| -| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | -| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | -| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | -| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | -| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ | +| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | +| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ | +| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | +| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | +| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ | | BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | -| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | -| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ | +| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | +| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ | - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0. - ✅︎ indicates that the quantization method is supported on the specified hardware. 
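As background for the `INT8 (W8A8)` row above, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization, the general scheme that entry refers to. This is plain Python for illustration only, not vLLM code; `quantize_int8` and `dequantize` are hypothetical helpers.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization, the general
# scheme the "INT8 (W8A8)" row above refers to. Illustration only; this
# is not vLLM's implementation.

def quantize_int8(values):
    """Map floats onto [-127, 127] using a single symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the int8 codes and the shared scale."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)   # q == [50, -127, 0, 64]
restored = dequantize(q, scale)
# The round-trip error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 for a, b in zip(weights, restored))
```

In a W8A8 kernel both the weights and the activations are stored this way, which is why hardware support for int8 matrix instructions (see the table) matters.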
diff --git a/docs/features/quantization/torchao.md b/docs/features/quantization/torchao.md index ab6802177048b..6932445997012 100644 --- a/docs/features/quantization/torchao.md +++ b/docs/features/quantization/torchao.md @@ -13,6 +13,7 @@ pip install \ ``` ## Quantizing HuggingFace Models + You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code: ??? code diff --git a/docs/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md index 6b84eca275309..04b943efbbbb4 100644 --- a/docs/features/reasoning_outputs.md +++ b/docs/features/reasoning_outputs.md @@ -123,13 +123,12 @@ OpenAI Python client library does not officially support `reasoning_content` att printed_content = False for chunk in stream: - reasoning_content = None - content = None - # Check the content is reasoning_content or content - if hasattr(chunk.choices[0].delta, "reasoning_content"): - reasoning_content = chunk.choices[0].delta.reasoning_content - elif hasattr(chunk.choices[0].delta, "content"): - content = chunk.choices[0].delta.content + # Safely extract reasoning_content and content from delta, + # defaulting to None if attributes don't exist or are empty strings + reasoning_content = ( + getattr(chunk.choices[0].delta, "reasoning_content", None) or None + ) + content = getattr(chunk.choices[0].delta, "content", None) or None if reasoning_content is not None: if not printed_reasoning_content: diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode.md index be4b91feda7aa..89d5b489e1888 100644 --- a/docs/features/spec_decode.md +++ b/docs/features/spec_decode.md @@ -15,6 +15,10 @@ Speculative decoding is a technique which improves inter-token latency in memory The 
following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

+!!! warning
+    In vLLM v0.10.0, speculative decoding with a draft model is not supported.
+    If you use the following code, you will get a `NotImplementedError`.
+
 ??? code
 
     ```python
diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md
index 4f737afa80f55..8a934d406f382 100644
--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -103,7 +103,7 @@ The next example shows how to use the `guided_json` parameter with a Pydantic mo
             "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
         }
     ],
-    "response_format": {
+    response_format={
         "type": "json_schema",
         "json_schema": {
             "name": "car-description",
diff --git a/docs/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md
index 2d2598da943c7..7a34d47d8e494 100644
--- a/docs/getting_started/installation/cpu.md
+++ b/docs/getting_started/installation/cpu.md
@@ -164,7 +164,7 @@ Note, it is recommended to manually reserve 1 CPU for vLLM front-end process whe
 
 ### How to decide `VLLM_CPU_KVCACHE_SPACE`?
 
-  - This value is 4GB by default. Larger space can support more concurrent requests, longer context length. However, users should take care of memory capacity of each NUMA node.
+This value is 4GB by default. A larger space can support more concurrent requests and longer context lengths. However, users should take care of the memory capacity of each NUMA node. 
The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`; if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.

 ### How to do performance tuning for vLLM CPU?
 
@@ -183,13 +183,13 @@ vLLM CPU supports tensor parallel (TP) and pipeline parallel (PP) to leverage mu
 
 ### Which quantization configs does vLLM CPU support?
 
-  - vLLM CPU supports quantizations:
+- vLLM CPU supports the following quantizations:
     - AWQ (x86 only)
     - GPTQ (x86 only)
     - compressed-tensor INT8 W8A8 (x86, s390x)
 
 ### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?
 
-  - Both of them requires `amx` CPU flag.
+- Both of them require the `amx` CPU flag.
     - `VLLM_CPU_MOE_PREPACK` can provides better performance for MoE models
     - `VLLM_CPU_SGL_KERNEL` can provides better performance for MoE models and small-batch scenarios.
 
diff --git a/docs/getting_started/installation/cpu/apple.inc.md b/docs/getting_started/installation/cpu/apple.inc.md
index 0816f38ac68a1..2828173a76a9a 100644
--- a/docs/getting_started/installation/cpu/apple.inc.md
+++ b/docs/getting_started/installation/cpu/apple.inc.md
@@ -1,6 +1,6 @@
 # --8<-- [start:installation]
 
-vLLM has experimental support for macOS with Apple silicon. For now, users shall build from the source vLLM to natively run on macOS.
+vLLM has experimental support for macOS with Apple silicon. For now, users must build from source to natively run on macOS.
 
 Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
 
@@ -23,20 +23,20 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
 
 # --8<-- [end:pre-built-wheels]
 # --8<-- [start:build-wheel-from-source]
 
-After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source. 
+After installing Xcode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.
 
 ```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
-pip install -r requirements/cpu.txt
-pip install -e .
+uv pip install -r requirements/cpu.txt
+uv pip install -e .
 ```
 
 !!! note
-    On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
+    On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which is currently the only supported device.
 
 !!! example "Troubleshooting"
-    If the build has error like the following snippet where standard C++ headers cannot be found, try to remove and reinstall your
+    If the build fails with errors like the following where standard C++ headers cannot be found, try to remove and reinstall your
     [Command Line Tools for Xcode](https://developer.apple.com/download/all/).
 
     ```text
diff --git a/docs/getting_started/installation/cpu/build.inc.md b/docs/getting_started/installation/cpu/build.inc.md
index fa777fe0c8a1a..57a09e674a821 100644
--- a/docs/getting_started/installation/cpu/build.inc.md
+++ b/docs/getting_started/installation/cpu/build.inc.md
@@ -1,4 +1,4 @@
-First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:
+First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. 
For example, on Ubuntu 22.04, you can run:
 
 ```bash
 sudo apt-get update -y
@@ -6,28 +6,34 @@ sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certific
 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 ```
 
-Second, clone vLLM project:
+Second, clone the vLLM project:
 
 ```bash
 git clone https://github.com/vllm-project/vllm.git vllm_source
 cd vllm_source
 ```
 
-Third, install Python packages for vLLM CPU backend building:
+Third, install required dependencies:
 
 ```bash
-pip install --upgrade pip
-pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
-pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
+uv pip install -r requirements/cpu-build.txt --torch-backend auto
+uv pip install -r requirements/cpu.txt --torch-backend auto
 ```
 
-Finally, build and install vLLM CPU backend:
+??? console "pip"
+    ```bash
+    pip install --upgrade pip
+    pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
+    pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
+    ```
+
+Finally, build and install vLLM:
 
 ```bash
 VLLM_TARGET_DEVICE=cpu python setup.py install
 ```
 
-If you want to develop vllm, install it in editable mode instead.
+If you want to develop vLLM, install it in editable mode instead.
 
 ```bash
 VLLM_TARGET_DEVICE=cpu python setup.py develop
diff --git a/docs/getting_started/installation/cpu/s390x.inc.md b/docs/getting_started/installation/cpu/s390x.inc.md
index acfb3396896bf..c1917267ce91b 100644
--- a/docs/getting_started/installation/cpu/s390x.inc.md
+++ b/docs/getting_started/installation/cpu/s390x.inc.md
@@ -1,6 +1,6 @@
 # --8<-- [start:installation]
 
-vLLM has experimental support for s390x architecture on IBM Z platform. For now, users shall build from the vLLM source to natively run on IBM Z platform. 
+vLLM has experimental support for s390x architecture on IBM Z platform. For now, users must build from source to natively run on IBM Z platform. Currently the CPU implementation for s390x architecture supports FP32 datatype only. @@ -40,21 +40,32 @@ curl https://sh.rustup.rs -sSf | sh -s -- -y && \ . "$HOME/.cargo/env" ``` -Execute the following commands to build and install vLLM from the source. +Execute the following commands to build and install vLLM from source. !!! tip - Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM. + Please build the following dependencies, `torchvision`, `pyarrow` from source before building vLLM. ```bash sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds - pip install -v \ - --extra-index-url https://download.pytorch.org/whl/nightly/cpu \ + uv pip install -v \ + --torch-backend auto \ -r requirements-build.txt \ -r requirements-cpu.txt \ VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \ - pip install dist/*.whl + uv pip install dist/*.whl ``` +??? console "pip" + ```bash + sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds + pip install -v \ + --extra-index-url https://download.pytorch.org/whl/nightly/cpu \ + -r requirements-build.txt \ + -r requirements-cpu.txt \ + VLLM_TARGET_DEVICE=cpu python setup.py bdist_wheel && \ + pip install dist/*.whl + ``` + # --8<-- [end:build-wheel-from-source] # --8<-- [start:pre-built-images] @@ -63,19 +74,19 @@ Execute the following commands to build and install vLLM from the source. ```bash docker build -f docker/Dockerfile.s390x \ - --tag vllm-cpu-env . + --tag vllm-cpu-env . 
-# Launching OpenAI server
+# Launch OpenAI server
 docker run --rm \
-    --privileged=true \
-    --shm-size=4g \
-    -p 8000:8000 \
-    -e VLLM_CPU_KVCACHE_SPACE= \
-    -e VLLM_CPU_OMP_THREADS_BIND= \
-    vllm-cpu-env \
-    --model=meta-llama/Llama-3.2-1B-Instruct \
-    --dtype=float \
-    other vLLM OpenAI server arguments
+    --privileged \
+    --shm-size 4g \
+    -p 8000:8000 \
+    -e VLLM_CPU_KVCACHE_SPACE= \
+    -e VLLM_CPU_OMP_THREADS_BIND= \
+    vllm-cpu-env \
+    --model meta-llama/Llama-3.2-1B-Instruct \
+    --dtype float \
+    other vLLM OpenAI server arguments
 ```
 
 # --8<-- [end:build-image-from-source]
diff --git a/docs/getting_started/installation/google_tpu.md b/docs/getting_started/installation/google_tpu.md
index 55d69d11fa401..6f09babb3aba0 100644
--- a/docs/getting_started/installation/google_tpu.md
+++ b/docs/getting_started/installation/google_tpu.md
@@ -85,7 +85,7 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
 | PROJECT_ID | Your Google Cloud project |
 | ZONE | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones] |
 | ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions]. |
-| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images]. |
+| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). |
 | SERVICE_ACCOUNT | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. 
For example: `tpu-service-account@.iam.gserviceaccount.com` | Connect to your TPU VM using SSH: @@ -94,6 +94,9 @@ Connect to your TPU VM using SSH: gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE ``` +!!! note + When configuring `RUNTIME_VERSION` ("TPU software version") on GCP, ensure it matches the TPU generation you've selected by referencing the [TPU VM images] compatibility matrix. Using an incompatible version may prevent vLLM from running correctly. + [TPU versions]: https://cloud.google.com/tpu/docs/runtimes [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes [TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones diff --git a/docs/getting_started/installation/gpu/cuda.inc.md b/docs/getting_started/installation/gpu/cuda.inc.md index 5ca5296d0a657..69a9842e4719b 100644 --- a/docs/getting_started/installation/gpu/cuda.inc.md +++ b/docs/getting_started/installation/gpu/cuda.inc.md @@ -20,16 +20,16 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I # --8<-- [end:set-up-using-python] # --8<-- [start:pre-built-wheels] -You can install vLLM using either `pip` or `uv pip`: - ```bash -# Install vLLM with CUDA 12.8. -# If you are using pip. -pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128 -# If you are using uv. uv pip install vllm --torch-backend=auto ``` +??? console "pip" + ```bash + # Install vLLM with CUDA 12.8. + pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128 + ``` + We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`). If this doesn't work, try running `uv self update` to update `uv` first. !!! 
note @@ -38,10 +38,10 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions: ```bash -# Install vLLM with CUDA 11.8. -export VLLM_VERSION=0.6.1.post1 -export PYTHON_VERSION=312 -uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118 +# Install vLLM with a specific CUDA version (e.g., 11.8 or 12.6). +export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//') +export CUDA_VERSION=118 # or 126 +uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION} ``` [](){ #install-the-latest-code } @@ -50,36 +50,22 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on a x86 platform with CUDA 12 for every commit since `v0.5.3`. -##### Install the latest code using `pip` - -```bash -pip install -U vllm \ - --pre \ - --extra-index-url https://wheels.vllm.ai/nightly -``` - -`--pre` is required for `pip` to consider pre-released versions. - -Another way to install the latest code is to use `uv`: - ```bash uv pip install -U vllm \ --torch-backend=auto \ --extra-index-url https://wheels.vllm.ai/nightly ``` -##### Install specific revisions using `pip` +??? 
console "pip" + ```bash + pip install -U vllm \ + --pre \ + --extra-index-url https://wheels.vllm.ai/nightly + ``` -If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL: + `--pre` is required for `pip` to consider pre-released versions. -```bash -export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch -pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl -``` - -Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in the wheel metadata (the wheels listed in the extra index url have correct versions). Although we don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the wheels are still built with Python 3.8 ABI to keep the same wheel name as before. - -##### Install specific revisions using `uv` +##### Install specific revisions If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL: @@ -92,17 +78,35 @@ uv pip install vllm \ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. 
In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version. +??? note "pip" + If you want to access the wheels for previous commits (e.g. to bisect the behavior change, + performance regression), due to the limitation of `pip`, you have to specify the full URL of the + wheel file by embedding the commit hash in the URL: + + ```bash + export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch + pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl + ``` + + Note that the wheels are built with Python 3.8 ABI (see [PEP + 425](https://peps.python.org/pep-0425/) for more details about ABI), so **they are compatible + with Python 3.8 and later**. The version string in the wheel file name (`1.0.0.dev`) is just a + placeholder to have a unified URL for the wheels, the actual versions of wheels are contained in + the wheel metadata (the wheels listed in the extra index url have correct versions). Although we + don't support Python 3.8 any more (because PyTorch 2.5 dropped support for Python 3.8), the + wheels are still built with Python 3.8 ABI to keep the same wheel name as before. + # --8<-- [end:pre-built-wheels] # --8<-- [start:build-wheel-from-source] #### Set up using Python-only build (without compilation) -If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM: +If you only need to change Python code, you can build and install vLLM without compilation. 
Using `uv pip`'s [`--editable` flag](https://docs.astral.sh/uv/pip/packages/#editable-packages), changes you make to the code will be reflected when you run vLLM: ```bash git clone https://github.com/vllm-project/vllm.git cd vllm -VLLM_USE_PRECOMPILED=1 pip install --editable . +VLLM_USE_PRECOMPILED=1 uv pip install --editable . ``` This command will do the following: @@ -121,7 +125,7 @@ In case you see an error about wheel not found when running the above command, i ```bash export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl -pip install --editable . +uv pip install --editable . ``` You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code]. @@ -137,7 +141,7 @@ If you want to modify C++ or CUDA code, you'll need to build vLLM from source. T ```bash git clone https://github.com/vllm-project/vllm.git cd vllm -pip install -e . +uv pip install -e . ``` !!! tip @@ -152,14 +156,14 @@ pip install -e . The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`. !!! note "Faster Kernel Development" - For frequent C++/CUDA kernel changes, after the initial `pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code. + For frequent C++/CUDA kernel changes, after the initial `uv pip install -e .` setup, consider using the [Incremental Compilation Workflow](../../contributing/incremental_build.md) for significantly faster rebuilds of only the modified kernel code. 
##### Use an existing PyTorch installation -There are scenarios where the PyTorch dependency cannot be easily installed via pip, e.g.: +There are scenarios where the PyTorch dependency cannot be easily installed with `uv`, e.g.: - Building vLLM with PyTorch nightly or a custom PyTorch build. -- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run `pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124` to [install PyTorch nightly](https://pytorch.org/get-started/locally/), and then build vLLM on top of it. +- Building vLLM with aarch64 and CUDA (GH200), where the PyTorch wheels are not available on PyPI. Currently, only the PyTorch nightly has wheels for aarch64 with CUDA. You can run `uv pip install --index-url https://download.pytorch.org/whl/nightly/cu128 torch torchvision torchaudio` to [install PyTorch nightly](https://pytorch.org/get-started/locally/) and then build vLLM on top of it. To build vLLM using an existing PyTorch installation: @@ -167,8 +171,8 @@ To build vLLM using an existing PyTorch installation: git clone https://github.com/vllm-project/vllm.git cd vllm python use_existing_torch.py -pip install -r requirements/build.txt -pip install --no-build-isolation -e . +uv pip install -r requirements/build.txt +uv pip install --no-build-isolation -e . ``` ##### Use the local cutlass for compilation @@ -179,7 +183,7 @@ To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to po ```bash git clone https://github.com/vllm-project/vllm.git cd vllm -VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e . +VLLM_CUTLASS_SRC_DIR=/path/to/cutlass uv pip install -e . ``` ##### Troubleshooting @@ -189,7 +193,7 @@ to be run simultaneously, via the environment variable `MAX_JOBS`. For example: ```bash export MAX_JOBS=6 -pip install -e . +uv pip install -e . 
``` This is especially useful when you are building on less powerful machines. For example, when you use WSL it only [assigns 50% of the total memory by default](https://learn.microsoft.com/en-us/windows/wsl/wsl-config#main-wsl-settings), so using `export MAX_JOBS=1` can avoid compiling multiple files simultaneously and running out of memory. @@ -228,7 +232,7 @@ Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing: ```bash export VLLM_TARGET_DEVICE=empty -pip install -e . +uv pip install -e . ``` # --8<-- [end:build-wheel-from-source] diff --git a/docs/getting_started/installation/intel_gaudi.md b/docs/getting_started/installation/intel_gaudi.md index 0be0d02d0679c..61b2b02aa10ba 100644 --- a/docs/getting_started/installation/intel_gaudi.md +++ b/docs/getting_started/installation/intel_gaudi.md @@ -339,13 +339,13 @@ Each described step is logged by vLLM server, as follows (negative values corres - `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism - * `{phase}` is either `PROMPT` or `DECODE` + - `{phase}` is either `PROMPT` or `DECODE` - * `{dim}` is either `BS`, `SEQ` or `BLOCK` + - `{dim}` is either `BS`, `SEQ` or `BLOCK` - * `{param}` is either `MIN`, `STEP` or `MAX` + - `{param}` is either `MIN`, `STEP` or `MAX` - * Default values: + - Default values: | `{phase}` | Parameter | Env Variable | Value Expression | |-----------|-----------|--------------|------------------| diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md index 74235db16a15d..f833807666460 100644 --- a/docs/getting_started/quickstart.md +++ b/docs/getting_started/quickstart.md @@ -98,6 +98,43 @@ for output in outputs: print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") ``` +!!! note + The `llm.generate` method does not automatically apply the model's chat template to the input prompt. 
Therefore, if you are using an Instruct model or Chat model, you should manually apply the corresponding chat template to ensure the expected behavior. Alternatively, you can use the `llm.chat` method and pass a list of messages that have the same format as those passed to OpenAI's `client.chat.completions`:
+
+    ??? code
+
+        ```python
+        # Using the tokenizer to apply the chat template
+        from transformers import AutoTokenizer
+
+        tokenizer = AutoTokenizer.from_pretrained("/path/to/chat_model")
+        messages_list = [
+            [{"role": "user", "content": prompt}]
+            for prompt in prompts
+        ]
+        texts = tokenizer.apply_chat_template(
+            messages_list,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+
+        # Generate outputs
+        outputs = llm.generate(texts, sampling_params)
+
+        # Print the outputs.
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+        # Using the chat interface.
+        outputs = llm.chat(messages_list, sampling_params)
+        for idx, output in enumerate(outputs):
+            prompt = prompts[idx]
+            generated_text = output.outputs[0].text
+            print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+        ```
+

[](){ #quickstart-online }

## OpenAI-Compatible Server

@@ -126,6 +163,7 @@ curl http://localhost:8000/v1/models
```

You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY` to enable the server to check for an API key in the header.
+You can pass multiple keys after `--api-key`; the server will accept any of them, which can be useful for key rotation.
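+To make the multiple-key behavior concrete, here is a minimal plain-Python sketch of the check a server performs during a key rotation. This is an illustration of the semantics only, not vLLM's implementation, and the key values are made up:
+
+    ```python
+    # Sketch of bearer-token checking against several accepted keys, as when
+    # serving with e.g. `--api-key old-secret new-secret` during a rotation.
+    # Illustrative only; not vLLM's actual code.
+    ALLOWED_KEYS = {"old-secret", "new-secret"}
+
+    def is_authorized(headers: dict[str, str]) -> bool:
+        """Accept a request if it carries any of the configured keys."""
+        scheme, _, token = headers.get("Authorization", "").partition(" ")
+        return scheme == "Bearer" and token in ALLOWED_KEYS
+
+    print(is_authorized({"Authorization": "Bearer new-secret"}))  # True
+    print(is_authorized({"Authorization": "Bearer old-secret"}))  # True
+    print(is_authorized({"Authorization": "Bearer revoked"}))     # False
+    ```
+
+Once all clients have moved to the new key, restart the server with only that key so the old one stops being accepted.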
### OpenAI Completions API with vLLM

diff --git a/docs/models/hardware_supported_models/tpu.md b/docs/models/hardware_supported_models/tpu.md
index da03a3b3160ad..7b0a5ba6e72da 100644
--- a/docs/models/hardware_supported_models/tpu.md
+++ b/docs/models/hardware_supported_models/tpu.md
@@ -1,7 +1,8 @@
# TPU

-# TPU Supported Models
-## Text-only Language Models
+## Supported Models
+
+### Text-only Language Models

| Model | Architecture | Supported |
|-----------------------------------------------------|--------------------------------|-----------|
diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md
index a06d86523af1a..c6588363b63fb 100644
--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md
@@ -7,9 +7,9 @@ These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract
before returning them.

!!! note
-    We currently support pooling models primarily as a matter of convenience.
-    As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
-    pooling models as they only work on the generation or decode stage, so performance may not improve as much.
+    We currently support pooling models primarily as a matter of convenience. They are not guaranteed to be any faster than using HF Transformers / Sentence Transformers directly.
+
+    We are now planning to optimize pooling models in vLLM. Please comment if you have any suggestions!
## Configuration

@@ -45,14 +45,14 @@ Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:

-| Task       | APIs               |
-|------------|--------------------|
-| `encode`   | `encode`           |
-| `embed`    | `embed`, `score`\* |
-| `classify` | `classify`         |
-| `score`    | `score`            |
+| Task       | APIs                                 |
+|------------|--------------------------------------|
+| `encode`   | `LLM.reward(...)`                    |
+| `embed`    | `LLM.embed(...)`, `LLM.score(...)`\* |
+| `classify` | `LLM.classify(...)`                  |
+| `score`    | `LLM.score(...)`                     |

-\* The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `LLM.score(...)` API falls back to the `embed` task if the model does not support the `score` task.

### Pooler Configuration

@@ -66,11 +66,11 @@ you can override some of its attributes via the `--override-pooler-config` option

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

-| Task       | Pooling Type   | Normalization | Softmax |
-|------------|----------------|---------------|---------|
-| `encode`   | `ALL`          | ❌            | ❌      |
-| `embed`    | `LAST`         | ✅︎            | ❌      |
-| `classify` | `LAST`         | ❌            | ✅︎      |
+| Task       | Pooling Type | Normalization | Softmax |
+|------------|--------------|---------------|---------|
+| `reward`   | `ALL`        | ❌            | ❌      |
+| `embed`    | `LAST`       | ✅︎            | ❌      |
+| `classify` | `LAST`       | ❌            | ✅︎      |

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

@@ -83,21 +83,6 @@ which takes priority over both the model's and Sentence Transformers's defaults.

The [LLM][vllm.LLM] class provides various methods for offline inference.
See [configuration][configuration] for a list of options when initializing the model.
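+To make the pooler defaults in the table above concrete, here is a plain-Python sketch of the three behaviors (`ALL`/`LAST` pooling, normalization, softmax). The hidden-state values are illustrative and this is not vLLM's pooler code:
+
+    ```python
+    import math
+
+    # One hidden-state vector per token (illustrative values)
+    hidden_states = [[0.5, 1.0], [2.0, 0.0], [1.0, 3.0]]
+
+    # `LAST` pooling keeps only the final token's hidden state
+    last = hidden_states[-1]
+
+    # `embed` default: LAST pooling + normalization to unit L2 norm
+    norm = math.sqrt(sum(x * x for x in last))
+    embedding = [x / norm for x in last]
+
+    # `classify` default: LAST pooling + softmax over the logits
+    exps = [math.exp(x) for x in last]
+    probs = [e / sum(exps) for e in exps]
+
+    # `reward` default: `ALL` pooling returns every token's hidden state unchanged
+    reward_output = hidden_states
+
+    print(f"embedding={embedding}\nprobs={probs}")
+    ```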
-### `LLM.encode` - -The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM. -It returns the extracted hidden states directly, which is useful for reward models. - -```python -from vllm import LLM - -llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling") -(output,) = llm.encode("Hello, my name is") - -data = output.outputs.data -print(f"Data: {data!r}") -``` - ### `LLM.embed` The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt. @@ -106,7 +91,7 @@ It is primarily designed for embedding models. ```python from vllm import LLM -llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling") +llm = LLM(model="intfloat/e5-small", runner="pooling") (output,) = llm.embed("Hello, my name is") embeds = output.outputs.embedding @@ -135,7 +120,7 @@ A code example can be found here: +### `LLM.reward` + +The [reward][vllm.LLM.reward] method is available to all reward models in vLLM. +It returns the extracted hidden states directly. + +```python +from vllm import LLM + +llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True) +(output,) = llm.reward("Hello, my name is") + +data = output.outputs.data +print(f"Data: {data!r}") +``` + +A code example can be found here: + +### `LLM.encode` + +The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM. +It returns the extracted hidden states directly. + +!!! note + Please use one of the more specific methods or set the task directly when using `LLM.encode`: + + - For embeddings, use `LLM.embed(...)` or `pooling_task="embed"`. + - For classification logits, use `LLM.classify(...)` or `pooling_task="classify"`. + - For rewards, use `LLM.reward(...)` or `pooling_task="reward"`. + - For similarity scores, use `LLM.score(...)`. 
+ +```python +from vllm import LLM + +llm = LLM(model="intfloat/e5-small", runner="pooling") +(output,) = llm.encode("Hello, my name is", pooling_task="embed") + +data = output.outputs.data +print(f"Data: {data!r}") +``` + ## Online Serving Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs: diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md index 763dd8fd50451..017a339ffca0c 100644 --- a/docs/models/supported_models.md +++ b/docs/models/supported_models.md @@ -45,10 +45,10 @@ If a model is neither supported natively by vLLM or Transformers, it can still b For a model to be compatible with the Transformers backend for vLLM it must: - be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)): - * The model directory must have the correct structure (e.g. `config.json` is present). - * `config.json` must contain `auto_map.AutoModel`. + - The model directory must have the correct structure (e.g. `config.json` is present). + - `config.json` must contain `auto_map.AutoModel`. - be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]): - * Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`). + - Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`). If the compatible model is: @@ -134,10 +134,10 @@ class MyConfig(PretrainedConfig): - `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported). 
- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s: - * You only need to do this for layers which are not present on all pipeline stages - * vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages - * The `list` in the first element of the `tuple` contains the names of the input arguments - * The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code + - You only need to do this for layers which are not present on all pipeline stages + - vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages + - The `list` in the first element of the `tuple` contains the names of the input arguments + - The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code ## Loading a Model @@ -255,7 +255,7 @@ export https_proxy=http://your.proxy.server:port https_proxy=http://your.proxy.server:port huggingface-cli download # or use vllm cmd directly -https_proxy=http://your.proxy.server:port vllm serve --disable-log-requests +https_proxy=http://your.proxy.server:port vllm serve ``` - Set the proxy in Python interpreter: @@ -311,6 +311,8 @@ See [this page](generative_models.md) for more information on how to use generat #### Text Generation +These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API. +
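+The `base_model_tp_plan` / `base_model_pp_plan` attributes described earlier can be sketched for a hypothetical decoder-only model as follows. All layer-name patterns and argument names here are illustrative, not taken from any real model:
+
+```python
+# Hypothetical parallelism plans for a model whose decoder stack lives in
+# an `nn.ModuleList` named "layers"; names are illustrative only.
+
+# Fully qualified layer-name patterns -> tensor parallel style
+# (only "colwise" and "rowwise" are supported)
+base_model_tp_plan = {
+    "layers.*.self_attn.q_proj": "colwise",
+    "layers.*.self_attn.o_proj": "rowwise",
+    "layers.*.mlp.up_proj": "colwise",
+    "layers.*.mlp.down_proj": "rowwise",
+}
+
+# Direct child layer names -> (input argument names, output variable names),
+# listed only for layers that are not present on every pipeline stage
+base_model_pp_plan = {
+    "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+    "layers": (["hidden_states"], ["hidden_states"]),
+    "norm": (["hidden_states"], ["hidden_states"]),
+}
+
+assert set(base_model_tp_plan.values()) <= {"colwise", "rowwise"}
+```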