mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-07-30 16:07:56 +08:00

support qwen3-vl handle requests with embeddings (#30037 )

Signed-off-by: taoyun <1069423820@qq.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

2025-12-04 17:34:06 +00:00

quantization

Remove VLLM_SKIP_WARMUP tip (#29331 )

2025-11-24 22:16:05 +00:00

automatic_prefix_caching.md

[Docs] Reduce custom syntax used in docs (#27009 )

2025-10-16 20:05:34 -07:00

batch_invariance.md

Batch invariance doc (#27839 )

2025-10-31 17:22:19 -04:00

custom_arguments.md

[Doc]: fix typos in various files (#28811 )

2025-11-16 14:30:06 +00:00

custom_logitsprocs.md

[Doc]: fix typos in various files (#28811 )

2025-11-16 14:30:06 +00:00

disagg_encoder.md

[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233 )

2025-11-11 18:58:33 -08:00

disagg_prefill.md

[Doc]: fix typos in various files (#28863 )

2025-11-17 20:32:14 -08:00

interleaved_thinking.md

[Frontend] supports interleaved thinking (#28531 )

2025-11-13 16:14:13 +08:00

lora.md

[Doc]: fix typos in various files (#28863 )

2025-11-17 20:32:14 -08:00

mooncake_connector_usage.md

[P/D] Introduce Mooncake Transfer Engine as kv_connector (#24718 )

2025-12-04 09:51:36 +00:00

multimodal_inputs.md

support qwen3-vl handle requests with embeddings (#30037 )

2025-12-04 17:34:06 +00:00

nixl_connector_usage.md

[Doc]: fix typos in various files (#29010 )

2025-11-19 04:56:21 -08:00

prompt_embeds.md

[Frontend] Require flag for loading text and image embeds (#27204 )

2025-10-22 15:52:02 +00:00

README.md

[Model][6/N] Improve all pooling task | Support chunked prefill with ALL pooling (#27145 )

2025-12-04 13:44:15 +00:00

reasoning_outputs.md

[Misc] Refactor tokenizer interface (#29693 )

2025-11-29 04:02:21 -08:00

sleep_mode.md

[Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695 )

2025-11-12 15:24:12 -08:00

spec_decode.md

[Spec Decode] Integrate Suffix Decoding from Arctic Inference (#25784 )

2025-11-03 09:23:31 -08:00

structured_outputs.md

Scheduled removal of guided_* config fields (#29326 )

2025-11-25 05:24:05 +00:00

tool_calling.md

[Misc] Refactor tokenizer interface (#29693 )

2025-11-29 04:02:21 -08:00

README.md

Features

Compatibility Matrix

The tables below show which features are mutually exclusive and which features are supported on specific hardware.

The symbols used have the following meanings:

✅ = Full compatibility
🟠 = Partial compatibility
❌ = No compatibility
❔ = Unknown or TBD

!!! note
    Check the ❌ or 🟠 entries that have links to see the tracking issue for the unsupported feature/hardware combination.

Feature x Feature

Feature	CP	APC	LoRA	SD	CUDA graph	pooling	enc-dec	logP	prmpt logP	async output	multi-step	mm	best-of	beam-search	prompt-embeds
CP	✅
APC	✅	✅
LoRA	✅	✅	✅
SD	✅	✅	❌	✅
CUDA graph	✅	✅	✅	✅	✅
pooling	🟠*	🟠*	✅	❌	✅	✅
enc-dec	❌	❌	❌	❌	✅	✅	✅
logP	✅	✅	✅	✅	✅	❌	✅	✅
prmpt logP	✅	✅	✅	✅	✅	❌	✅	✅	✅
async output	✅	✅	✅	❌	✅	❌	❌	✅	✅	✅
multi-step	❌	✅	❌	❌	✅	❌	❌	✅	✅	✅	✅
mm	✅	✅	🟠^	❔	✅	✅	✅	✅	✅	✅	❔	✅
best-of	✅	✅	✅	❌	✅	❌	✅	✅	✅	❔	❌	✅	✅
beam-search	✅	✅	✅	❌	✅	❌	✅	✅	✅	❔	❌	❔	✅	✅
prompt-embeds	✅	✅	✅	❌	✅	❌	❌	✅	❌	❔	❔	❌	❔	❔	✅

* Chunked prefill and prefix caching are only applicable to last-token or all pooling with causal attention.
^ LoRA is only applicable to the language backbone of multimodal models.
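
The Feature x Feature table is lower-triangular: each pair of features appears only once, in the row of whichever feature is listed later. The following is a minimal sketch of how such a table can be queried in either order. It is not part of vLLM; the names `FEATURES`, `LOWER_TRIANGLE`, and `compatibility` are hypothetical, and only the first four rows of the table are copied in.

```python
# Hypothetical helper, not a vLLM API: look up a feature pair in a
# lower-triangular compatibility table regardless of argument order.

# First four rows of the Feature x Feature table, copied as-is.
FEATURES = ["CP", "APC", "LoRA", "SD"]
LOWER_TRIANGLE = [
    ["✅"],                    # CP
    ["✅", "✅"],              # APC
    ["✅", "✅", "✅"],        # LoRA
    ["✅", "✅", "❌", "✅"],  # SD
]


def compatibility(a: str, b: str) -> str:
    """Return the table symbol for the pair (a, b)."""
    i, j = FEATURES.index(a), FEATURES.index(b)
    # The entry lives in the row of the later feature and the
    # column of the earlier one.
    row, col = max(i, j), min(i, j)
    return LOWER_TRIANGLE[row][col]


print(compatibility("LoRA", "SD"))
print(compatibility("SD", "LoRA"))
```

Both calls read the same cell, so the result does not depend on the order in which the two features are given.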

Feature x Hardware

Feature	Volta	Turing	Ampere	Ada	Hopper	CPU	AMD	Intel GPU
CP	❌	✅	✅	✅	✅	✅	✅	✅
APC	❌	✅	✅	✅	✅	✅	✅	✅
LoRA	✅	✅	✅	✅	✅	✅	✅	✅
SD	✅	✅	✅	✅	✅	❌	✅	🟠
CUDA graph	✅	✅	✅	✅	✅	❌	✅	❌
pooling	✅	✅	✅	✅	✅	✅	✅	✅
enc-dec	✅	✅	✅	✅	✅	✅	❌	✅
mm	✅	✅	✅	✅	✅	✅	✅	🟠
prompt-embeds	✅	✅	✅	✅	✅	✅	❔	✅
logP	✅	✅	✅	✅	✅	✅	✅	✅
prmpt logP	✅	✅	✅	✅	✅	✅	✅	✅
async output	✅	✅	✅	✅	✅	❌	❌	✅
multi-step	✅	✅	✅	✅	✅	❌	✅	✅
best-of	✅	✅	✅	✅	✅	✅	✅	✅
beam-search	✅	✅	✅	✅	✅	✅	✅	✅
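
As an illustration of reading the Feature x Hardware table, the sketch below lists the hardware columns on which a feature is not marked ✅. It is a hypothetical helper, not a vLLM API; the names `HARDWARE`, `SUPPORT`, and `not_fully_supported` are assumptions, and only three rows of the table are copied in.

```python
# Hypothetical helper, not a vLLM API: report hardware where a feature
# is anything other than fully compatible.

HARDWARE = ["Volta", "Turing", "Ampere", "Ada", "Hopper", "CPU", "AMD", "Intel GPU"]

# Three rows of the Feature x Hardware table, copied as-is.
SUPPORT = {
    "CP": ["❌", "✅", "✅", "✅", "✅", "✅", "✅", "✅"],
    "SD": ["✅", "✅", "✅", "✅", "✅", "❌", "✅", "🟠"],
    "CUDA graph": ["✅", "✅", "✅", "✅", "✅", "❌", "✅", "❌"],
}


def not_fully_supported(feature: str) -> dict[str, str]:
    """Map each hardware column to its symbol when the symbol is not ✅."""
    return {
        hw: symbol
        for hw, symbol in zip(HARDWARE, SUPPORT[feature])
        if symbol != "✅"
    }


print(not_fully_supported("SD"))
```

For the `SD` row this reports the CPU and Intel GPU columns, which carry ❌ and 🟠 in the table.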

!!! note
    For information on feature support on Google TPU, please refer to the TPU-Inference Recommended Models and Features documentation.