
Features

Compatibility Matrix

The tables below show which features are mutually exclusive and which features are supported on specific hardware.

The symbols used have the following meanings:

  • ✅ = Full compatibility
  • 🟠 = Partial compatibility
  • ❌ = No compatibility
  • ❔ = Unknown or TBD

!!! note Check the ❌ or 🟠 entries with links to see the tracking issue for an unsupported feature/hardware combination.

Feature x Feature

Feature CP APC LoRA SD CUDA graph pooling enc-dec logP prmpt logP async output multi-step mm best-of beam-search prompt-embeds
CP
APC
LoRA
SD
CUDA graph
pooling 🟠* 🟠*
enc-dec
logP
prmpt logP
async output
multi-step
mm 🟠^
best-of
beam-search
prompt-embeds

* Chunked prefill and prefix caching are only applicable to last-token or all pooling with causal attention.
^ LoRA is only applicable to the language backbone of multimodal models.
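
The two footnotes above can be read as simple predicates. The following is a minimal sketch of that reading, not vLLM's implementation: the function names, the pooling-type strings (`"last"`, `"all"`), and the component labels (`"language_backbone"`, `"vision_encoder"`) are hypothetical and chosen only for illustration.

```python
# Hypothetical helpers that encode the two footnoted constraints.
# None of these names come from vLLM; they only illustrate the rules.

# Assumed labels for pooling types that may use CP and APC (footnote *).
CP_APC_POOLING_TYPES = {"last", "all"}


def pooling_supports_cp_and_apc(pooling_type: str, causal_attention: bool) -> bool:
    """Footnote *: chunked prefill and prefix caching apply only to
    last-token or all pooling, and only with causal attention."""
    return causal_attention and pooling_type in CP_APC_POOLING_TYPES


def lora_applies_to(component: str) -> bool:
    """Footnote ^: for multimodal models, LoRA applies only to the
    language backbone (assumed label: "language_backbone")."""
    return component == "language_backbone"


if __name__ == "__main__":
    print(pooling_supports_cp_and_apc("last", causal_attention=True))
    print(pooling_supports_cp_and_apc("mean", causal_attention=True))
    print(lora_applies_to("vision_encoder"))
```

Under these assumptions, a pooling model with mean pooling or non-causal attention falls into the "partial" cells because neither chunked prefill nor prefix caching applies to it.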

Feature x Hardware

Feature Volta Turing Ampere Ada Hopper CPU AMD Intel GPU
CP
APC
LoRA
SD 🟠
CUDA graph
pooling
enc-dec
mm 🟠
prompt-embeds
logP
prmpt logP
async output
multi-step
best-of
beam-search

!!! note For information on feature support on Google TPU, please refer to the TPU-Inference Recommended Models and Features documentation.