[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (#23716)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
This commit is contained in:
parent a403d0fa41
commit 704432af3c
@@ -107,14 +107,16 @@ to enable simultaneous generation and embedding using the same engine instance i
 #### Mamba Models
 
 Models using selective state-space mechanisms instead of standard transformer attention are supported.
 
-Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported. Please note that these models currently require disabling prefix caching in V1.
+Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`) are supported.
+Please note that prefix caching is not yet supported for these models.
 
 Models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
-`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`). Please note that
-these models currently require disabling prefix caching in V1.
+`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`).
+Please note that prefix caching is not yet supported for these models.
 
 Hybrid models with mechanisms different to Mamba are also supported (e.g, `MiniMaxText01ForCausalLM`, `MiniMaxM1ForCausalLM`).
-Please note that these models currently require disabling prefix caching and enforcing eager mode in V1.
+Please note that prefix caching is not yet supported for these models.
+It is also necessary to enforce eager mode for these models in V1.
 
 #### Encoder-Decoder Models
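For context, the old wording above reflects what users previously had to do by hand: disable prefix caching for these architectures themselves (and, for the MiniMax-style hybrids, also enforce eager mode). A minimal sketch of that manual setup with the offline `LLM` API follows; the model name is only an illustrative placeholder, and after this commit the prefix-caching override happens automatically inside the engine.

```python
from vllm import LLM, SamplingParams

# Illustrative placeholder: any of the Mamba/hybrid architectures listed above.
llm = LLM(
    model="ibm-ai-platform/Bamba-9B",
    enable_prefix_caching=False,  # previously required manually; now forced off by the engine
    enforce_eager=True,           # only required for the MiniMax-style hybrids listed above
)

outputs = llm.generate(
    ["Mamba layers keep a recurrent state, so"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```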
@@ -292,12 +292,13 @@ class MambaModelConfig(VerifyAndUpdateConfig):
             return
 
         model_config = vllm_config.model_config
+        cache_config = vllm_config.cache_config
         compilation_config = vllm_config.compilation_config
 
-        model_cls, _ = ModelRegistry.resolve_model_cls(
-            model_config.architecture,
-            model_config=model_config,
-        )
+        # TODO(tdoublep): remove once prefix caching is enabled
+        cache_config.enable_prefix_caching = False
+        logger.info("Hybrid or mamba-based model detected: disabling prefix "
+                    "caching since it is not yet supported.")
 
         # TODO(tdoublep): remove as full cuda graph support is added
         FCG_NOT_SUPPORTED_MODELS = [
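The behavioral change lives in the `verify_and_update_config` hook shown in the hunk above: when a hybrid or mamba-based model is detected, the cache config is mutated to turn prefix caching off before the engine starts. Below is a simplified, self-contained sketch of that pattern. The `CacheConfig` and `VllmConfig` classes here are illustrative stand-ins, not the real vLLM config objects (which carry many more fields); only the force-disable logic mirrors the diff.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


# Simplified stand-ins for vLLM's config objects (illustrative only).
@dataclass
class CacheConfig:
    enable_prefix_caching: bool = True


@dataclass
class VllmConfig:
    cache_config: CacheConfig = field(default_factory=CacheConfig)


class MambaModelConfig:
    """Sketch of the verify-and-update hook: force-disable prefix caching
    for hybrid / mamba-based models, mirroring the commit above."""

    @classmethod
    def verify_and_update_config(cls, vllm_config: VllmConfig) -> None:
        cache_config = vllm_config.cache_config
        # Temporary override, as in the commit: remove once prefix caching
        # is supported for these models.
        cache_config.enable_prefix_caching = False
        logger.info("Hybrid or mamba-based model detected: disabling prefix "
                    "caching since it is not yet supported.")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    cfg = VllmConfig()
    MambaModelConfig.verify_and_update_config(cfg)
    # Prefix caching ends up disabled regardless of the default.
    assert cfg.cache_config.enable_prefix_caching is False
```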