CustomOp
CustomOp is an abstract class used for dispatching the forward method of various operations to the appropriate backend. It also offers a mechanism for both vLLM and OOT (Out-Of-Tree) plugins to register their custom operations.
This document will introduce how CustomOp works in vLLM and how to implement a new CustomOp.
How CustomOp Works in vLLM
CustomOp maintains two class-level dictionaries of custom ops (op classes indexed by their registered names): one for vLLM itself and one for OOT plugins.
??? code
```python
class CustomOp(nn.Module):
    op_registry: dict[str, type["CustomOp"]] = {}
    op_registry_oot: dict[str, type["CustomOp"]] = {}
```
We can use `@CustomOp.register("op_name")` to register an op class into the CustomOp system. After this, the op name and its class are added to the `op_registry` dictionary. In addition, we can register an OOT op with `@CustomOp.register_oot("op_name")`; this mechanism is introduced in detail later.
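The registration mechanism can be sketched as follows. This is a simplified toy version for illustration only, not the actual vLLM implementation (the real `CustomOp` extends `nn.Module` and performs additional bookkeeping):

```python
# Simplified sketch of the registration mechanism (illustrative only;
# the real CustomOp in vLLM extends nn.Module and does more work).
class CustomOp:
    op_registry: dict[str, type["CustomOp"]] = {}
    op_registry_oot: dict[str, type["CustomOp"]] = {}

    @classmethod
    def register(cls, name: str):
        def decorator(op_cls: type["CustomOp"]) -> type["CustomOp"]:
            # Store the op class under its registered name.
            cls.op_registry[name] = op_cls
            return op_cls

        return decorator


@CustomOp.register("my_op")
class MyOp(CustomOp):
    pass
```

After the decorator runs, `CustomOp.op_registry["my_op"]` maps to `MyOp`.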
When a CustomOp is called (i.e., its `forward()` method is invoked), if it is enabled, it automatically dispatches the forward method to the appropriate backend according to `current_platform`. Otherwise (i.e., if it is disabled), it only calls `forward_native()`, the PyTorch-native implementation of the forward method.
- CPU platform: dispatch to `forward_cpu()`.
- CUDA platform: dispatch to `forward_cuda()`.
- ROCm platform: dispatch to `forward_hip()`. If `forward_hip()` is not implemented, `forward_cuda()` is used as a fallback.
- XPU platform: dispatch to `forward_xpu()`.
- TPU platform: dispatch to `forward_tpu()`.
- OOT platform: dispatch to `forward_oot()`. This is only called on OOT platforms.
- Default: dispatch to `forward_native()` as the final fallback for all platforms.
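Conceptually, the dispatch can be sketched like this. This is a hypothetical, condensed version (class and logic are illustrative, not vLLM's exact code, which also covers OOT platforms and config-driven enabling):

```python
# Hypothetical, condensed sketch of backend dispatch (not vLLM's exact code).
class ToyCustomOp:
    def __init__(self, platform: str, enabled: bool = True):
        if not enabled:
            # Disabled ops always use the PyTorch-native implementation.
            self._forward = self.forward_native
        elif platform == "rocm":
            # ROCm prefers forward_hip() but falls back to forward_cuda().
            self._forward = getattr(self, "forward_hip", self.forward_cuda)
        else:
            # Other platforms fall back to forward_native() when the
            # platform-specific method is missing.
            self._forward = getattr(self, f"forward_{platform}", self.forward_native)

    def forward(self, x):
        return self._forward(x)

    def forward_native(self, x):
        return ("native", x)

    def forward_cuda(self, x):
        return ("cuda", x)
```

For example, a `ToyCustomOp("tpu")` with no `forward_tpu()` defined falls back to `forward_native()`, and a disabled op uses `forward_native()` regardless of platform.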
!!! note
    The dispatching logic is not absolute because of class inheritance: a derived class might override this behavior.
Furthermore, vLLM decides whether to enable or disable a CustomOp via `compilation_config.custom_ops`. Specifically, if a CustomOp is not explicitly listed (i.e., the default config is used), it is enabled if `all` is present in `compilation_config.custom_ops`, and disabled if `none` is present.
!!! note
    Note that `all` and `none` cannot coexist in `compilation_config.custom_ops`.
By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` is appended to `compilation_config.custom_ops`; otherwise an `all` is appended. In other words, CustomOp is disabled on platforms that use Inductor as the default backend for `torch.compile` when running in torch compile mode. In this case, Inductor generates (fused) Triton kernels for the disabled custom ops.
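The default described above can be summarized in a short sketch. The function name and the mode strings are illustrative, not vLLM's actual API:

```python
# Illustrative sketch of the default custom_ops entry (helper name and
# mode strings are assumptions, not vLLM's actual API).
def default_custom_ops_entry(backend: str, mode: str) -> str:
    # With Inductor and compilation enabled, disable custom ops so that
    # Inductor can generate (fused) Triton kernels for them instead.
    if backend == "inductor" and mode != "NONE":
        return "none"
    return "all"
```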
!!! note
    For multi-modal models, vLLM force-enables some custom ops, such as MMEncoderAttention and ApplyRotaryEmb, so that the ViT part uses device-specific, deeply optimized kernels for better performance. We can also pass an `enforce_enable=True` param to the `__init__()` method of a CustomOp to force-enable it at the object level.

    Note that this `enforce_enable` mechanism will be removed after a separate `compilation_config` is added for the multi-modal part.
How to Customise Your Configuration for CustomOp
vLLM also offers users fine-grained control over which custom ops to enable or disable, by manually passing `--compilation_config.custom_ops '["..."]'` when launching a server.
For example:
- Use `--compilation_config.custom_ops '["all"]'` to enable all custom ops.
- Use `--compilation_config.custom_ops '["none"]'` to disable all custom ops.
- Use `--compilation_config.custom_ops '["all", "-op1"]'` to enable all custom ops except op1 (i.e., a `-` prefix means "disable").
- Use `--compilation_config.custom_ops '["none", "+op1", "+op2"]'` to enable only op1 and op2 (i.e., a `+` prefix means "enable").
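The `+`/`-` semantics can be sketched as a small resolver. This is a hypothetical helper for illustration, not vLLM's actual resolution function:

```python
# Hypothetical resolver for the "+op" / "-op" semantics described above
# (illustrative only, not vLLM's actual function).
def op_enabled(op_name: str, custom_ops: list[str]) -> bool:
    if "+" + op_name in custom_ops:
        return True   # explicitly enabled
    if "-" + op_name in custom_ops:
        return False  # explicitly disabled
    # Not mentioned explicitly: fall back to the blanket "all" / "none" entry.
    return "all" in custom_ops
```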
Types of Supported CustomOp in vLLM
1. Attention:
--8<-- "vllm/attention/layers/mm_encoder_attention.py:mm_encoder_attn"
--8<-- "vllm/model_executor/layers/mla.py:multi_head_latent_attention"
2. Activation:
--8<-- "vllm/model_executor/layers/activation.py:silu_and_mul"
--8<-- "vllm/model_executor/layers/activation.py:mul_and_silu"
--8<-- "vllm/model_executor/layers/activation.py:gelu_new"
--8<-- "vllm/model_executor/layers/activation.py:gelu_fast"
--8<-- "vllm/model_executor/layers/activation.py:quick_gelu"
--8<-- "vllm/model_executor/layers/activation.py:gelu_and_mul"
--8<-- "vllm/model_executor/layers/activation.py:gelu_and_mul_sparse"
--8<-- "vllm/model_executor/layers/activation.py:relu2"
--8<-- "vllm/model_executor/layers/activation.py:xielu"
--8<-- "vllm/model_executor/layers/activation.py:swigluoai_and_mul"
--8<-- "vllm/model_executor/layers/activation.py:fatrelu_and_mul"
3. MM-Conv:
--8<-- "vllm/model_executor/layers/conv.py:conv2d"
--8<-- "vllm/model_executor/layers/conv.py:conv3d"
4. Embedding:
--8<-- "vllm/model_executor/layers/vocab_parallel_embedding.py:vocab_parallel_embedding"
--8<-- "vllm/model_executor/layers/vocab_parallel_embedding.py:parallel_lm_head"
5. Linear:
--8<-- "vllm/model_executor/layers/linear.py:row_parallel_linear"
--8<-- "vllm/model_executor/layers/linear.py:column_parallel_linear"
--8<-- "vllm/model_executor/layers/linear.py:replicated_linear"
6. Logits Processor:
--8<-- "vllm/model_executor/layers/logits_processor.py:logits_processor"
7. Mamba:
--8<-- "vllm/model_executor/layers/mamba/mamba_mixer.py:mamba_mixer"
--8<-- "vllm/model_executor/layers/mamba/mamba_mixer2.py:mamba_mixer2"
--8<-- "vllm/model_executor/layers/mamba/mamba_mixer2.py:mixer2_gated_rms_norm"
--8<-- "vllm/model_executor/models/plamo2.py:plamo2_mamba_mixer"
--8<-- "vllm/model_executor/layers/mamba/short_conv.py:short_conv"
8. MoE:
--8<-- "vllm/model_executor/layers/fused_moe/layer.py:fused_moe"
--8<-- "vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py:modular_fused_moe"
--8<-- "vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py:unquantized_fused_moe"
--8<-- "vllm/model_executor/models/transformers/moe.py:transformers_fused_moe"
--8<-- "vllm/model_executor/layers/fused_moe/fused_moe.py:grouped_topk"
9. Norm:
--8<-- "vllm/model_executor/layers/layernorm.py:rms_norm"
--8<-- "vllm/model_executor/layers/layernorm.py:rms_norm_gated"
--8<-- "vllm/model_executor/layers/layernorm.py:gemma_rms_norm"
10. Quantization:
--8<-- "vllm/model_executor/layers/quantization/input_quant_fp8.py:quant_fp8"
11. Rope:
--8<-- "vllm/model_executor/layers/rotary_embedding/base.py:rotary_embedding"
--8<-- "vllm/model_executor/layers/rotary_embedding/dual_chunk_rope.py:dual_chunk_rotary_embedding"
--8<-- "vllm/model_executor/layers/rotary_embedding/common.py:apply_rotary_emb"
Guidelines for Implementing a New CustomOp
Implement a New CustomOp in vLLM
This part is a tutorial on how to implement a new CustomOp in vLLM.
Steps:

1. Implement a new op class that extends the `CustomOp` base class.
2. Add the `@CustomOp.register("op_name")` decorator on this op class to register it into the `CustomOp` system.
3. Implement the different `forward_xxx()` methods according to your needs.
Taking MMEncoderAttention as an example:
??? code
```python
@CustomOp.register("mm_encoder_attn")
class MMEncoderAttention(CustomOp):

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float | None = None,
        num_kv_heads: int | None = None,
        prefix: str = "",
        multimodal_config: MultiModalConfig | None = None,
    ) -> None:
        super().__init__()
        # Init...

    def forward_native(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call TORCH_SDPA implementation...
        ...

    def forward_cuda(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call FA or TORCH_SDPA implementation...
        ...

    def forward_cpu(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call TORCH_SDPA implementation...
        ...

    def forward_xpu(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call FA implementation...
        ...

    def forward_tpu(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        cu_seqlens: torch.Tensor | None = None,
        max_seqlen: torch.Tensor | None = None,  # Only used for Flash Attention
    ) -> torch.Tensor:
        # Call PALLAS implementation...
        ...
```
Register a New CustomOp in OOT Device Plugins
Currently, thanks to vLLM's hardware-plugin mechanism, various OOT device plugins have emerged that enable vLLM to run seamlessly on different hardware. You can find more details about this mechanism in Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU.
- Official device plugins: vllm-ascend (for Huawei Ascend NPU), vllm-spyre (for Spyre), vllm-gaudi (for Intel Gaudi), vllm-neuron (for AWS Neuron), vllm-meta (for Apple Silicon), etc.
- Non-official device plugins: vllm-metax (for MetaX GPU), vllm-kunlun (for Baidu Kunlun XPU), etc.
In this case, CustomOp enables these hardware manufacturers to seamlessly replace vLLM's operations with their deeply optimized, device-specific kernels at runtime, simply by registering an OOT CustomOp and implementing its forward_oot() method.
This part shows how to register an OOT CustomOp for a device plugin.
Taking MMEncoderAttention as an example:
1. Implement a `CustomMMEncoderAttention` class that extends `MMEncoderAttention` and implements its `forward_oot()` method.
2. Register your `CustomMMEncoderAttention` into vLLM to replace `MMEncoderAttention`.
??? code
```python
from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register_oot("MMEncoderAttention")
class CustomMMEncoderAttention(MMEncoderAttention):

    def __init__(...):
        super().__init__(...)

    def forward_oot(...):
        # Call optimized device-specific kernels.
        ...
```
In this case, a new item `{"MMEncoderAttention": CustomMMEncoderAttention}` is added to `op_registry_oot`. When initializing an MMEncoderAttention op object, if the class name (i.e., `MMEncoderAttention`) is present in the keys of `op_registry_oot`, vLLM replaces it with the registered class (i.e., `CustomMMEncoderAttention`) and instantiates that instead.
After that, when this MMEncoderAttention op is called, your `forward_oot()` will be invoked if the op is enabled. Thus, you get the expected performance on your hardware without directly modifying vLLM.
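The replace-at-construction behavior can be sketched with `__new__`. This is a simplified, self-contained illustration; vLLM's real mechanism differs in its details:

```python
# Simplified illustration of OOT class replacement at construction time
# (not vLLM's actual implementation).
class CustomOp:
    op_registry_oot: dict[str, type] = {}

    def __new__(cls, *args, **kwargs):
        # If an OOT class is registered under this class's name,
        # instantiate the registered OOT class instead.
        oot_cls = CustomOp.op_registry_oot.get(cls.__name__)
        if oot_cls is not None and oot_cls is not cls:
            return super().__new__(oot_cls)
        return super().__new__(cls)


class MMEncoderAttention(CustomOp):
    def __init__(self):
        self.kind = "base"


class CustomMMEncoderAttention(MMEncoderAttention):
    def __init__(self):
        self.kind = "oot"


CustomOp.op_registry_oot["MMEncoderAttention"] = CustomMMEncoderAttention
```

With this registration in place, constructing `MMEncoderAttention()` yields a `CustomMMEncoderAttention` instance, so its overridden methods (including `forward_oot()`) take effect transparently.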
In addition, you can register all your CustomOps in one place for better management.
??? code
```python
from vllm.model_executor.custom_op import CustomOp

REGISTERED_CUSTOM_OPS = {
    "CustomOP1": YourCustomOp1,
    "CustomOP2": YourCustomOp2,
    "CustomOP3": YourCustomOp3,
}

for op_name, op_cls in REGISTERED_CUSTOM_OPS.items():
    CustomOp.register_oot(_decorated_op_cls=op_cls, name=op_name)
```