[doc] update the code to add models (#10603)

Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-05-31 16:47:11 +08:00 · 2024-11-24 11:21:40 -08:00 · 2024-11-24 11:21:40 -08:00 · e4fbb14414
commit e4fbb14414
parent c055747867
1 changed files with 53 additions and 24 deletions
--- a/docs/source/models/adding_model.rst
+++ b/docs/source/models/adding_model.rst
@ -38,41 +38,70 @@ For instance, vLLM's `OPT model <https://github.com/vllm-project/vllm/blob/main/
    When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.
-2. Rewrite the :code:`forward` methods
+2. Make your code compatible with vLLM
 --------------------------------------
-Next, you need to rewrite the :meth:`~torch.nn.Module.forward` method of your model by following these steps:
+To ensure compatibility with vLLM, your model must meet the following requirements:
-1. Remove any unnecessary code, such as the code only used for training.
+Initialization Code
-2. Change the input parameters:
+^^^^^^^^^^^^^^^^^^^
-.. code-block:: diff
+All vLLM modules within the model must include a ``prefix`` argument in their constructor. This ``prefix`` is typically the full name of the module in the model's state dictionary and is crucial for:
-      def forward(
+* Runtime support: vLLM's attention operators are registered in a model's state by their full names. Each attention operator must have a unique prefix as its layer name to avoid conflicts.
-          self,
+* Non-uniform quantization support: A quantized checkpoint can selectively quantize certain layers while keeping others in full precision. By providing the ``prefix`` during initialization, vLLM can match the current layer's ``prefix`` with the quantization configuration to determine if the layer should be initialized in quantized mode.
          input_ids: torch.Tensor,
    -     attention_mask: Optional[torch.Tensor] = None,
    -     position_ids: Optional[torch.LongTensor] = None,
    -     past_key_values: Optional[List[torch.FloatTensor]] = None,
    -     inputs_embeds: Optional[torch.FloatTensor] = None,
    -     labels: Optional[torch.LongTensor] = None,
    -     use_cache: Optional[bool] = None,
    -     output_attentions: Optional[bool] = None,
    -     output_hidden_states: Optional[bool] = None,
    -     return_dict: Optional[bool] = None,
    - ) -> Union[Tuple, CausalLMOutputWithPast]:
    +     positions: torch.Tensor,
    +     kv_caches: List[torch.Tensor],
    +     attn_metadata: AttentionMetadata,
    + ) -> Optional[SamplerOutput]:
-1. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
+The initialization code should look like this:
-2. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture.
+
 .. code-block:: python
    from torch import nn
    from vllm.config import VllmConfig
    from vllm.attention import Attention
    class MyAttention(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str):
            super().__init__()
            self.attn = Attention(prefix=f"{prefix}.attn")
    class MyDecoderLayer(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str):
            super().__init__()
            self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
    class MyModel(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str):
            super().__init__()
            self.layers = nn.ModuleList(
                [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
            )
    class MyModelForCausalLM(nn.Module):
        def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
            super().__init__()
            self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
 Computation Code
 ^^^^^^^^^^^^^^^^
 Rewrite the :meth:`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat ``input_ids`` and ``positions`` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
 .. code-block:: python
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        kv_caches: List[torch.Tensor],
        attn_metadata: AttentionMetadata,
    ) -> torch.Tensor:
        ...
 .. note::
    Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
    If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
 For reference, check out the `LLAMA model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py>`__. vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out the `vLLM models <https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models>`__ directory for more examples.
 3. (Optional) Implement tensor parallelism and quantization support
 -------------------------------------------------------------------