[Docs] Add some details about what the MoE block needs for the Transformers backend (#28588)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor 2025-11-13 03:12:14 +00:00 committed by GitHub
parent 8832fff972
commit 3226283461

@@ -75,7 +75,12 @@ This section details the necessary modifications to make to a Transformers compa
To make your model compatible with the Transformers backend, it needs:
1. `kwargs` passed down through all modules from `MyModel` to `MyAttention`.
    - If your model is encoder-only:
        1. Add `is_causal = False` to `MyAttention`.
    - If your model is a mixture-of-experts (MoE) model:
        1. Your sparse MoE block must have an attribute called `experts`.
        2. The class of `experts` (`MyExperts`) must inherit from `nn.ModuleList`.
        3. `MyExperts.forward` must accept `hidden_states`, `top_k_index`, and `top_k_weights` (see the sketch after this list).
2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
3. `MyModel` must contain `_supports_attention_backend = True`.
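
To make the MoE requirements concrete, below is a minimal sketch of an `experts` module with the expected `forward` signature. It is illustrative only and not part of this commit: the per-expert `MyExpertMLP`, the `config.num_experts` / `config.hidden_size` / `config.intermediate_size` attribute names, and the `(num_tokens, hidden_size)` / `(num_tokens, top_k)` tensor shapes are all assumptions.

```python
import torch
from torch import nn


# Hypothetical per-expert MLP; in a real model this is whatever expert class you already have.
class MyExpertMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.up = nn.Linear(config.hidden_size, config.intermediate_size)
        self.down = nn.Linear(config.intermediate_size, config.hidden_size)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))


class MyExperts(nn.ModuleList):
    def __init__(self, config):
        super().__init__(MyExpertMLP(config) for _ in range(config.num_experts))

    def forward(self, hidden_states, top_k_index, top_k_weights):
        # Assumed shapes: hidden_states (num_tokens, hidden_size),
        # top_k_index and top_k_weights (num_tokens, top_k) from the router.
        top_k_weights = top_k_weights.to(hidden_states.dtype)
        output = torch.zeros_like(hidden_states)
        for expert_id, expert in enumerate(self):
            # Find every (token, slot) pair routed to this expert.
            token_idx, k_idx = torch.where(top_k_index == expert_id)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(hidden_states[token_idx])
            weight = top_k_weights[token_idx, k_idx].unsqueeze(-1)
            output.index_add_(0, token_idx, expert_out * weight)
        return output
```

The token loop is only a readable reference; what the backend relies on is the `experts` attribute name, the `nn.ModuleList` base class, and the `forward` signature, presumably so that vLLM can swap in its own optimized MoE implementation.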
@@ -102,6 +107,23 @@ class MyAttention(nn.Module):
        )
        ...

# Only do this for mixture-of-experts models
class MyExperts(nn.ModuleList):
    def forward(self, hidden_states, top_k_index, top_k_weights):
        ...

# Only do this for mixture-of-experts models
class MySparseMoEBlock(nn.Module):
    def __init__(self, config):
        ...
        self.experts = MyExperts(config)
        ...

    def forward(self, hidden_states: torch.Tensor):
        ...
        hidden_states = self.experts(hidden_states, top_k_index, top_k_weights)
        ...

class MyModel(PreTrainedModel):
    _supports_attention_backend = True
```
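
As a quick end-to-end check, a model whose Transformers implementation satisfies the list above should load through the Transformers backend roughly like this. This is a sketch, not part of the commit: the checkpoint path is a placeholder, and `model_impl="transformers"` / `trust_remote_code=True` assume a recent vLLM and a model shipped as custom modeling code.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/my-model",   # placeholder checkpoint path
    model_impl="transformers",  # explicitly select the Transformers backend
    trust_remote_code=True,     # needed if MyModel is shipped as custom code on the Hub
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```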