[Docs] Enable fail_on_warning for the docs build in CI (#25580)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor 2025-09-24 20:30:33 +01:00 committed by GitHub
parent f84a472a03
commit 8c853050e7
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
20 changed files with 81 additions and 87 deletions

View File

@@ -13,6 +13,7 @@ build:
 mkdocs:
   configuration: mkdocs.yaml
+  fail_on_warning: true

 # Optionally declare the Python requirements required to build your docs
 python:

View File

@@ -9,7 +9,7 @@ NixlConnector is a high-performance KV cache transfer connector for vLLM's disag
 Install the NIXL library: `uv pip install nixl`, as a quick start.
 - Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions
-- The specified required NIXL version can be found in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files
+- The specified required NIXL version can be found in [requirements/kv_connectors.txt](gh-file:requirements/kv_connectors.txt) and other relevant config files

 ### Transport Configuration
@@ -154,6 +154,6 @@ python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \
 Refer to these example scripts in the vLLM repository:

-- [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
-- [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
-- [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py)
+- [run_accuracy_test.sh](gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
+- [toy_proxy_server.py](gh-file:tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
+- [test_accuracy.py](gh-file:tests/v1/kv_connector/nixl_integration/test_accuracy.py)

View File

@@ -32,8 +32,9 @@ def auto_mock(module, attr, max_mocks=50):
     for _ in range(max_mocks):
         try:
             # First treat attr as an attr, then as a submodule
-            return getattr(importlib.import_module(module), attr,
-                           importlib.import_module(f"{module}.{attr}"))
+            with patch("importlib.metadata.version", return_value="0.0.0"):
+                return getattr(importlib.import_module(module), attr,
+                               importlib.import_module(f"{module}.{attr}"))
         except importlib.metadata.PackageNotFoundError as e:
             raise e
         except ModuleNotFoundError as e:
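For context on this hunk in the docs build hook: modules imported while generating API docs may call `importlib.metadata.version(...)` at import time, which raises `PackageNotFoundError` when the distribution is only mocked. A minimal standalone sketch of the same trick (the module name in the usage comment is hypothetical):

```python
# Minimal sketch, not vLLM's mkdocs hook: import a module while stubbing
# importlib.metadata.version so missing optional packages look installed.
import importlib
import importlib.metadata
from unittest.mock import patch


def import_with_fake_versions(module_name: str):
    # Any importlib.metadata.version("pkg") call made during the import
    # (e.g. in decorators) returns "0.0.0" instead of raising
    # importlib.metadata.PackageNotFoundError.
    with patch("importlib.metadata.version", return_value="0.0.0"):
        return importlib.import_module(module_name)


# Example usage (hypothetical module name):
# mod = import_with_fake_versions("some.optional.integration")
```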

View File

@@ -4,7 +4,7 @@ vLLM provides first-class support for generative models, which covers most of LL
 In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
-which are then passed through [Sampler][vllm.model_executor.layers.sampler.Sampler] to obtain the final text.
+which are then passed through [Sampler][vllm.v1.sample.sampler.Sampler] to obtain the final text.

 ## Configuration

View File

@@ -29,7 +29,7 @@ _*Vision-language models currently accept only image inputs. Support for video i
 If the Transformers model implementation follows all the steps in [writing a custom model](#writing-custom-models) then, when used with the Transformers backend, it will be compatible with the following features of vLLM:

-- All the features listed in the [compatibility matrix](../features/compatibility_matrix.md#feature-x-feature)
+- All the features listed in the [compatibility matrix](../features/README.md#feature-x-feature)
 - Any combination of the following vLLM parallelisation schemes:
     - Pipeline parallel
     - Tensor parallel

View File

@@ -1,6 +1,6 @@
 # Using vLLM

-First, vLLM must be [installed](../getting_started/installation) for your chosen device in either a Python or Docker environment.
+First, vLLM must be [installed](../getting_started/installation/) for your chosen device in either a Python or Docker environment.

 Then, vLLM supports the following usage patterns:

View File

@@ -11,9 +11,9 @@ vLLM performance and metrics.
 ## Dashboard Descriptions

-- **[performance_statistics.json](./performance_statistics.json)**: Tracks performance metrics including latency and
+- **performance_statistics.json**: Tracks performance metrics including latency and
   throughput for your vLLM service.
-- **[query_statistics.json](./query_statistics.json)**: Tracks query performance, request volume, and key
+- **query_statistics.json**: Tracks query performance, request volume, and key
   performance indicators for your vLLM service.

 ## Deployment Options

View File

@@ -21,9 +21,9 @@ deployment methods:
 ## Dashboard Descriptions

-- **[performance_statistics.yaml](./performance_statistics.yaml)**: Performance metrics with aggregated latency
+- **performance_statistics.yaml**: Performance metrics with aggregated latency
   statistics
-- **[query_statistics.yaml](./query_statistics.yaml)**: Query performance and deployment metrics
+- **query_statistics.yaml**: Query performance and deployment metrics

 ## Deployment Options

View File

@@ -18,12 +18,14 @@ def _correct_attn_cp_out_kernel(outputs_ptr, new_output_ptr, lses_ptr,
     final attention output.

     Args:
-        output: [ B, H, D ]
-        lses : [ N, B, H ]
-        cp, batch, q_heads, v_head_dim
-    Return:
-        output: [ B, H, D ]
-        lse : [ B, H ]
+        outputs_ptr (triton.PointerType):
+            Pointer to input tensor of shape [ B, H, D ]
+        lses_ptr (triton.PointerType):
+            Pointer to input tensor of shape [ N, B, H ]
+        new_output_ptr (triton.PointerType):
+            Pointer to output tensor of shape [ B, H, D ]
+        vlse_ptr (triton.PointerType):
+            Pointer to output tensor of shape [ B, H ]
     """
     batch_idx = tl.program_id(axis=0).to(tl.int64)
     head_idx = tl.program_id(axis=1).to(tl.int64)
@@ -81,19 +83,19 @@ class CPTritonContext:
         self.inner_kernel[grid](*regular_args)


-def correct_attn_out(out: torch.Tensor, lses: torch.Tensor, cp_rank: int,
-                     ctx: CPTritonContext):
-    """
-    Apply the all-gathered lses to correct each local rank's attention
-    output. we still need perform a cross-rank reduction to obtain the
-    final attention output.
-
-    Args:
-        output: [ B, H, D ]
-        lses : [ N, B, H ]
-    Return:
-        output: [ B, H, D ]
-        lse : [ B, H ]
+def correct_attn_out(
+        out: torch.Tensor, lses: torch.Tensor, cp_rank: int,
+        ctx: CPTritonContext) -> tuple[torch.Tensor, torch.Tensor]:
+    """Correct the attention output using the all-gathered lses.
+
+    Args:
+        out: Tensor of shape [ B, H, D ]
+        lses: Tensor of shape [ N, B, H ]
+        cp_rank: Current rank in the context-parallel group
+        ctx: Triton context to avoid recompilation
+
+    Returns:
+        Tuple of (out, lse) with corrected attention and final log-sum-exp.
     """
     if ctx is None:
         ctx = CPTritonContext()
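For intuition about what these kernels compute, here is a hedged plain-PyTorch reference of the underlying math (an illustration only, not the Triton code; per the docstring, vLLM rescales the local output here and performs the cross-rank reduction separately): each rank's partial attention output is weighted by `exp(lse_rank - logsumexp(lses))`.

```python
# Hedged reference sketch (plain PyTorch, not the Triton kernel): combine
# per-rank partial attention outputs using their log-sum-exp (lse) terms.
import torch


def combine_partial_attention(outs: torch.Tensor, lses: torch.Tensor):
    # outs: [N, B, H, D] partial outputs from N context-parallel ranks
    # lses: [N, B, H]    per-rank log-sum-exp of the attention scores
    lse = torch.logsumexp(lses, dim=0)               # [B, H] global lse
    weights = torch.exp(lses - lse.unsqueeze(0))     # [N, B, H] rank weights
    out = (weights.unsqueeze(-1) * outs).sum(dim=0)  # [B, H, D] final output
    return out, lse
```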

View File

@@ -287,8 +287,8 @@ class EncoderDecoderInputs(TypedDict):
 SingletonInputs = Union[TokenInputs, EmbedsInputs, "MultiModalInputs"]
 """
 A processed [`SingletonPrompt`][vllm.inputs.data.SingletonPrompt] which can be
-passed to [`vllm.sequence.Sequence`][].
+passed to [`Sequence`][collections.abc.Sequence].
 """

 ProcessorInputs = Union[DecoderOnlyInputs, EncoderDecoderInputs]

View File

@@ -57,7 +57,7 @@ else:
     FusedMoEPermuteExpertsUnpermute = None  # type: ignore
     FusedMoEPrepareAndFinalize = None  # type: ignore

-    def eplb_map_to_physical_and_record(
+    def _eplb_map_to_physical_and_record(
             topk_ids: torch.Tensor, expert_load_view: torch.Tensor,
             logical_to_physical_map: torch.Tensor,
             logical_replica_count: torch.Tensor,
@@ -65,6 +65,7 @@ else:
         # CPU fallback: no EPLB so just return as is
         return topk_ids

+    eplb_map_to_physical_and_record = _eplb_map_to_physical_and_record

 if is_rocm_aiter_moe_enabled():
     from vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe import (  # noqa: E501
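The rename plus re-binding above is the usual fallback-and-alias pattern; a tiny sketch with hypothetical names (not vLLM's EPLB code):

```python
# Hypothetical illustration of the fallback-plus-alias pattern.
def _map_ids_fallback(ids):
    # CPU / no-EPLB fallback: return the ids unchanged.
    return ids


# Bind the public name to the fallback; a backend-specific branch elsewhere
# can rebind it to an accelerated implementation.
map_ids = _map_ids_fallback
```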
@@ -807,12 +808,11 @@ def maybe_roundup_hidden_size(
     if necessary.

     Args:
-        hidden_size(int): Layer hidden-size
+        hidden_size: Layer hidden-size
         act_dtype: Data type of the layer activations.
-        quant_config(FusedMoEQuantConfig): Fused MoE quantization configuration.
-        moe_parallel_config(FusedMoEParallelConfig): Fused MoE parallelization
-            strategy configuration.
+        quant_config: Fused MoE quantization configuration.
+        moe_parallel_config: Fused MoE parallelization strategy configuration.

     Return:
         Rounded up hidden_size if rounding up is required based on the configs.
         Original hidden size otherwise.
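Where rounding is actually required, it is the usual align-up computation; a hedged sketch in which the alignment values are stand-ins for whatever the quantization and parallel configs dictate:

```python
# Hedged sketch of a hidden-size round-up; the alignment values below are
# illustrative, not taken from vLLM's quantization or parallel configs.
def roundup_hidden_size(hidden_size: int, alignment: int) -> int:
    # Round hidden_size up to the next multiple of `alignment`.
    return ((hidden_size + alignment - 1) // alignment) * alignment


assert roundup_hidden_size(4096, 256) == 4096  # already aligned
assert roundup_hidden_size(5000, 256) == 5120  # rounded up
```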

View File

@@ -13,7 +13,7 @@ from collections import defaultdict
 from collections.abc import Generator
 from contextlib import contextmanager
 from pathlib import Path
-from typing import Any, Callable, Optional, Union
+from typing import IO, Any, Callable, Optional, Union

 import filelock
 import huggingface_hub.constants
@@ -102,7 +102,7 @@ def get_lock(model_name_or_path: Union[str, Path],
 @contextmanager
 def atomic_writer(filepath: Union[str, Path],
                   mode: str = 'w',
-                  encoding: Optional[str] = None):
+                  encoding: Optional[str] = None) -> Generator[IO]:
     """
     Context manager that provides an atomic file writing routine.
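For readers unfamiliar with the pattern, a hedged sketch of what an atomic writer typically does (an illustration, not vLLM's `atomic_writer`): write to a temporary file in the same directory, then atomically replace the target so readers never observe a partially written file.

```python
# Hedged sketch of an atomic-write context manager (illustrative only).
import os
import tempfile
from collections.abc import Generator
from contextlib import contextmanager
from typing import IO


@contextmanager
def atomic_write(path: str, mode: str = "w") -> Generator[IO, None, None]:
    # Create the temp file in the target's directory so os.replace stays
    # atomic (a rename across filesystems is not).
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, mode) as f:
            yield f
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise


# Usage: the target file appears only after the block completes successfully.
# with atomic_write("settings.json") as f:
#     f.write("{}")
```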

View File

@@ -1445,14 +1445,18 @@ class Qwen3VLForConditionalGeneration(nn.Module, SupportsMultiModal,
                 **NOTE**: If mrope is enabled (default setting for Qwen3VL
                 opensource models), the shape will be `(3, seq_len)`,
                 otherwise it will be `(seq_len,).
-            pixel_values: Pixel values to be fed to a model.
-                `None` if no images are passed.
-            image_grid_thw: Tensor `(n_images, 3)` of image 3D grid in LLM.
-                `None` if no images are passed.
-            pixel_values_videos: Pixel values of videos to be fed to a model.
-                `None` if no videos are passed.
-            video_grid_thw: Tensor `(n_videos, 3)` of video 3D grid in LLM.
-                `None` if no videos are passed.
+            intermediate_tensors: Intermediate tensors from previous pipeline
+                stages.
+            inputs_embeds: Pre-computed input embeddings.
+            **kwargs: Additional keyword arguments including:
+                - pixel_values: Pixel values to be fed to a model.
+                  `None` if no images are passed.
+                - image_grid_thw: Tensor `(n_images, 3)` of image 3D grid in
+                  LLM. `None` if no images are passed.
+                - pixel_values_videos: Pixel values of videos to be fed to a
+                  model. `None` if no videos are passed.
+                - video_grid_thw: Tensor `(n_videos, 3)` of video 3D grid in
+                  LLM. `None` if no videos are passed.
         """
         if intermediate_tensors is not None:

View File

@@ -944,11 +944,10 @@ class Zamba2ForCausalLM(nn.Module, HasInnerState, IsHybrid):
         hidden_states: torch.Tensor,
     ) -> Optional[torch.Tensor]:
         """Compute logits for next token prediction.

         Args:
             hidden_states: Hidden states from model forward pass
-            sampling_metadata: Metadata for sampling process

         Returns:
             Logits for next token prediction
         """

View File

@@ -278,11 +278,11 @@ class GraniteReasoningParser(ReasoningParser):
         content and normal (response) content.

         Args:
-            delta_text (str): Text to consider and parse content from.
-            reasoning_content (str): reasoning content from current_text.
-            response_content (str): response content from current_text.
-            current_text (str): The full previous + delta text.
-            response_seq_len(str): Len of the complete response sequence used.
+            delta_text: Text to consider and parse content from.
+            reasoning_content: reasoning content from current_text.
+            response_content: response content from current_text.
+            current_text: The full previous + delta text.
+            response_seq_len: Len of the complete response sequence used.

         Returns:
             DeltaMessage: Message containing the parsed content.

View File

@@ -27,36 +27,23 @@ class RadioConfig(PretrainedConfig):
     specified arguments, defining the model architecture.

     Args:
-        model_name (`str`, *optional*, defaults to "vit_base_patch16_224"):
-            Name of the vision transformer model (e.g., "vit_base_patch16_224").
-            Used to determine architecture dimensions from
-            `VIT_TIMM_DIM_BY_NAME`.
-        image_size (`int`, *optional*, defaults to 224):
-            The size (resolution) of each image.
-        patch_size (`int`, *optional*, defaults to 16):
-            The size (resolution) of each patch.
-        qkv_bias (`bool`, *optional*, defaults to True):
-            Whether to add a bias to the queries, keys and values.
-        qk_normalization (`bool`, *optional*, defaults to False):
-            Whether to apply normalization to queries and keys.
-        norm_type (`str`, *optional*, defaults to "layer_norm"):
-            The normalization type to use.
-        layer_norm_eps (`float`, *optional*, defaults to 1e-6):
-            The epsilon used by the layer normalization layers.
-        initializer_factor (`float`, *optional*, defaults to 1.0):
-            A factor for initializing all weight matrices.
-        hidden_act (`str`, *optional*, defaults to "gelu"):
-            The non-linear activation function in the encoder.
-        max_img_size (`int`, *optional*, defaults to 2048):
-            Maximum image size for position embeddings.
-        norm_mean (`tuple` or `list`, *optional*,
-            defaults to (0.48145466, 0.4578275, 0.40821073)):
-            Mean values for image normalization (RGB channels).
-        norm_std (`tuple` or `list`, *optional*,
-            defaults to (0.26862954, 0.26130258, 0.27577711)):
-            Standard deviation values for image normalization (RGB channels).
-        reg_tokens (`int`, *optional*):
-            Number of register tokens to use.
+        model_name: Name of the vision transformer model
+            (e.g., "vit_base_patch16_224"). Used to determine architecture
+            dimensions from `VIT_TIMM_DIM_BY_NAME`.
+        image_size: The size (resolution) of each image.
+        patch_size: The size (resolution) of each patch.
+        qkv_bias: Whether to add a bias to the queries, keys and values.
+        qk_normalization: Whether to apply normalization to queries and keys.
+        norm_type: The normalization type to use.
+        layer_norm_eps: The epsilon used by the layer normalization layers.
+        initializer_factor: A factor for initializing all weight matrices.
+        hidden_act: The non-linear activation function in the encoder.
+        max_img_size: Maximum image size for position embeddings.
+        norm_mean: Mean values for image normalization (RGB channels).
+            Defaults to (0.48145466, 0.4578275, 0.40821073)).
+        norm_std: Standard deviation values for image normalization
+            (RGB channels). Defaults to (0.26862954, 0.26130258, 0.27577711)).
+        reg_tokens: Number of register tokens to use.
     """

     model_type = "radio"

View File

@@ -27,7 +27,7 @@ def try_get_class_from_dynamic_module(
     **kwargs,
 ) -> Optional[type]:
     """
-    As [transformers.dynamic_module_utils.get_class_from_dynamic_module][],
+    As `transformers.dynamic_module_utils.get_class_from_dynamic_module`,
     but ignoring any errors.
     """
     try:
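A hedged sketch of the wrapper this docstring describes (illustrative; argument handling simplified): forward everything to Transformers' loader and return `None` on any failure instead of raising.

```python
# Illustrative sketch, not vLLM's exact code: swallow all errors from
# Transformers' dynamic module loader and return None instead.
from typing import Optional


def try_get_class_from_dynamic_module(*args, **kwargs) -> Optional[type]:
    try:
        from transformers.dynamic_module_utils import (
            get_class_from_dynamic_module)
        return get_class_from_dynamic_module(*args, **kwargs)
    except Exception:
        return None
```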
