mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-03-16 10:07:11 +08:00
[Doc]: fix typos in various files (#28945)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
parent a4511e38db
commit 7ed27f3cb5
@@ -4,7 +4,7 @@ The purpose of this document is to provide an overview of the various MoE kernel
 
 ## Fused MoE Modular All2All backends
 
-There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalize` sub-classes provide an interface for each all2all backend.
+There are a number of all2all communication backends that are used to implement expert parallelism (EP) for the `FusedMoE` layer. The different `FusedMoEPrepareAndFinalize` subclasses provide an interface for each all2all backend.
 
 The following table describes the relevant features of each backend, i.e. activation format, supported quantization schemes and async support.
 
@@ -68,7 +68,7 @@ Modular kernels are supported by the following `FusedMoEMethodBase` classes.
 
 ## Fused MoE Experts Kernels
 
-The are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts] function. Many have modular kernel adapters so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.
+There are a number of MoE experts kernel implementations for different quantization types and architectures. Most follow the general API of the base Triton [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts] function. Many have modular kernel adapters so they can be used with compatible all2all backends. This table lists each experts kernel and its particular properties.
 
 Each kernel must be provided with one of the supported input activation formats. Some flavors of kernels support both standard and batched formats through different entry points, e.g. `TritonExperts` and `BatchedTritonExperts`. Batched format kernels are currently only needed for matching with certain all2all backends, e.g. `pplx`, `DeepEPLLPrepareAndFinalize`.
 

@@ -49,7 +49,7 @@ Every plugin has three parts:
 
 - **Platform plugins** (with group name `vllm.platform_plugins`): The primary use case for these plugins is to register custom, out-of-the-tree platforms into vLLM. The plugin function should return `None` when the platform is not supported in the current environment, or the platform class's fully qualified name when the platform is supported.
 
-- **IO Processor plugins** (with group name `vllm.io_processor_plugins`): The primary use case for these plugins is to register custom pre/post processing of the model prompt and model output for pooling models. The plugin function returns the IOProcessor's class fully qualified name.
+- **IO Processor plugins** (with group name `vllm.io_processor_plugins`): The primary use case for these plugins is to register custom pre-/post-processing of the model prompt and model output for pooling models. The plugin function returns the IOProcessor's class fully qualified name.
 
 - **Stat logger plugins** (with group name `vllm.stat_logger_plugins`): The primary use case for these plugins is to register custom, out-of-the-tree loggers into vLLM. The entry point should be a class that subclasses StatLoggerBase.
 

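Reviewer note on the hunk above: the platform plugin contract (return `None` when unsupported, otherwise the platform class's fully qualified name) can be sketched as follows. This is a minimal illustration, not vLLM's own code. The module name `my_accelerator_sdk`, the class path `my_vllm_plugin.platform.MyPlatform`, and the `required_module` parameter are all hypothetical and exist only to make the sketch testable.

```python
import importlib.util
from typing import Optional

# Hypothetical fully qualified class name, used only for illustration.
PLATFORM_CLASS = "my_vllm_plugin.platform.MyPlatform"


def register_platform(required_module: str = "my_accelerator_sdk") -> Optional[str]:
    """Sketch of a platform plugin function.

    Returns None when the (hypothetical) vendor SDK is not importable in
    the current environment, otherwise the platform class's fully
    qualified name as a string.
    """
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(required_module) is None:
        return None
    return PLATFORM_CLASS
```

A function like this would presumably be exposed through the `vllm.platform_plugins` entry point group in the plugin package's metadata. Returning a string rather than the class itself keeps the check cheap, since the platform module is not imported during discovery.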
@@ -306,7 +306,7 @@ As examples, we provide some ready-to-use quantized mixed precision model to sho
 
 ### 2. inference the quantized mixed precision model in vLLM
 
-Models quantized with AMD Quark using mixed precision can natively be reload in vLLM, and e.g. evaluated using lm-evaluation-harness as follow:
+Models quantized with AMD Quark using mixed precision can natively be reload in vLLM, and e.g. evaluated using lm-evaluation-harness as follows:
 
 ```bash
 lm_eval --model vllm \

@@ -46,7 +46,7 @@ Navigate to [`http://localhost:3000`](http://localhost:3000). Log in with the de
 
 Navigate to [`http://localhost:3000/connections/datasources/new`](http://localhost:3000/connections/datasources/new) and select Prometheus.
 
-On Prometheus configuration page, we need to add the `Prometheus Server URL` in `Connection`. For this setup, Grafana and Prometheus are running in separate containers, but Docker creates DNS name for each containers. You can just use `http://prometheus:9090`.
+On Prometheus configuration page, we need to add the `Prometheus Server URL` in `Connection`. For this setup, Grafana and Prometheus are running in separate containers, but Docker creates DNS name for each container. You can just use `http://prometheus:9090`.
 
 Click `Save & Test`. You should get a green check saying "Successfully queried the Prometheus API.".
 

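Reviewer note on the hunk above: to make the role of the `Prometheus Server URL` concrete, the sketch below builds an instant-query URL against the Prometheus HTTP API (`/api/v1/query`) from that base URL. It only constructs the string and does not contact any server. The helper name `build_query_url` is hypothetical, and the hostname `prometheus` is assumed to resolve only from inside the Docker network described in the hunk.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(server_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL from a server base URL."""
    # Normalise the base so urljoin appends the path instead of replacing it.
    base = server_url.rstrip("/") + "/"
    # urlencode escapes the PromQL expression for use as a query parameter.
    return urljoin(base, "api/v1/query") + "?" + urlencode({"query": promql})
```

For example, `build_query_url("http://prometheus:9090", "up")` produces a URL for the `up` metric on the same server address that Grafana is configured with.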
@@ -1500,7 +1500,7 @@ class EngineArgs:
         # Local DP rank = 1, use pure-external LB.
         if data_parallel_external_lb:
             assert self.data_parallel_rank is not None, (
-                "data_parallel_rank or node_rank must be spefified if "
+                "data_parallel_rank or node_rank must be specified if "
                 "data_parallel_external_lb is enable."
             )
             assert self.data_parallel_size_local in (1, None), (

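Reviewer note on the hunk above: the two conditions checked when external load balancing is on can be expressed as a standalone function, which makes them easy to test in isolation. This is a sketch under assumptions, not the `EngineArgs` implementation: the function name `check_external_lb_args` is hypothetical, it raises `ValueError` instead of using `assert`, and the second message is invented because the original one is cut off in the hunk.

```python
from typing import Optional


def check_external_lb_args(
    data_parallel_rank: Optional[int],
    data_parallel_size_local: Optional[int],
) -> None:
    """Validate the two conditions visible in the hunk for external LB."""
    # Condition 1: a rank must be provided explicitly.
    if data_parallel_rank is None:
        raise ValueError(
            "data_parallel_rank or node_rank must be specified if "
            "data_parallel_external_lb is enabled."
        )
    # Condition 2: the local DP size must be 1 or left unset.
    # Hypothetical wording; the original message is not shown in the hunk.
    if data_parallel_size_local not in (1, None):
        raise ValueError(
            "data_parallel_size_local must be 1 or unset when "
            "data_parallel_external_lb is enabled."
        )
```

Raising an exception rather than asserting is a design choice for this sketch only: assertions are skipped when Python runs with `-O`, so user-facing argument validation is often written this way.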
@@ -1261,7 +1261,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
     # MoE routing strategy selector.
     # See `RoutingSimulator.get_available_strategies()` # for available
     # strategies.
-    # Cutstom routing strategies can be registered by
+    # Custom routing strategies can be registered by
     # RoutingSimulator.register_strategy()
     # Note: custom strategies may not produce correct model outputs
     "VLLM_MOE_ROUTING_SIMULATION_STRATEGY": lambda: os.environ.get(

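Reviewer note on the hunk above: the `dict[str, Callable[[], Any]]` shape means each environment variable is read when its lambda is called, not when the dictionary is built. The sketch below shows that pattern in isolation. The variable name `EXAMPLE_ROUTING_STRATEGY`, the default `"fallback"`, and the helper `read_env` are hypothetical; the real default for `VLLM_MOE_ROUTING_SIMULATION_STRATEGY` is not visible in the hunk and is not reproduced here.

```python
import os
from typing import Any, Callable

# Each value is a zero-argument callable, so the environment is consulted
# at call time rather than at import time.
environment_variables: dict[str, Callable[[], Any]] = {
    # Hypothetical variable and default, for illustration only.
    "EXAMPLE_ROUTING_STRATEGY": lambda: os.environ.get(
        "EXAMPLE_ROUTING_STRATEGY", "fallback"
    ),
}


def read_env(name: str) -> Any:
    """Evaluate the registered getter for one variable."""
    return environment_variables[name]()
```

Because the lookup is deferred, changing `os.environ` after the module is imported is still reflected the next time `read_env` is called.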