[Doc] Use gh-pr and gh-issue everywhere we can in the docs (#20564)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
commit 45877ef740 (parent 6e4bef1bea)

@@ -7,9 +7,8 @@ release in CI/CD. It is standard practice to submit a PR to update the
 PyTorch version as early as possible when a new [PyTorch stable
 release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
 This process is non-trivial due to the gap between PyTorch
-releases. Using [#16859](https://github.com/vllm-project/vllm/pull/16859) as
-an example, this document outlines common steps to achieve this update along with
-a list of potential issues and how to address them.
+releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this
+update along with a list of potential issues and how to address them.

 ## Test PyTorch release candidates (RCs)

@@ -68,7 +67,7 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
 it doesn't populate the cache, so re-running it to warm up the cache
 is ineffective.

-While ongoing efforts like [#17419](https://github.com/vllm-project/vllm/issues/17419)
+While ongoing efforts like [#17419](gh-issue:17419)
 address the long build time at its source, the current workaround is to set VLLM_CI_BRANCH
 to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
 when manually triggering a build on Buildkite. This branch accomplishes two things:

@@ -129,6 +128,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi
 for different platforms in vLLM CI/CD allows us to selectively choose
 which platforms to update. For instance, updating XPU requires the corresponding
 release from https://github.com/intel/intel-extension-for-pytorch by Intel.
-While https://github.com/vllm-project/vllm/pull/16859 updated vLLM to PyTorch
-2.7.0 on CPU, CUDA, and ROCm, https://github.com/vllm-project/vllm/pull/17444
-completed the update for XPU.
+While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
+<gh-pr:17444> completed the update for XPU.

@@ -217,8 +217,8 @@ an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https
 A few important things to consider when using the EAGLE based draft models:

 1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
-be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
-If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
+be able to be loaded and used directly by vLLM after <gh-pr:12304>.
+If you are using vllm version before <gh-pr:12304>, please use the
 [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
 and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
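
For readers who want a concrete starting point, here is a minimal offline sketch of wiring an EAGLE draft model through `speculative_config`. The model names and config keys below are illustrative assumptions based on recent vLLM releases, not taken from this page; adjust them to the version you have installed.

```python
from vllm import LLM, SamplingParams

# Hypothetical pairing of a base model with an EAGLE draft model from the
# yuhuili HF repository; field names follow recent vLLM releases.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```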

@@ -228,7 +228,7 @@ A few important things to consider when using the EAGLE based draft models:

 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
 reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
-investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
+investigation and tracked here: <gh-issue:9565>.

 A variety of EAGLE draft models are available on the Hugging Face hub:

@@ -212,7 +212,7 @@ if __name__ == '__main__':

 ## `torch.compile` Error

-vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
+vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](gh-pr:10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:

 ??? Code

@@ -231,7 +231,7 @@ vLLM heavily depends on `torch.compile` to optimize the model for better perform
     print(f(x))
     ```

-If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See [this issue](https://github.com/vllm-project/vllm/issues/12219) for example.
+If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <gh-issue:12219> for example.

 ## Model failed to be inspected
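
The full standalone check lives in the troubleshooting page itself; as a rough idea of what it does, a minimal sketch (not the verbatim script) compiles a trivial function and runs it once, which is enough to surface a broken `torch.compile`/`triton` installation:

```python
import torch

@torch.compile
def f(x):
    # Any simple op forces a compile; on CUDA this exercises Inductor/Triton.
    return x * 2 + 1

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4, device=device)
print(f(x))
```

If even this small script fails inside `torch/_inductor`, fix the `triton`/PyTorch mismatch before debugging vLLM itself.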

@@ -2,7 +2,7 @@

 !!! announcement

-    We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
+    We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details.

 V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

@@ -83,7 +83,7 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
 | **Decoder-only Models** | <nobr>🚀 Optimized</nobr> |
 | **Encoder-Decoder Models** | <nobr>🟠 Delayed</nobr> |
 | **Embedding Models** | <nobr>🟢 Functional</nobr> |
-| **Mamba Models** | <nobr>🚧 WIP ([PR #19327](https://github.com/vllm-project/vllm/pull/19327))</nobr> |
+| **Mamba Models** | <nobr>🚧 WIP (<gh-pr:19327>)</nobr> |
 | **Multimodal Models** | <nobr>🟢 Functional</nobr> |

 vLLM V1 currently excludes model architectures with the `SupportsV0Only` protocol.

@@ -98,14 +98,14 @@ See below for the status of models that are not yet supported or have more featu

 The initial basic support is now functional.

-Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249),
-which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360)
+Later, we will consider using [hidden states processor](gh-issue:12249),
+which is based on [global logits processor](gh-pr:13360)
 to enable simultaneous generation and embedding using the same engine instance in V1.

 #### Mamba Models

 Models using selective state-space mechanisms instead of standard transformer attention (e.g., `MambaForCausalLM`, `JambaForCausalLM`)
-will be supported via [PR #19327](https://github.com/vllm-project/vllm/pull/19327).
+will be supported via <gh-pr:19327>.

 #### Encoder-Decoder Models

@@ -120,13 +120,13 @@ are not yet supported.
 | **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
 | **LoRA** | <nobr>🚀 Optimized</nobr> |
 | **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
-| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices ([PR #15191](https://github.com/vllm-project/vllm/pull/15191))</nobr>|
+| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices (<gh-pr:15191>)</nobr>|
 | **Spec Decode** | <nobr>🚀 Optimized</nobr> |
-| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
+| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](gh-issue:13414))</nobr>|
 | **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> |
 | **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
-| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
-| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
+| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](gh-issue:13361))</nobr>|
+| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](gh-pr:13360))</nobr> |
 | **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |

 !!! note

@@ -153,7 +153,7 @@ Support for logprobs with post-sampling adjustments is in progress and will be a

 **Prompt Logprobs with Prefix Caching**

-Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](https://github.com/vllm-project/vllm/issues/13414).
+Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414).

 #### Deprecated Features
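
For illustration, the offline equivalent of the restriction above is to disable prefix caching on the engine when requesting prompt logprobs. This is a minimal sketch assuming a recent vLLM release; the model name is arbitrary.

```python
from vllm import LLM, SamplingParams

# Prompt logprobs currently require prefix caching to be off
# (the offline counterpart of --no-enable-prefix-caching).
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=False)

params = SamplingParams(max_tokens=8, prompt_logprobs=5)  # top-5 logprobs per prompt token
out = llm.generate(["The quick brown fox jumps over the lazy dog"], params)
print(out[0].prompt_logprobs)
```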

@@ -161,11 +161,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha

 **Sampling features**

-- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
+- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
 - **Per-Request Logits Processors**: In V0, users could pass custom
 processing functions to adjust logits on a per-request basis. In vLLM V1, this
 feature has been deprecated. Instead, the design is moving toward supporting **global logits
-processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
+processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).

 **KV Cache features**
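
As a reminder of what is being deprecated, a V0-style per-request logits processor was simply a callable attached to `SamplingParams`. The sketch below assumes the older two-argument callable form (generated token ids, logits) and an arbitrary model; V1 no longer supports this pattern, per the note above.

```python
import torch
from vllm import LLM, SamplingParams

def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # V0-style per-request processor: mask out a single token id.
    logits[42] = float("-inf")
    return logits

# Deprecated pattern: the processor rides along with one request's SamplingParams.
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
llm = LLM(model="facebook/opt-125m")
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```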