fix[Docs]: link anchor is incorrect #20309 (#20315)

Signed-off-by: zxw <1020938856@qq.com>
yyzxw authored on 2025-07-02 14:32:34 +08:00, committed by GitHub
parent 1a03dd496b
commit be0cfb2b68
8 changed files with 10 additions and 10 deletions


@@ -6,7 +6,7 @@ title: Engine Arguments
 Engine arguments control the behavior of the vLLM engine.
 - For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
-- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
+- For [online serving][serving-openai-compatible-server], they are part of the arguments to `vllm serve`.
 You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.


@@ -74,7 +74,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
 That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
-More details on the API server can be found in the [OpenAI-Compatible Server][openai-compatible-server] document.
+More details on the API server can be found in the [OpenAI-Compatible Server][serving-openai-compatible-server] document.
 ## LLM Engine


@@ -21,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_grammar`: the output will follow the context free grammar.
 - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
 Structured outputs are supported by default in the OpenAI-Compatible Server. You
 may choose to specify the backend to use by setting the


@@ -110,7 +110,7 @@ docker run \
 ### Supported features
 - [Offline inference][offline-inference]
-- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
+- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,


@@ -134,7 +134,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
 ## Online Serving
-Our [OpenAI-Compatible Server][openai-compatible-server] provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
 - [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
 - [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.


@@ -113,7 +113,7 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/scor
 ## Online Serving
-Our [OpenAI-Compatible Server][openai-compatible-server] provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
 - [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
 - [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs][multimodal-inputs] for embedding models.


@@ -34,7 +34,7 @@ llm.apply_model(lambda model: print(type(model)))
 If it is `TransformersForCausalLM` then it means it's based on Transformers!
 !!! tip
-You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][openai-compatible-server].
+You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][serving-openai-compatible-server].
 !!! note
 vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
@@ -53,8 +53,8 @@ For a model to be compatible with the Transformers backend for vLLM it must:
 If the compatible model is:
-- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][openai-compatible-server].
-- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][openai-compatible-server].
+- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server].
+- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][serving-openai-compatible-server].
 This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!


@@ -1,7 +1,7 @@
 ---
 title: OpenAI-Compatible Server
 ---
-[](){ #openai-compatible-server }
+[](){ #serving-openai-compatible-server }
 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
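For context, this last hunk renames the anchor definition itself, while the preceding hunks update every reference to it. In these MkDocs sources, a line of the form `[](){ #anchor-name }` defines a named anchor target, and `[link text][anchor-name]` on other pages resolves to it. Below is a minimal sketch of the corrected pairing; the placement comments are illustrative, not file paths taken from this commit:

```markdown
<!-- page that defines the anchor (illustrative placement) -->
[](){ #serving-openai-compatible-server }

<!-- any other page that links to it -->
See the [OpenAI-Compatible Server][serving-openai-compatible-server] page for details.
```

Both sides have to agree: because the definition and every reference to it are renamed in the same commit, the cross-links continue to resolve after the change.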