[Docs] Rewrite offline inference guide (#20594)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>

---
title: Offline Inference
---

[](){ #offline-inference }

Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
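
For illustration, the constructor also accepts engine configuration options. The following is a minimal sketch of a few commonly used arguments (the names and values shown are illustrative; see the [`LLM`][vllm.LLM] class reference for the authoritative list and defaults):

```python
from vllm import LLM

# Illustrative engine options; consult the LLM class reference for the
# authoritative set of arguments and their defaults.
llm = LLM(
    model="facebook/opt-125m",
    dtype="auto",                # weight/activation precision
    gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may reserve
    max_model_len=2048,          # maximum sequence length to serve
    tensor_parallel_size=1,      # number of GPUs for tensor parallelism
)
```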

After initializing the `LLM` instance, use the available APIs to perform model inference.
The available APIs depend on the model type (a brief sketch follows the list):

- [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
- [Pooling models][pooling-models] output their hidden states directly.
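
For instance, here is a minimal sketch of the generative path using `LLM.generate` together with `SamplingParams` (the prompts and sampling values are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Sampling parameters control how output text is drawn from the model's logprobs.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result carries the original prompt and one or more completions.
    print(output.prompt, output.outputs[0].text)
```

Pooling models expose different entry points (for example, `LLM.embed` for embedding models); see the pooling models page linked above for details.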

!!! info
    [API Reference][offline-inference-api]

### Ray Data LLM API

Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:

- Streaming execution processes datasets that exceed aggregate cluster memory.
- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.

The following example shows how to run batched inference with Ray Data and vLLM:

<gh-file:examples/offline_inference/batch_llm_inference.py>
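
As a rough sketch of that flow (the `vLLMEngineProcessorConfig` and `build_llm_processor` names follow the Ray Data LLM documentation linked below; the exact parameter names and the placeholder chat model are assumptions and may differ between Ray versions):

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure a pool of vLLM engine replicas for batch inference.
config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder chat model
    engine_kwargs={"max_model_len": 2048},
    concurrency=1,   # number of vLLM replicas
    batch_size=32,   # rows per batch sent to each replica
)

# Build a processor that maps dataset rows to chat requests and back.
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=64),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

ds = ray.data.from_items([{"prompt": "What is the capital of France?"}])
ds = processor(ds)
ds.show(limit=1)
```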

For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).