# Offline Inference
Offline inference is possible in your own code using vLLM's [LLM][vllm.LLM] class.
For example, the following code downloads the `facebook/opt-125m` model from HuggingFace
and loads it into vLLM with the default configuration.
```python
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
After initializing the LLM instance, use the available APIs to perform model inference.
The available APIs depend on the model type:
- [Generative models][generative-models] output log probabilities, which are sampled from to obtain the final output text (see the sketch after this list).
- [Pooling models][pooling-models] output their hidden states directly.
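
For a generative model, a minimal sketch of this flow uses `LLM.generate` together with `SamplingParams`; the prompts and sampling values below are illustrative. Pooling models instead expose methods such as `LLM.encode`, which return hidden states rather than sampled text.

```python
from vllm import LLM, SamplingParams

# Reuse the engine initialized above.
llm = LLM(model="facebook/opt-125m")

# Illustrative sampling settings; tune temperature/max_tokens for your use case.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    sampling_params,
)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```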
!!! info
    [API Reference][offline-inference-api]
## Ray Data LLM API
Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine. This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
- Streaming execution processes datasets that exceed aggregate cluster memory.
- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
The following example shows how to run batched inference with Ray Data and vLLM: <gh-file:examples/offline_inference/batch_llm_inference.py>
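
The core pattern in that example looks roughly like the sketch below, assuming Ray's `ray.data.llm` module with `vLLMEngineProcessorConfig` and `build_llm_processor`; exact parameter names may differ between Ray versions.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure a pool of vLLM replicas; the values here are illustrative.
config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    engine_kwargs={"max_model_len": 2048},
    concurrency=1,   # number of vLLM replicas
    batch_size=32,   # rows per batch sent to each replica
)

# Build a processor that maps dataset rows to chat requests and back.
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=64),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

# Run streaming batch inference over a Ray Dataset.
ds = ray.data.from_items([{"prompt": "What is the capital of France?"}])
ds = processor(ds)
ds.show(limit=1)
```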
For more information about the Ray Data LLM API, see the Ray Data LLM documentation.