diff --git a/.buildkite/run-cpu-test.sh b/.buildkite/run-cpu-test.sh
index 1a4dae8f65e9..5a285be03939 100644
--- a/.buildkite/run-cpu-test.sh
+++ b/.buildkite/run-cpu-test.sh
@@ -61,7 +61,7 @@ function cpu_tests() {
     pytest -s -v -k cpu_model \
       tests/basic_correctness/test_chunked_prefill.py"
 
-  # online inference
+  # online serving
   docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
     export VLLM_CPU_KVCACHE_SPACE=10
diff --git a/docs/source/features/structured_outputs.md b/docs/source/features/structured_outputs.md
index ccd9a6a1b1a1..a42c3dd64ad1 100644
--- a/docs/source/features/structured_outputs.md
+++ b/docs/source/features/structured_outputs.md
@@ -5,7 +5,7 @@
 vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding.
 This document shows you some examples of the different options that are available to generate structured outputs.
 
-## Online Inference (OpenAI API)
+## Online Serving (OpenAI API)
 
 You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.
 
@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are:
 - `backend`
 - `whitespace_pattern`
 
-These parameters can be used in the same way as the parameters from the Online Inference examples above.
+These parameters can be used in the same way as the parameters from the Online Serving examples above.
 One example for the usage of the `choices` parameter is shown below:
 
 ```python
diff --git a/docs/source/getting_started/installation/hpu-gaudi.md b/docs/source/getting_started/installation/hpu-gaudi.md
index 1d50cef3bdc8..21822327c882 100644
--- a/docs/source/getting_started/installation/hpu-gaudi.md
+++ b/docs/source/getting_started/installation/hpu-gaudi.md
@@ -83,7 +83,7 @@ $ python setup.py develop
 ## Supported Features
 
 - [Offline inference](#offline-inference)
-- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
+- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -385,5 +385,5 @@ the below:
   completely. With HPU Graphs disabled, you are trading latency and
   throughput at lower batches for potentially higher throughput on
   higher batches. You can do that by adding `--enforce-eager` flag to
-  server (for online inference), or by passing `enforce_eager=True`
+  server (for online serving), or by passing `enforce_eager=True`
   argument to LLM constructor (for offline inference).
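(For context on the hpu-gaudi hunk above: a minimal sketch of the two `enforce_eager` forms the doc refers to. Illustrative only, not part of the patch; the model name is a placeholder.)

```python
from vllm import LLM, SamplingParams

# Offline inference: disable graph capture via the constructor argument.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)

# Online serving: the equivalent switch is the server CLI flag, e.g.
#   vllm serve facebook/opt-125m --enforce-eager
```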
diff --git a/docs/source/getting_started/quickstart.md b/docs/source/getting_started/quickstart.md
index ea15d9ef065f..d7d43785c6c2 100644
--- a/docs/source/getting_started/quickstart.md
+++ b/docs/source/getting_started/quickstart.md
@@ -5,7 +5,7 @@
 This guide will help you quickly get started with vLLM to perform:
 
 - [Offline batched inference](#quickstart-offline)
-- [Online inference using OpenAI-compatible server](#quickstart-online)
+- [Online serving using OpenAI-compatible server](#quickstart-online)
 
 ## Prerequisites
 
diff --git a/docs/source/models/generative_models.md b/docs/source/models/generative_models.md
index a9f74c4d3fbb..6a5a58ad74ab 100644
--- a/docs/source/models/generative_models.md
+++ b/docs/source/models/generative_models.md
@@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template)
 outputs = llm.chat(conversation, chat_template=custom_template)
 ```
 
-## Online Inference
+## Online Serving
 
 Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
 
diff --git a/docs/source/models/pooling_models.md b/docs/source/models/pooling_models.md
index 745f3fd81980..324b1f550e69 100644
--- a/docs/source/models/pooling_models.md
+++ b/docs/source/models/pooling_models.md
@@ -127,7 +127,7 @@ print(f"Score: {score}")
 
 A code example can be found here:
 
-## Online Inference
+## Online Serving
 
 Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
 
diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md
index acbe27a22a67..72910ea1e2d1 100644
--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod
 
 ````{important}
 To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
-or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
+or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
 
 Offline inference:
 ```python
@@ -562,7 +562,7 @@ llm = LLM(
 )
 ```
 
-Online inference:
+Online serving:
 ```bash
 vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
 ```
diff --git a/docs/source/serving/multimodal_inputs.md b/docs/source/serving/multimodal_inputs.md
index 9f5e1b908d78..7e96ed46f2dc 100644
--- a/docs/source/serving/multimodal_inputs.md
+++ b/docs/source/serving/multimodal_inputs.md
@@ -199,7 +199,7 @@ for o in outputs:
     print(generated_text)
 ```
 
-## Online Inference
+## Online Serving
 
 Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
 
diff --git a/examples/online_serving/openai_chat_completion_client_for_multimodal.py b/examples/online_serving/openai_chat_completion_client_for_multimodal.py
index 213d075542e8..03cc037bb677 100644
--- a/examples/online_serving/openai_chat_completion_client_for_multimodal.py
+++ b/examples/online_serving/openai_chat_completion_client_for_multimodal.py
@@ -1,5 +1,5 @@
 """An example showing how to use vLLM to serve multimodal models
-and run online inference with OpenAI client.
+and run online serving with OpenAI client.
 
 Launch the vLLM server with the following command:
 
@@ -309,7 +309,7 @@ def main(args) -> None:
 
 if __name__ == "__main__":
     parser = FlexibleArgumentParser(
-        description='Demo on using OpenAI client for online inference with '
+        description='Demo on using OpenAI client for online serving with '
         'multimodal language models served with vLLM.')
     parser.add_argument('--chat-type',
                         '-c',
diff --git a/tests/models/decoder_only/audio_language/test_ultravox.py b/tests/models/decoder_only/audio_language/test_ultravox.py
index 0bb98df1b58e..1e329dc4cb22 100644
--- a/tests/models/decoder_only/audio_language/test_ultravox.py
+++ b/tests/models/decoder_only/audio_language/test_ultravox.py
@@ -237,8 +237,8 @@ def test_models_with_multiple_audios(vllm_runner, audio_assets, dtype: str,
 
 
 @pytest.mark.asyncio
-async def test_online_inference(client, audio_assets):
-    """Exercises online inference with/without chunked prefill enabled."""
+async def test_online_serving(client, audio_assets):
+    """Exercises online serving with/without chunked prefill enabled."""
 
     messages = [{
         "role":
diff --git a/vllm/model_executor/models/molmo.py b/vllm/model_executor/models/molmo.py
index 2e60bc719f09..c45ee9b921c9 100644
--- a/vllm/model_executor/models/molmo.py
+++ b/vllm/model_executor/models/molmo.py
@@ -1068,7 +1068,7 @@ def input_processor_for_molmo(ctx: InputContext, inputs: DecoderOnlyInputs):
         trust_remote_code=model_config.trust_remote_code)
 
     # NOTE: message formatting for raw text prompt is only applied for
-    # offline inference; for online inference, the prompt is always in
+    # offline inference; for online serving, the prompt is always in
     # instruction format and tokenized.
     if prompt is not None and re.match(r"^User:[\s\S]*?(Assistant:)*$",
                                        prompt):
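(For context on the renamed client example above: a minimal "online serving" call against the OpenAI-compatible server. Illustrative sketch only, not part of the patch; it assumes a server was already started with `vllm serve <model>` on the default port.)

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Use whichever model the server was launched with.
model = client.models.list().data[0].id

chat = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(chat.choices[0].message.content)
```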