# Prompt Embedding Inputs

This page teaches you how to pass prompt embedding inputs to vLLM.

## What are prompt embeddings?

The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer), and then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up in a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
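As a concrete illustration, here is a minimal sketch of that lookup using Hugging Face Transformers (the checkpoint name matches the examples later on this page; any decoder-only causal LM works the same way):

```python
import transformers

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

# Text -> token ids, via the tokenizer.
token_ids = tokenizer("Paris is the capital of", return_tensors="pt").input_ids

# Token ids -> prompt embeddings, via the model's learned embedding matrix.
embedding_layer = model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids)  # shape: (1, sequence_length, hidden_size)
```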

:::{note}
Prompt embeddings are currently only supported in the v0 engine.
:::

## Offline Inference

To input prompt embeddings, follow this schema in {class}`vllm.inputs.EmbedsPrompt`:

- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. It has shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model, as illustrated in the sketch below.
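The following minimal sketch shows only the schema: it feeds a randomly initialized tensor of the expected shape to `LLM.generate`. The hidden size (2048) and dtype (bfloat16) used here are assumptions for this checkpoint; real embeddings would normally come from a model, as in the next subsection.

```python
import torch
from vllm import LLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"
llm = LLM(model=model_name, enable_prompt_embeds=True)

# A placeholder tensor with the expected shape (sequence_length, hidden_size).
# 2048 is the assumed hidden size of this checkpoint and bfloat16 its assumed
# dtype; the random values do not encode meaningful text.
prompt_embeds = torch.randn(16, 2048, dtype=torch.bfloat16)

outputs = llm.generate({"prompt_embeds": prompt_embeds})
print(outputs[0].outputs[0].text)
```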

### Hugging Face Transformers Inputs

You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:

```python
from vllm import LLM
import transformers

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
embedding_layer = transformers_model.get_input_embeddings()  # token-id -> embedding lookup

llm = LLM(model=model_name, enable_prompt_embeds=True)

# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')

prompt_embeds = embedding_layer(token_ids).squeeze(0)

# Single prompt inference
outputs = llm.generate({
    "prompt_embeds": prompt_embeds,
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batch inference

chats = [
    [{"role": "user", "content": "Please tell me about the capital of France."}],
    [{"role": "user", "content": "When is the day longest during the year?"}],
    [{"role": "user", "content": "Where is bigger, the moon or the sun?"}]
]

token_ids_list = [
    tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt') for chat in chats
]
prompt_embeds_list = [embedding_layer(token_ids).squeeze(0) for token_ids in token_ids_list]

outputs = llm.generate(
    [
        {
            "prompt_embeds": prompt_embeds,
        } for prompt_embeds in prompt_embeds_list
    ]
)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```

## Online Serving

Our OpenAI-compatible server accepts prompt embeddings inputs via the [Completions API](https://platform.openai.com/docs/api-reference/completions). Prompt embeddings inputs are added via a new `'prompt_embeds'` key in the JSON package.

When a mixture of `'prompt_embeds'` and `'prompt'` inputs is provided in a single request, the prompt embeds are always returned first.

Prompt embeddings are passed in as base64 encoded torch tensors.
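As a sketch of what the JSON package looks like on the wire, the following example posts directly to the Completions endpoint with the `requests` library. It assumes a server is already running with `--enable-prompt-embeds` (the launch command is shown in the next subsection), and the random tensor is a placeholder with this checkpoint's assumed hidden size and dtype:

```python
import base64
import io

import requests
import torch

# Placeholder tensor with the assumed hidden size (2048) and dtype of this
# checkpoint; in practice you would compute real embeddings as in the examples
# on this page.
prompt_embeds = torch.randn(16, 2048, dtype=torch.bfloat16)

# Serialize the tensor with torch.save and base64-encode the resulting bytes.
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
encoded_embeds = base64.b64encode(buffer.getvalue()).decode("utf-8")

# The embeddings travel in the JSON body under the 'prompt_embeds' key.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        # An empty placeholder prompt, mirroring the OpenAI-client example below.
        "prompt": "",
        "prompt_embeds": encoded_embeds,
        "max_tokens": 5,
        "temperature": 0.0,
    },
)
print(response.json())
```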

### Transformers Inputs via OpenAI Client

First, launch the OpenAI-compatible server:

```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
  --max-model-len 4096 --enable-prompt-embeds
```

Then, you can use the OpenAI client as follows:

```python
import base64
import io

from openai import OpenAI
import transformers
import torch

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model_name = "meta-llama/Llama-3.2-1B-Instruct"

# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
embedding_layer = transformers_model.get_input_embeddings()  # token-id -> embedding lookup

# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors='pt')

prompt_embeds = embedding_layer(token_ids).squeeze(0)

# Prompt embeddings: serialize the tensor and base64-encode it for the JSON body
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode('utf-8')

completion = client.completions.create(
    model=model_name,
    # NOTE: The OpenAI client does not allow `None` as an input to
    # `prompt`. Use an empty string if you have no text prompts.
    prompt="",
    max_tokens=5,
    temperature=0.0,
    # NOTE: The OpenAI client allows passing in extra JSON body via the
    # `extra_body` argument.
    extra_body={"prompt_embeds": encoded_embeds}
)

print(completion.choices[0].text)
```