[Doc]Add documentation for using EAGLE in vLLM (#11417)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
@@ -159,6 +159,72 @@ A variety of speculative models of this type are available on HF hub:
- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)

## Speculating using EAGLE based draft models

The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    # The draft model must be an EAGLE checkpoint converted for vLLM (see note 1 below).
    # The draft runs with tensor parallel size 1 even though the target model is sharded across 4 GPUs.
    speculative_model="path/to/modified/eagle/model",
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

A few important things to consider when using the EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
   used directly with vLLM due to differences in the expected layer names and model definition.
   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
   to convert them. Note that this script does not modify the model's weights.

   In the above example, use the script to first convert
   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
   and then use the converted checkpoint as the draft model in vLLM (a sketch of this workflow is shown after this list).

2. The EAGLE based draft models need to be run without tensor parallelism
   (i.e. `speculative_draft_tensor_parallel_size` must be set to 1), although
   it is possible to run the main model using tensor parallelism (see the example above).

3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
   investigation and is tracked in [vllm-project/vllm#9565](https://github.com/vllm-project/vllm/issues/9565).

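Putting note 1 together with the example above, the end-to-end workflow might look like the following minimal sketch. The `snapshot_download` call is the standard `huggingface_hub` API for fetching a checkpoint, but the conversion command in the comment is only illustrative (the actual arguments are defined by the linked gist), and the output directory is a placeholder you choose.

```python
from huggingface_hub import snapshot_download
from vllm import LLM

# 1. Download the original EAGLE checkpoint (its layer names are not yet compatible with vLLM).
eagle_dir = snapshot_download(repo_id="yuhuili/EAGLE-LLaMA3-Instruct-8B")

# 2. Convert it with the script linked in note 1, for example:
#    python convert_eagle.py --eagle-dir <eagle_dir> --output-dir ./eagle-llama3-8b-vllm
#    (script name and flags are illustrative; see the gist for its actual interface).

# 3. Use the converted checkpoint as the draft model. The draft must run with
#    tensor parallel size 1 (note 2), while the target model may still use tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="./eagle-llama3-8b-vllm",  # placeholder path to the converted checkpoint
    speculative_draft_tensor_parallel_size=1,
)
```
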
A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
|----------------------------|-------------------------------------|--------------------|
| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |

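Any of these checkpoints can be used in the same way once converted. As an illustrative sketch only: the base-model repo id `Qwen/Qwen2-7B-Instruct` and the local draft path below are assumptions, and the draft checkpoint must first be converted as described in note 1.

```python
from vllm import LLM, SamplingParams

# Pair a base model from the table with its converted EAGLE draft.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    speculative_model="./eagle-qwen2-7b-instruct-vllm",  # placeholder path to the converted checkpoint
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)
```
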
## Lossless guarantees of Speculative Decoding
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of speculative decoding.