Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
Robert Shaw
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-24 12:28:27 +00:00
youkaichao
6a50f4cafa
[Doc] add ccache guide in doc ( #5012 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-23 23:21:54 +00:00
Elisei Smirnov
e3470f8753
[Core]: Option To Use Prompt Token Ids Inside Logits Processor ( #4985 )
...
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
2024-05-23 22:04:24 +00:00
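For illustration, a minimal sketch of what a logits processor receiving prompt token ids could look like. The three-argument callback shape and all names below are assumptions based on the PR title, not vLLM's exact API:

```python
import torch

def no_prompt_echo(prompt_token_ids, generated_token_ids, logits):
    # Hypothetical processor using the new option: with the prompt ids
    # available, forbid re-emitting any token that appeared in the prompt.
    banned = torch.tensor(sorted(set(prompt_token_ids)), device=logits.device)
    logits[banned] = float("-inf")
    return logits
```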
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Murali Andoorveedu
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-05-23 09:54:48 -07:00
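As an illustration of the pattern (not vLLM's PyNCCL wrapper itself), torch.distributed exposes the same point-to-point send/recv inside a process group; a two-rank sketch:

```python
import torch
import torch.distributed as dist

# Run with two processes (e.g. torchrun --nproc_per_node=2). The names
# here are torch.distributed's, used only to illustrate group send/recv.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
buf = torch.ones(4, device="cuda")
if rank == 0:
    dist.send(buf, dst=1)   # blocking send to rank 1
elif rank == 1:
    dist.recv(buf, src=0)   # blocking receive from rank 0
```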
Letian Li
2ba80bed27
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined ( #5009 )
2024-05-23 09:08:58 -07:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
Cody Yu
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
Philipp Moritz
a36de682d4
[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig ( #4991 )
2024-05-22 22:26:56 +00:00
Nick Hill
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead ( #4894 )
2024-05-23 06:17:27 +09:00
raywanb
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532.
This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (via a .kv_scale parameter).
2024-05-22 13:28:20 -07:00
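A hedged sketch of the idea, not vLLM's actual loader: checkpoint entries named *.kv_scale are routed to the matching attention layer so the FP8 KV cache can be quantized with the trained scale. All names below are illustrative:

```python
def load_kv_scales(state_dict, attn_layers):
    # attn_layers: mapping from layer-name prefix to its attention module
    # (hypothetical structure, for illustration only).
    for name, tensor in state_dict.items():
        if name.endswith(".kv_scale"):
            prefix = name[: -len(".kv_scale")]
            attn_layers[prefix].kv_scale = float(tensor)
```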
Tyler Michael Smith
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs to avoid future issues with CUDA graphs.
2024-05-22 14:10:43 +00:00
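The failure mode this guards against can be seen from PyTorch's public graph-capture API: only work enqueued on the capture stream is recorded, so a kernel launched on a stream of its own would escape the graph. A minimal capture sketch:

```python
import torch

a = torch.randn(128, 128, device="cuda")
b = torch.randn(128, 128, device="cuda")

# Warm up on a side stream so cuBLAS allocations happen before capture,
# as the torch.cuda.graph docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        a @ b
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    c = a @ b  # issued on the capture stream, so it is recorded
g.replay()     # re-executes the captured matmul
```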
SangBin Cho
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
Michael Goin
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with clang-format ( #4722 )
2024-05-22 07:18:41 +00:00
sasha0552
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
Isotr0py
99eff67ba9
[Bugfix][Kernel] Add head size check for attention backend selection ( #4944 )
2024-05-21 15:33:25 -04:00
Kante Yin
14772eeb8e
[Bugfix] Fix flag name for max_seq_len_to_capture ( #4935 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-05-21 09:30:52 -07:00
Michael Goin
757b62c495
[CI/Build] Codespell ignore build/ directory ( #4945 )
2024-05-21 09:06:10 -07:00
Simon Mo
e941f88584
[Docs] Add acknowledgment for sponsors ( #4925 )
2024-05-21 00:17:25 -07:00
Isotr0py
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
HUANG Fei
d130b573a0
[Model] add rope_scaling support for qwen2 ( #4930 )
2024-05-21 05:22:22 +00:00
Antoni Baum
65ae8c2c8f
[Core] Fix scheduler considering "no LoRA" as "LoRA" ( #4897 )
2024-05-20 17:48:32 -07:00
Kuntai Du
c3af44722c
[Doc] Add documentation to benchmarking script when running TGI ( #4920 )
2024-05-20 20:16:57 +00:00
Aurick Qiao
1937e29848
[Core] Sharded State Loader download from HF ( #4889 )
2024-05-20 11:46:12 -07:00
Mor Zusman
f0eecee610
[Bugfix] Fix dummy weight for fp8 ( #4916 )
...
Allow the dummy load format for fp8;
torch.Tensor.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-05-20 18:44:25 +00:00
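A minimal sketch of the workaround, assuming a PyTorch build with the float8_e4m3fn dtype (the function name is illustrative):

```python
import torch

def dummy_fp8_weight(shape, low=-1e-3, high=1e-3):
    # uniform_ is not implemented for FP8 dtypes, so initialize in
    # fp16 first and cast down afterwards.
    w = torch.empty(shape, dtype=torch.float16)
    w.uniform_(low, high)
    return w.to(torch.float8_e4m3fn)
```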
Alexei-V-Ivanov-AMD
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Wenwei Zhang
546a97ef69
[Misc]: allow user to specify port in distributed setting ( #4914 )
2024-05-20 17:45:06 +00:00
Alexander Matveev
da5a0b539d
Remove marlin warning ( #4918 )
2024-05-20 14:55:34 +00:00
Cyrus Leung
6287537a0c
[Model] LLaVA model refactor ( #4910 )
2024-05-20 08:11:25 +00:00
Woosuk Kwon
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
Alexander Matveev
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
Cyrus Leung
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
SangBin Cho
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through.
It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
Follow-up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
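To make the multi-cache idea concrete, a hedged sketch of building one linearly scaled cos-sin cache per scaling factor and stacking them so a batched kernel could gather rows via a per-request offset (illustrative only; the actual batched kernel lives in CUDA):

```python
import torch

def linear_scaled_cache(head_dim, max_pos, scaling, base=10000.0):
    # Linear RoPE scaling: positions are divided by the factor, and the
    # cache covers max_pos * scaling positions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(int(max_pos * scaling)).float() / scaling
    freqs = torch.outer(t, inv_freq)
    return torch.cat([freqs.cos(), freqs.sin()], dim=-1)

# One stacked tensor; a request with scaling factor s reads its rows at
# the offset where that factor's cache begins.
caches = torch.cat([linear_scaled_cache(128, 4096, s) for s in (1.0, 2.0, 4.0)])
```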
alexeykondrat
c0724fc915
[ROCm][Hardware][AMD] Adding Navi21 fallback to naive attention if Triton is not used ( #4658 )
2024-05-18 05:09:11 +00:00
Michael Goin
86b45ae065
[Bugfix] Relax tiktoken to >= 0.6.0 ( #4890 )
2024-05-17 12:58:52 -06:00
Antoni Baum
c5711ef985
[Doc] Update Ray Data distributed offline inference example ( #4871 )
2024-05-17 10:52:11 -07:00
eigenLiu
48d5985a08
Sync Hugging Face modifications of Qwen MoE model ( #4774 )
2024-05-17 09:43:19 -07:00
Jinzhen Lin
33e0823de5
[Bugfix] fix rope error when loading models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
bofeng huang
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
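The underlying tokenizer behavior, shown with the Hugging Face API (the model name is only an example): when add_special_tokens is left at its default of True, a BOS token is prepended, which double-prepends BOS for prompts that already include one.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
with_bos = tok.encode("Hello", add_special_tokens=True)      # BOS + tokens
without_bos = tok.encode("Hello", add_special_tokens=False)  # tokens only
```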
Kante Yin
8e7fb5d43a
Support serving vLLM on Kubernetes with LWS ( #4829 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-05-16 16:37:29 -07:00
Woosuk Kwon
9a31a817a8
[Bugfix] Fix FP8 KV cache support ( #4869 )
2024-05-16 22:42:29 +00:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Silencio
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
youkaichao
10fa9eea21
[Misc] remove old comments ( #4866 )
2024-05-16 11:07:41 -07:00
youkaichao
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
Hongxia Yang
b5853f9963
[ROCm][AMD][Bugfix] adding a missing triton autotune config ( #4845 )
2024-05-16 10:46:52 -07:00