Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling ( #3550 )
2024-04-03 14:13:49 -07:00
Michael Feil
c64cf38673
[Doc] Update contribution guidelines for better onboarding ( #3819 )
2024-04-03 07:31:43 +00:00
Robert Shaw
76b889bf1d
[Doc] Update README.md ( #3806 )
2024-04-02 23:11:10 -07:00
Nick Hill
c9b506dad4
[BugFix] Use different mechanism to get vllm version in is_cpu() ( #3804 )
2024-04-02 23:06:25 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding ( #3706 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
youkaichao
a3c226e7eb
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary ( #3803 )
v0.4.0.post1
2024-04-02 12:57:04 -07:00
Michael Goin
b321d4881b
[Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ ( #3798 )
2024-04-02 12:35:31 -07:00
leiwen83
ad6eca408b
Fix early CUDA init via get_architecture_class_name import ( #3770 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-02 11:56:26 -07:00
youkaichao
205b94942e
[CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build ( #3801 )
2024-04-02 11:54:33 -07:00
Roger Wang
3bec41f41a
[Doc] Fix vLLMEngine Doc Page ( #3791 )
2024-04-02 09:49:37 -07:00
A-Mahla
0739b1947f
[Frontend][Bugfix] allow using the default middleware with a root path ( #3788 )
...
Co-authored-by: A-Mahla <>
2024-04-02 01:20:28 -07:00
bigPYJ1151
77a6572aa5
[HotFix] [CI/Build] Minor fix for CPU backend CI ( #3787 )
2024-04-01 22:50:53 -07:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
Cade Daniel
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup ( #3783 )
2024-04-02 00:49:51 +00:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
Roger Wang
ccb58b23e6
[Misc] Fix Benchmark TTFT Calculation for Chat Completions ( #3768 )
2024-04-01 15:24:30 -07:00
Nick Hill
49782fcb76
[Misc] Some minor simplifications to detokenization logic ( #3670 )
...
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Woosuk Kwon
f03cc667a0
[Misc] Minor fixes in requirements.txt ( #3769 )
2024-04-01 10:15:48 +00:00
Robert Shaw
563c1d7ec5
[CI/Build] Make Marlin Tests Green ( #3753 )
2024-03-30 19:18:34 -07:00
youkaichao
9c82a1bec3
[Doc] Update installation doc ( #3746 )
...
[Doc] Update installation doc for build from source and explain the dependency on torch/cuda version ( #3746 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-30 16:34:38 -07:00
mawong-amd
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
Simon Mo
51c31bc10c
CMake build elf without PTX ( #3739 )
v0.4.0
2024-03-30 01:53:08 +00:00
bnellnm
3ad438c66f
Fix build when nvtools is missing ( #3698 )
2024-03-29 18:52:39 -07:00
youkaichao
203d4f82ac
[Core][Bugfix] cache len of tokenizer ( #3741 )
2024-03-29 18:46:39 -07:00
Nick Hill
991143cfcd
[BugFix] Use consistent logger everywhere ( #3738 )
2024-03-29 23:26:44 +00:00
Simon Mo
8b2d3cbc1b
usage lib get version another way ( #3735 )
2024-03-29 15:57:08 -07:00
Hongxia Yang
9765b5c406
[ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic ( #3699 )
2024-03-29 14:52:36 -07:00
Simon Mo
430530fc18
bump version to v0.4.0 ( #3712 )
2024-03-29 12:28:33 -07:00
Roger Wang
97356f3c7e
[Bugfix] Command-R Max Model Length ( #3727 )
2024-03-29 12:27:51 -07:00
Roy
f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error ( #3731 )
2024-03-29 09:38:21 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
yhu422
d8658c8cc1
Usage Stats Collection ( #2852 )
2024-03-28 22:16:12 -07:00
Simon Mo
7bc94a0fdd
add ccache to docker build image ( #3704 )
2024-03-28 22:14:24 -07:00
youkaichao
756b30a5f3
[Core][Test] move local_rank to the last arg with default value ( #3711 )
...
[Core][Test] move local_rank to the last arg with default value to keep api compatible ( #3711 )
2024-03-28 21:19:45 -07:00
Woosuk Kwon
395aa823ea
[Misc] Minor type annotation fix ( #3716 )
2024-03-28 21:12:24 -07:00
SangBin Cho
26422e477b
[Test] Make model tests run again and remove --forked from pytest ( #3631 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
youkaichao
f342153b48
Revert "bump version to v0.4.0" ( #3708 )
2024-03-28 18:49:42 -07:00
Simon Mo
27a57cad52
bump version to v0.4.0 ( #3705 )
2024-03-28 18:26:51 -07:00
Yile (Michael) Gu
98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel ( #3628 )
2024-03-28 17:33:52 -07:00
youkaichao
0267fef52a
[Core] fix del of communicator ( #3702 )
2024-03-29 00:24:58 +00:00
Simon Mo
4716a32dd4
fix logging msg for block manager ( #3701 )
2024-03-28 23:29:55 +00:00
Woosuk Kwon
c0935c96d3
[Bugfix] Set enable_prefix_caching=True in prefix caching example ( #3703 )
2024-03-28 16:26:30 -07:00
Woosuk Kwon
cb40b3ab6b
[Kernel] Add MoE Triton kernel configs for A100 40GB ( #3700 )
2024-03-28 15:26:24 -07:00
Roy
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) ( #3686 )
2024-03-28 15:01:55 -07:00
Simon Mo
a4075cba4d
[CI] Add test case to run examples scripts ( #3638 )
2024-03-28 14:36:10 -07:00
Simon Mo
96aa014d1e
fix benchmark format reporting in buildkite ( #3693 )
2024-03-28 14:35:16 -07:00
Adam Boeglin
1715056fef
[Bugfix] Update neuron_executor.py to add optional vision_language_config ( #3695 )
2024-03-28 10:43:34 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00