Kiran R | bc0c0192d1 | [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration (#3767) | 2024-04-08 19:42:35 +00:00
    Co-authored-by: roy <jasonailu87@gmail.com>

egortolmachev | f46864d68d | [Bugfix] Added Command-R GPTQ support (#3849) | 2024-04-08 14:59:38 +00:00
    Co-authored-by: Egor Tolmachev <t333ga@gmail.com>

ywfang | b4543c8f6b | [Model] add minicpm (#3893) | 2024-04-08 18:28:36 +08:00

Isotr0py | 0ce0539d47 | [Bugfix] Fix Llava inference with Tensor Parallelism. (#3883) | 2024-04-07 22:54:13 +08:00

youkaichao | 2f19283549 | [Core] latency optimization (#3890) | 2024-04-06 19:14:06 -07:00

youkaichao | 95baec828f | [Core] enable out-of-tree model register (#3871) | 2024-04-06 17:11:41 -07:00

Isotr0py | 54951ac4bf | [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) | 2024-04-05 12:02:09 -07:00

SangBin Cho | 18de883489 | [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) | 2024-04-05 10:17:58 -07:00

Thomas Parnell | 1d7c940d74 | Add option to completion API to truncate prompt tokens (#3144) | 2024-04-05 10:15:42 -07:00

youkaichao | c391e4b68e | [Core] improve robustness of pynccl (#3860) | 2024-04-04 16:52:12 -07:00

Saurabh Dash | 9117f892f0 | [Model] Cohere CommandR+ (#3829) | 2024-04-04 13:31:49 -07:00

youkaichao | ca81ff5196 | [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) | 2024-04-04 10:26:19 -07:00

Matthias Gerstgrasser | aabe8f40f2 | [Core] [Frontend] Make detokenization optional (#3749) | 2024-04-03 21:52:18 -07:00
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>

Woosuk Kwon | 498eb5cfa3 | [Bugfix] Add kv_scale input parameter to CPU backend (#3840) | 2024-04-04 04:33:08 +00:00

Michael Feil | 537ee25f43 | [Core] Enable hf_transfer by default if available (#3817) | 2024-04-04 04:02:43 +00:00

Tao He | 294f8f6665 | [BugFix] Pass tokenizer_config to local_tokenizer_group (#3754) | 2024-04-03 20:31:46 -07:00
    Signed-off-by: Tao He <sighingnow@gmail.com>

Adrian Abeyta | 2ff767b513 | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) | 2024-04-03 14:15:55 -07:00
    Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
    Co-authored-by: HaiShaw <hixiao@gmail.com>
    Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
    Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
    Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
    Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
    Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
    Co-authored-by: guofangze <guofangze@kuaishou.com>
    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

SangBin Cho | 3dcb3e8b98 | [3/N] Refactor scheduler for chunked prefill scheduling (#3550) | 2024-04-03 14:13:49 -07:00

Nick Hill | c9b506dad4 | [BugFix] Use different mechanism to get vllm version in is_cpu() (#3804) | 2024-04-02 23:06:25 -07:00

Cade Daniel | 5757d90e26 | [Speculative decoding] Adding configuration object for speculative decoding (#3706) | 2024-04-03 00:40:57 +00:00
    Co-authored-by: Lily Liu <lilyliupku@gmail.com>

youkaichao | a3c226e7eb | [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803) | 2024-04-02 12:57:04 -07:00

Michael Goin | b321d4881b | [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ (#3798) | 2024-04-02 12:35:31 -07:00

leiwen83 | ad6eca408b | Fix early CUDA init via get_architecture_class_name import (#3770) | 2024-04-02 11:56:26 -07:00
    Signed-off-by: Lei Wen <wenlei03@qiyi.com>
    Co-authored-by: Lei Wen <wenlei03@qiyi.com>

A-Mahla | 0739b1947f | [Frontend][Bugfix] allow using the default middleware with a root path (#3788) | 2024-04-02 01:20:28 -07:00
    Co-authored-by: A-Mahla <>

bigPYJ1151 | 0e3f06fe9c | [Hardware][Intel] Add CPU inference backend (#3634) | 2024-04-01 22:07:30 -07:00
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
    Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>

Qubitium | 7d4e1b85e7 | [Misc] Add support for new autogptq checkpoint_format (#3689) | 2024-04-01 19:32:01 -04:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>

Cade Daniel | 93deb0b38f | [Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) | 2024-04-01 22:55:24 +00:00

Nick Hill | 49782fcb76 | [Misc] Some minor simplifications to detokenization logic (#3670) | 2024-04-01 13:22:06 -07:00
    Some simplifications made for clarity.
    Also moves detokenization-related functions from tokenizer.py to detokenizer.py.

youkaichao | 203d4f82ac | [Core][Bugfix] cache len of tokenizer (#3741) | 2024-03-29 18:46:39 -07:00

Nick Hill | 991143cfcd | [BugFix] Use consistent logger everywhere (#3738) | 2024-03-29 23:26:44 +00:00

Simon Mo | 8b2d3cbc1b | usage lib get version another way (#3735) | 2024-03-29 15:57:08 -07:00

Hongxia Yang | 9765b5c406 | [ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic (#3699) | 2024-03-29 14:52:36 -07:00

Simon Mo | 430530fc18 | bump version to v0.4.0 (#3712) | 2024-03-29 12:28:33 -07:00

Roger Wang | 97356f3c7e | [Bugfix] Command-R Max Model Length (#3727) | 2024-03-29 12:27:51 -07:00

Roy | f510395bbf | [BugFix][Frontend] Fix completion logprobs=0 error (#3731) | 2024-03-29 09:38:21 -07:00

Roy | 6110c39dc8 | [BugFix] Fix tokenizer out of vocab size (#3685) | 2024-03-29 08:18:59 -07:00

yhu422 | d8658c8cc1 | Usage Stats Collection (#2852) | 2024-03-28 22:16:12 -07:00

youkaichao | 756b30a5f3 | [Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711) | 2024-03-28 21:19:45 -07:00

Woosuk Kwon | 395aa823ea | [Misc] Minor type annotation fix (#3716) | 2024-03-28 21:12:24 -07:00

youkaichao | f342153b48 | Revert "bump version to v0.4.0" (#3708) | 2024-03-28 18:49:42 -07:00

Simon Mo | 27a57cad52 | bump version to v0.4.0 (#3705) | 2024-03-28 18:26:51 -07:00

youkaichao | 0267fef52a | [Core] fix del of communicator (#3702) | 2024-03-29 00:24:58 +00:00

Simon Mo | 4716a32dd4 | fix logging msg for block manager (#3701) | 2024-03-28 23:29:55 +00:00

Woosuk Kwon | cb40b3ab6b | [Kernel] Add MoE Triton kernel configs for A100 40GB (#3700) | 2024-03-28 15:26:24 -07:00

Roy | 515386ef3c | [Core] Support multi-node inference(eager and cuda graph) (#3686) | 2024-03-28 15:01:55 -07:00

Adam Boeglin | 1715056fef | [Bugfix] Update neuron_executor.py to add optional vision_language_config (#3695) | 2024-03-28 10:43:34 -07:00

SangBin Cho | b51c1cc9d2 | [2/N] Chunked prefill data update (#3538) | 2024-03-28 10:06:01 -07:00

Roger Wang | ce567a2926 | [Kernel] DBRX Triton MoE kernel H100 (#3692) | 2024-03-28 10:05:34 -07:00

wenyujin333 | d6ea427f04 | [Model] Add support for Qwen2MoeModel (#3346) | 2024-03-28 15:19:59 +00:00

Cade Daniel | 14ccd94c89 | [Core][Bugfix]Refactor block manager for better testability (#3492) | 2024-03-27 23:59:28 -07:00