Cyrus Leung
9fb52e523a
[V1] Support any head size for FlexAttention backend ( #20467 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00
wang.yuqi
2e26f9156a
[Model][3/N] Automatic conversion of CrossEncoding model ( #20168 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-04 05:47:39 -07:00
wang.yuqi
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model ( #19978 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-03 13:59:23 +00:00
Jee Jee Li
1819fbda63
[Quantization] Bump to use latest bitsandbytes ( #20424 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-03 21:58:46 +08:00
Cyrus Leung
b024a42e93
[Core] Move multimodal placeholder from chat utils to model definition ( #20355 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-03 08:18:30 +00:00
Nick Hill
657f2f301a
[DP] Support external DP Load Balancer mode ( #19790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-02 10:21:52 -07:00
WangHuaqiang
ccbfb1d1c9
[Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models ( #20322 )
...
Signed-off-by: Wang Huaqiang <huaqiang.wang@intel.com>
2025-07-02 12:53:36 +00:00
Chenheli Hua
2e7cbf2d7d
[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. ( #20105 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-07-01 23:34:03 -07:00
Lionel Villard
c05596f1a3
[Perf] Validate @config in pre-commit instead of dynamically ( #20200 )
...
Signed-off-by: Lionel Villard <villard@us.ibm.com>
2025-07-01 05:10:28 -04:00
Luka Govedič
6d42ce8315
[CLI] Improve CLI arg parsing for -O/--compilation-config ( #20156 )
...
Signed-off-by: luka <luka@neuralmagic.com>
2025-07-01 01:03:13 +00:00
Jiayi Yan
7b460c25f9
[BugFix] Fix the incorrect func name in the comments. (config.py) ( #20185 )
2025-06-27 22:51:16 -07:00
Luka Govedič
aafabaa0d5
[Fix][torch.compile] Enable custom ops by default when Inductor off ( #20102 )
...
Signed-off-by: luka <luka@neuralmagic.com>
2025-06-27 09:00:42 -06:00
Michael Goin
4ab3ac285e
[Bugfix] Fix flaky failure when getting DP ports ( #20151 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-06-27 15:30:53 +08:00
wang.yuqi
cd4cfee689
[Model][1/N] Automatic conversion of CrossEncoding model ( #20012 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-06-26 21:10:04 -07:00
Bowen Wang
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) ( #18343 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com>
2025-06-26 15:30:21 -07:00
Michael Goin
1f5d178e9c
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" ( #20128 )
2025-06-26 07:32:22 -07:00
zhrrr
9f0608fc16
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine ( #20062 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
2025-06-25 21:03:17 +00:00
Aaron Pham
ba7ba35cda
[Chore] debloat some initial logs ( #19438 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-06-25 06:36:22 +00:00
David Xia
7108934142
[Frontend] speed up import time of vllm.config ( #18036 )
...
Signed-off-by: David Xia <david@davidxia.com>
2025-06-25 00:41:11 -04:00
cascade
e6327c9b3e
[Feature] Support sequence parallelism for static fp8 quantization ( #19181 )
...
Signed-off-by: cascade812 <cascade812@outlook.com>
2025-06-23 16:09:02 -04:00
Aaron Pham
c4cf260677
[Perf][CLI] Improve overall startup time ( #19941 )
2025-06-22 23:11:22 +00:00
Aaron Pham
e91386cde1
[Chore] dedup logs ( #19955 )
2025-06-22 19:43:07 +00:00
Ning Xie
2bb246b8f7
[MISC] add cpu_kvcache_space_bytes to CacheConfig ( #19812 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-22 13:39:09 +08:00
wangxiyuan
e773a9e1c2
[Misc] Clean up useless code ( #19889 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-06-20 21:09:09 +00:00
Maximilien de Bayser
799397ee4f
Support embedding models in V1 ( #16188 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-06-18 21:36:33 -07:00
Ning Xie
6e9cc73f67
[MISC] correct DeviceConfig device field static type analysis ( #19699 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-17 17:21:50 -07:00
Ning Xie
26bc46ef89
[MISC] typo fix ( #19672 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-16 07:18:49 +00:00
Ye (Charlotte) Qi
b692e9cd07
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config ( #19660 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-06-16 06:30:29 +00:00
Woosuk Kwon
055915e6ce
Enable prefix caching with full cuda graphs ( #19617 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-15 01:05:05 -07:00
Woosuk Kwon
aafbbd981f
[torch.compile] Use custom ops when use_inductor=False ( #19618 )
2025-06-13 15:05:54 -07:00
youkaichao
d70bc7c029
[torch.compile] reorganize the cache directory to support compiling multiple models ( #19064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-13 15:23:25 +08:00
Luka Govedič
f98548b9da
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass ( #16756 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2025-06-12 08:31:04 -07:00
rasmith
c7ea0b56cd
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger ( #17331 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-06-11 15:53:28 -04:00
Richard Zou
77f0d465d0
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 ( #19390 )
...
Signed-off-by: rzou <zou3519@gmail.com>
2025-06-11 07:54:41 +08:00
Siyuan Liu
3a7cd627a8
[Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration ( #19383 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-06-09 16:41:51 -07:00
wang.yuqi
2ffb9b6e07
[Bugfix] model_max_length should consider max_model_len in tokenizer_config ( #19201 )
2025-06-08 07:17:53 -07:00
Richard Zou
eaa2e51088
[Bugfix] Re-enable use_cudagraph in vLLM v1 ( #19299 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-06-08 08:56:12 +08:00
Richard Zou
da511d54d8
Fix CompilationConfig repr ( #19091 )
...
Signed-off-by: rzou <zou3519@gmail.com>
2025-06-06 16:23:35 +08:00
Chen Zhang
f8a1a2d108
[v1] Hybrid Memory Allocator ( #17996 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-06-05 20:47:09 -07:00
Cyrus Leung
01dc9a76db
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52 ( #18678 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-06-04 04:49:20 -07:00
Varun Sundar Rabindranath
fa98d77773
[Kernel] DeepEP dispatch-combine kernel integration ( #18434 )
...
Signed-off-by: Varun <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-06-03 12:30:02 -07:00
Simon Mo
02f0c7b220
[Misc] Add SPDX-FileCopyrightText ( #19100 )
...
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-06-03 11:20:17 -07:00
Rui Qiao
bdce64f236
[V1] Support DP with Ray ( #18779 )
2025-06-02 21:15:13 -07:00
Siyuan Liu
9112b443a0
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD ( #18011 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: Hossein Sarshar <hossein.sarshar@gmail.com>
Co-authored-by: Chengji Yao <chengjiyao@google.com>
2025-06-03 00:06:20 +00:00
Gregory Shtrasberg
ca2f6b9c30
[Bugfix][Model] Attempt to fix eagle in V0. ( #18978 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-06-02 08:15:53 -07:00
jennyyyyzhen
ebb1ec9318
[Model] enable data parallel for Llama4 vision encoder ( #18368 )
...
Signed-off-by: yzhen <yzhen@devgpu093.cco2.facebook.com>
Co-authored-by: yZhen <yZhen@fb.com>
Co-authored-by: yzhen <yzhen@devgpu093.cco2.facebook.com>
2025-06-02 19:22:54 +08:00
Cyrus Leung
6aa8f9a4e7
[Core] Rework dtype resolution ( #18751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-06-01 11:04:23 +08:00
Satyajith Chilappagari
2a50ef5760
[Neuron] Add Multi-Modal model support for Neuron ( #18921 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com>
Co-authored-by: Rohith Nallamaddi <nalrohit@amazon.com>
Co-authored-by: FeliciaLuo <luof@amazon.com>
Co-authored-by: Elaine Zhao <elaineyz@amazon.com>
2025-05-31 10:39:11 +00:00
Yikun Jiang
3c49dbdd03
Skip device and quant Pydantic validation to make plugin device work ( #18843 )
...
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-05-28 20:12:30 -07:00
aws-elaineyz
1661a9c28f
[Doc][Neuron] Update documentation for Neuron ( #18868 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
2025-05-28 19:44:01 -07:00