2445 Commits

Author SHA1 Message Date
Isotr0py
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention (#19754)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-05 09:41:10 +00:00
Jee Jee Li
906e05d840
[Misc] Remove the unused LoRA test code (#20494)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-05 13:48:16 +08:00
Michael Goin
c108781c85
[CI Bugfix] Fix pre-commit failures on main (#20502) 2025-07-04 14:17:30 -07:00
Duncan Moss
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 (#19757)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Duncan Moss <dmoss@nvidia.com>
2025-07-04 12:58:04 -06:00
Thomas Parnell
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models (#20016)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-07-04 17:46:53 +00:00
wang.yuqi
2e26f9156a
[Model][3/N] Automatic conversion of CrossEncoding model (#20168)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-04 05:47:39 -07:00
sangbumlikeagod
9e5452ee34
[Bug][Frontend] Fix structure of transcription's decoder_prompt (#18809)
Signed-off-by: sangbumlikeagod <oironese@naver.com>
2025-07-04 11:28:07 +00:00
Jee Jee Li
1caca5a589
[Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
Aaron Pham
4a98edff1f
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers (#20365)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-07-04 15:05:49 +08:00
Seiji Eicher
8d1096e7db
[Bugfix] Register reducer even if transformers_modules not available (#19510)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-07-03 22:08:12 +00:00
bnellnm
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. (#18864)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-07-03 14:55:40 -07:00
Ning Xie
1dba2c4ebe
[Misc] adjust for ipv6 for mookcacke url parse (#20107)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-03 20:27:17 +00:00
wang.yuqi
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model (#19978)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-03 13:59:23 +00:00
Cyrus Leung
b024a42e93
[Core] Move multimodal placeholder from chat utils to model definition (#20355)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-03 08:18:30 +00:00
Nick Hill
67d25eca05
[Tests] Update online DP tests to verify that requests are balanced (#20157)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-03 14:49:13 +08:00
qscqesze
363528de27
[Feature] Support MiniMax-M1 function calls features (#20297)
Signed-off-by: QscQ <qscqesze@gmail.com>
Signed-off-by: qingjun <qingjun@minimaxi.com>
2025-07-03 06:48:27 +00:00
Chenheli Hua
b616f6a53d
[Misc] Small: Fix video loader return type annotations. (#20389)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-07-03 03:10:39 +00:00
Nick Hill
657f2f301a
[DP] Support external DP Load Balancer mode (#19790)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-02 10:21:52 -07:00
Nick Hill
d265414dbc
[Minor] Clean up incorrect comment in test (#20382)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-02 09:13:37 -07:00
afeldman-nm
48fb076cbc
[V1] LogitsProcessor programming model (#16728)
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-07-02 09:10:42 -07:00
bnellnm
c1909e7e8c
[Kernels] MoE refactor (#19636)
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
2025-07-02 06:08:27 -07:00
WangHuaqiang
ccbfb1d1c9
[Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models (#20322)
Signed-off-by: Wang Huaqiang <huaqiang.wang@intel.com>
2025-07-02 12:53:36 +00:00
CSWYF3634076
e303dcf523
[Model] Add Ernie4.5 and Ernie4.5MoE Model Support (#20220)
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
2025-07-02 03:37:01 -07:00
Kwai-Keye
8452946c06
[Model][VLM] Support Keye-VL-8B-Preview (#20126)
Signed-off-by: Kwai-Keye <Keye@kuaishou.com>
2025-07-01 23:35:04 -07:00
Chenheli Hua
2e7cbf2d7d
[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. (#20105)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-07-01 23:34:03 -07:00
Chengji Yao
7da296be04
[TPU] kv cache update kernel supports dynamic grid (#20235)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-02 06:33:37 +00:00
Wentao Ye
7058d7dd5d
[Refactor] Remove duplicate find_free_port (#20333)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:07 -07:00
Liangliang Ma
a0389e0554
[UT][intel GPU] use current_platform instead of device hardcode in v1 tests (#20169)
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
2025-07-02 09:06:04 +08:00
czhu-cohere
3abfe22154
Enable group size 64 for Machete (#20290)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 18:05:44 -07:00
Wentao Ye
e81fbefe8a
[Refactor] Refactor import utils (#20269)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 18:05:42 -07:00
Woosuk Kwon
7f280d69c9
[Optimization] Cache sampled token ids in model runner (#20291)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 11:01:31 -07:00
aiyiwang2025
ecad851cbd
[Model]Add Tencent HunYuanMoEV1 Model Support (#20114)
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: quinnrong <quinnrong@tencent.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-01 07:28:13 -07:00
Yuxuan Zhang
ed70f3c64f
Add GLM4.1V model (Draft) (#19331)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-01 12:48:26 +00:00
Kyle Sayers
9025a9a705
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper (#20046) 2025-07-01 19:20:34 +09:00
Lionel Villard
c05596f1a3
[Perf] Validate @config in pre-commit instead of dynamically (#20200)
Signed-off-by: Lionel Villard <villard@us.ibm.com>
2025-07-01 05:10:28 -04:00
TY-AMD
96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine (#19067)
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-07-01 16:12:19 +08:00
Varun Sundar Rabindranath
08d81f1014
[Bugfix] Fix deepep tests (#20288)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-01 15:29:08 +08:00
Li, Jiang
6cc1e7d96d
[CPU] Update custom ops for the CPU backend (#20255)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-01 07:25:03 +00:00
czhu-cohere
9909726d2a
Enable ZP Support for Machete (#20268)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 07:12:20 +00:00
Alex Kogan
27949354fa
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference (#18768)
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-07-01 05:44:38 +00:00
fyuan1316
e28533a16f
[Bugfix] Fix include prompt in stream response when echo=true (#15233)
Signed-off-by: Yuan Fang <yuanfang@alauda.io>
2025-07-01 01:30:14 +00:00
Luka Govedič
6d42ce8315
[CLI] Improve CLI arg parsing for -O/--compilation-config (#20156)
Signed-off-by: luka <luka@neuralmagic.com>
2025-07-01 01:03:13 +00:00
Kyle Sayers
d8cf819a9a
[Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models (#20058)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-06-30 17:26:49 +00:00
Wentao Ye
551ef1631a
[Unit Test] Add unit test for deep gemm (#20090)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-30 10:26:42 -06:00
Woosuk Kwon
2863befce3
[Optimization] Use Shared CachedRequestData Instance Across All Requests (#20232)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-30 09:07:50 -07:00
redmoe-moutain
65b1cbb138
[Model] support dots1 (#18254)
Signed-off-by: redmoe-moutain <agiredmoe@gmail.com>
2025-06-29 19:34:36 -07:00
Dipika Sikka
6f2f53a82d
[Quantization] Add compressed-tensors NVFP4 MoE Support (#19990)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-06-29 22:05:40 +00:00
Michael Goin
7b1895e6ce
[CI Fix] Try fixing eagle e2e test OOM by reducing block allocation (#20213)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-06-29 10:31:37 +08:00
Wentao Ye
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx (#20187)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-28 22:06:38 +00:00
Stan Wozniak
daec9dea6e
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution (#20137)
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
2025-06-28 08:16:41 -07:00