5996 Commits

Author SHA1 Message Date
zifeitong
a71e4765cc
[Bugfix] Fix Qwen2.5-VL quantized model weights loading (#23512)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
2025-08-25 10:40:22 +08:00
Noam Gat
39971db3aa
Frontend: Adding LM Format Enforcer support to V1 engine (#22564)
Signed-off-by: Noam Gat <noamgat@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-24 19:31:22 -07:00
Ming Yang
504d914314
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 (#23504)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-08-24 18:06:35 -07:00
Didier Durand
47455c424f
[Doc: ]fix various typos in multiple files (#23487)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-25 00:04:04 +00:00
Lucia Fang
c7fc6b1354
fix incompatibililty with non cuda platform for nvfp4 (#23478)
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>
2025-08-24 15:35:41 -07:00
Woosuk Kwon
ad78868450
[Misc] Remove unused slot_mapping buffer (#23502)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-24 14:03:36 -07:00
Cyrus Leung
e2db1164a1
[Model] Enable BLOOM on V1 (#23488)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-24 13:30:47 +00:00
汪志鹏
416f05929a
[New Model]Donut model (#23229)
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
2025-08-24 12:52:24 +00:00
rongfu.leng
1b9b16649c
[Misc] update dict parse to EPLBConfig from json dumps to dict unpacking (#23305)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-08-24 08:06:34 +00:00
czhu-cohere
e76e233540
[kernel] Support W4A8 on Hopper (#23198)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-08-24 06:18:04 +00:00
Benji Beck
a75277285b
Migrate Paligemma inputs to TensorSchema (#23470)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-24 04:56:56 +00:00
22quinn
9dc30b7068
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks (#23477)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-08-24 12:56:17 +08:00
Benji Beck
053278a5dc
Migrate Pixtral inputs to TensorSchema (#23472)
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-24 04:55:53 +00:00
Jiangyun Zhu
c55c028998
[gpt-oss] Streaming Output for Python Tool (#23409)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-08-24 04:42:38 +00:00
Jee Jee Li
65197a5fb3
[Misc] Modify CacheConfig import (#23459)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-23 06:05:27 +00:00
Xu Wenqing
b8f17f5d98
Support DeepSeek-V3.1 tool call (#23454)
Signed-off-by: Xu Wenqing <xuwq1993@qq.com>
2025-08-23 05:50:16 +00:00
Cyrus Leung
b4e9fd811f
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion (#20000)" (#23396)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-23 04:16:48 +00:00
Chenxi Yang
308fa287a8
Add glm4.5v tp2,4 fp8 config on H100_80GB (#23443)
Co-authored-by: Chenxi Yang <cxyang@meta.com>
2025-08-23 02:54:19 +00:00
Daifeng Li
fa78de9dc3
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs (#22527)
Signed-off-by: feng <fengli1702@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-22 20:53:21 -06:00
WeiQing Chen
23c939fd30
[Model] Support DP for ViT on MiniCPM-V-4 (#23327)
Signed-off-by: ycyaw66 <497410282@qq.com>
Co-authored-by: ycyaw66 <497410282@qq.com>
2025-08-23 02:14:41 +00:00
Nick Hill
add1adfec7
[BugFix] Fix MinPLogitsProcessor.update_states() (#23401)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-23 08:22:11 +08:00
Nick Hill
c80c53a30f
[BugFix] Fix batch updates for pooling models (#23398)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-23 08:20:41 +08:00
elvischenv
24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel (#22703)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-22 22:09:05 +00:00
rasmith
cc7ae5e7ca
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ not being called when it should when using quantized FP8 model (#22281)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-08-22 21:47:57 +00:00
Ilya Markov
0313cf854d
[PERF] PyTorch Symmetric Memory All-Reduce (#20759)
Signed-off-by: ilmarkov <imarkov@redhat.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: ilmarkov <imarkov@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-08-22 15:39:08 -06:00
Shiyan Deng
da65bec309
add an env var for path to pre-downloaded flashinfer cubin files (#22675) 2025-08-22 19:25:45 +00:00
Isotr0py
4645024d3a
[Quantization] Allow GGUF quantization to skip unquantized layer (#23188)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-22 13:04:22 -06:00
Isotr0py
cd7a3df26f
[Bugfix] Fix broken Florence-2 model (#23426)
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com>
2025-08-22 17:50:52 +00:00
Isotr0py
32d2b4064f
[Model] Add Ovis2.5 PP support (#23405)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-22 17:46:34 +00:00
Didier Durand
22cf679aad
[Doc]: fix various typos in multiple files (#23179)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-08-22 10:38:46 -07:00
bppps
424fb7a5d2
[BugFix] Fix the issue where image embeddings were incorrectly split.… (#23366)
Signed-off-by: bppps <bpppsaka@gmail.com>
Co-authored-by: zouyu.zzx <zouyu.zzx@alibaba-inc.com>
Co-authored-by: bppps <bpppsaka@gmail.com>
2025-08-22 16:56:46 +00:00
PapaGoose
88491c1b6b
[Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support (#23337) 2025-08-22 16:39:19 +00:00
Ning Xie
325aa3dee9
[Misc] local import code clean (#23420)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-22 14:01:35 +00:00
杨朱 · Kiki
695e7adcd2
[misc] Remove outdate comment about runai_model_streamer (#23421)
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-08-22 13:08:53 +00:00
Russell Bryant
281710ef9a
[Attention] Allow V1 flash_attn to support cross-attention (#23297)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-08-22 12:10:16 +00:00
Woosuk Kwon
808d2e9aa0
[Misc] Move M-RoPE init logic to _init_mrope_positions (#23422)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-22 03:07:22 -07:00
Li, Jiang
88016c372a
[Bugfix] Fix pooling models on CPU backend (#23392)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-08-22 09:47:17 +00:00
Benji Beck
998720859c
Migrate MiniCPMOAudioInputs to TensorSchema (#21847)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 16:43:29 +08:00
Guillaume Calmettes
0ba1b54ac6
[gpt-oss] add input/output usage in responses api when harmony context is leveraged (#22667)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2025-08-22 08:32:24 +00:00
Flora Feng
53415653ff
[P/D][Nixl] Make kv cache register compatible with hybrid memory allocator (#23079)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2025-08-21 22:30:48 -07:00
Chen Zhang
17373dcd93
[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models (#23154)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-22 05:05:59 +00:00
Bin Jia
5964069367
[New Model] Add Seed-Oss model (#23241)
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-22 04:58:10 +00:00
Wentao Ye
394591e343
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement (#23351)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-21 21:01:08 -07:00
Benji Beck
0b9cc56fac
Migrate MllamaImagePixelInputs to TensorSchema (#22020)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-08-22 11:28:49 +08:00
Cyrus Leung
8896eb72eb
[Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed (#18800)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-22 10:56:57 +08:00
Matthew Bonanni
19fe1a0510
[Kernel] Add FP8 support with FlashMLA backend (#22668)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-08-22 02:26:32 +00:00
22quinn
480bdf5a7b
[Core] Support custom executor qualname (#23314)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-22 09:40:54 +08:00
Kebe
5368f76855
[Feature][Responses API] Support logprobs(non-stream) (#23319)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-08-21 23:09:16 +00:00
Michael Goin
3bbe11cc13
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm (#23265)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-21 17:56:15 -04:00
Woosuk Kwon
800349c2a5
[Structured Outputs] Refactor bitmask construction into get_grammar_bitmask (#23361)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-21 20:53:33 +00:00