baonudesifeizhai
|
ff82fce3b2
|
fix for cpu
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
|
2025-12-23 15:58:39 -05:00 |
|
baonudesifeizhai
|
e713ba4039
|
fix and add unit test
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
|
2025-12-23 14:13:20 -05:00 |
|
baonudesifeizhai
|
0b5e466c8d
|
fix
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
|
2025-12-22 21:39:34 -05:00 |
|
baonudesifeizhai
|
e8985d9716
|
Add workaround for TorchInductor get_raw_stream bug
|
2025-12-17 18:16:48 -05:00 |
|
baonudesifeizhai
|
9d70afe6c6
|
Add workaround for TorchInductor get_raw_stream bug
|
2025-12-17 18:13:53 -05:00 |
|
Andrey Talman
|
e06d0bf0aa
|
2.9.1 PyTorch release update (#28495)
|
2025-12-17 12:20:22 -08:00 |
|
Matthew Bonanni
|
7eb6cb6c18
|
[Attention] Update tests to remove deprecated env vars (#30563)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-12-17 09:49:59 -08:00 |
|
Cyrus Leung
|
2497228ad4
|
[Chore] Factor out logic for requesting initial memory (#30868)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-12-17 07:32:17 -08:00 |
|
KimHyemin
|
196cdc3224
|
[Model] Gemma3: Support untied word embeddings (#30827)
Signed-off-by: www-spam <panmahm@naver.com>
|
2025-12-17 07:11:18 -08:00 |
|
高鑫崧
|
b7b6a60aca
|
Adapt the old parameter enable_thinking in chat_template_kwargs (#30852)
Signed-off-by: xinsong.gao <1418762819@qq.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
|
2025-12-17 07:10:59 -08:00 |
|
Jialin Ouyang
|
6e9dbcc50e
|
[Fix] uniform decode batch check (#30747)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
|
2025-12-17 19:58:43 +08:00 |
|
Hank_
|
6482e3895b
|
chores: adjust the attn register param order (#30688)
Signed-off-by: Hank <hcc.mayday@gmail.com>
|
2025-12-17 19:58:16 +08:00 |
|
Harry Mellor
|
fb980eb2fd
|
Fix lazy import (#30858)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-12-17 03:33:50 -08:00 |
|
baoqian426
|
84896fda22
|
[Bugfix] deepseek-V3.2 self.weights_proj has no bias (#30841)
Signed-off-by: baoqian <1354987947@qq.com>
Signed-off-by: baoqian426 <1354987947@qq.com>
|
2025-12-17 03:32:34 -08:00 |
|
Chauncey
|
9ad5b21710
|
[Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory (#30749)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2025-12-17 02:27:30 -08:00 |
|
Wentao Ye
|
f284d7bd0c
|
[Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv (#30823)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-17 02:00:35 -08:00 |
|
Zhengxu Chen
|
53cd7f868b
|
[compile] Recompile graph module during Dynamo cache loading. (#30743)
Signed-off-by: Zhengxu Chen <zhxchen17@fb.com>
|
2025-12-17 02:00:12 -08:00 |
|
danielafrimi
|
7b966ae2ba
|
[Fix] Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) (#30785)
Signed-off-by: <>
Co-authored-by: root <root@gpu-937.slurm-workers-slurm.slurm.svc.cluster.local>
|
2025-12-17 01:56:38 -08:00 |
|
Zhengxu Chen
|
9db1db5949
|
[compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors (#30809)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
|
2025-12-17 01:56:24 -08:00 |
|
Zhengxu Chen
|
177c391db2
|
[compile] Disable aot when eager backend is used. (#30810)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
|
2025-12-17 01:55:56 -08:00 |
|
Michael Goin
|
519ef9a911
|
[UX] Make vllm bench serve discover model by default and use --input-len (#30816)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-17 01:55:30 -08:00 |
|
Ye (Charlotte) Qi
|
a100152288
|
[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#30842)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
|
2025-12-17 01:54:21 -08:00 |
|
Andrew Xia
|
4c054d89aa
|
[Doc][ResponsesAPI] add documentation (#30840)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
|
2025-12-17 01:53:02 -08:00 |
|
Xinyu Chen
|
3b1d440ede
|
CustomOp: grouped topk (#29575)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
|
2025-12-17 17:43:00 +08:00 |
|
Asaf Joseph Gardin
|
a9e15c21ef
|
[Mamba] Removed disable cascade attn in MambaModelConfig (#30712)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
|
2025-12-17 08:48:53 +00:00 |
|
Robin
|
20fda43151
|
[Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction (#30555)
Signed-off-by: WangErXiao <863579016@qq.com>
|
2025-12-17 16:37:57 +08:00 |
|
Yan Ma
|
4f735babb7
|
[XPU] fix broken fp8 online quantization for XPU platform (#30831)
Signed-off-by: Yan Ma <yan.ma@intel.com>
|
2025-12-17 00:28:13 -08:00 |
|
Li, Jiang
|
0cd5353644
|
[Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models (#30829)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-12-16 23:25:12 -08:00 |
|
Michael Goin
|
d4d2751732
|
Update note comment for flashinfer attention warmup (#30711)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-16 21:29:03 -08:00 |
|
Grzegorz K. Karch
|
f5db6385a1
|
Fix nemotron_nas intermediate_size computation (#30795)
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
|
2025-12-17 01:06:28 +00:00 |
|
Nicolò Lucchesi
|
e087fbc393
|
[MM] Pass FA version in ViT Attn (#30756)
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2025-12-17 07:54:45 +08:00 |
|
TJian
|
2410132bb1
|
[ROCm] [Bugfix] Fix torch sdpa hallucination (#30789)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-12-16 15:32:43 -08:00 |
|
Jinzhen Lin
|
ce96857fdd
|
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) (#29901)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-12-16 14:35:28 -08:00 |
|
Roger Wang
|
f5f51e5931
|
[Core][MM] Optimize encoder cache manager by operating with embeddings only (#30475)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Sun Kim <sunytokki@gmail.com>
|
2025-12-16 14:18:17 -08:00 |
|
Lucas Wilkinson
|
9fec0e13d5
|
[Attention] Cache attention metadata builds across hybrid KV-cache groups (#29627)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
|
2025-12-16 17:10:16 -05:00 |
|
jiahanc
|
254a7f8fd6
|
[Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE (#30014)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
|
2025-12-16 13:01:48 -08:00 |
|
Nicolò Lucchesi
|
ca702a14dc
|
[Frontend] Add max-completion-token option to transcription/translation endpoints (#30769)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-12-16 19:36:49 +00:00 |
|
Michael Goin
|
10ee1c64cf
|
[CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test (#30723)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-12-16 14:28:34 -05:00 |
|
Mark McLoughlin
|
66c3537e5d
|
[Docs][API] Remove warning about LoRARequest being internal-only (#30774)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-12-16 08:35:46 -08:00 |
|
Harry Mellor
|
e1625498f4
|
Update where bytes_to_unicode is imported from (#30771)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-12-16 08:05:01 -08:00 |
|
Harry Mellor
|
0b0acc758e
|
Remove head_mask from Ultravox and Swin (#30764)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-12-16 08:02:41 -08:00 |
|
Ming Yang
|
ce12b407f2
|
[TRTLLM] Remove the MoE GEMM weight name change (#30713)
Signed-off-by: Ming Yang <minos.future@gmail.com>
|
2025-12-16 11:01:38 -05:00 |
|
Wentao Ye
|
59bd5f6a71
|
[Feat] Enable eplb with default all2all backend (#30559)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-12-16 10:33:52 -05:00 |
|
Lucas Wilkinson
|
00a8d7628c
|
[BugFix] Fix memory spike in workspace allocation (#30744)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2025-12-16 06:46:22 -08:00 |
|
Nicolò Lucchesi
|
75eb302a2e
|
[Bugfix] Whisper fix number of allocated CrossAttn blocks per-request (#30772)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-12-16 14:20:19 +00:00 |
|
Pleaplusone
|
9dbbc59b15
|
[ROCm][MTP] Support MTP for AITER MLA backend (#28624)
Signed-off-by: ganyi <ygan@amd.com>
|
2025-12-16 14:10:26 +00:00 |
|
Boyuan Feng
|
104003dc77
|
update piecewise cudagraph warning when splitting_ops=[] (#30728)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
|
2025-12-16 06:09:34 -08:00 |
|
TJian
|
d0fb572929
|
[ROCm] [AITER] [DOC] Add usage description about check functions in _aiter_ops (#30586)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-12-16 13:50:47 +00:00 |
|
Harry Mellor
|
6f15ac5de7
|
Don't assume position_embedding_type will be present for BERT and RoBERTa models (#30770)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-12-16 13:40:26 +00:00 |
|
Junru Shen
|
676db55eec
|
[Bugfix] Fix prefix_repetition routing in bench throughput (#29663)
Signed-off-by: Junru Shen <jrshen.sjr@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-12-16 01:37:15 -08:00 |
|