7162 Commits

Author SHA1 Message Date
Di Liu
5b3ad5ecf2
[DOC] fix doc typos (#19600)
Signed-off-by: Di Liu <liu-di@sjtu.edu.cn>
2025-06-17 11:34:53 +08:00
nguyenhoangthuan99
ede5c4ebdf
[Frontend] add chunking audio for > 30s audio (#19597)
Signed-off-by: nguyenhoangthuan99 <thuanhppro12@gmail.com>
2025-06-17 11:34:00 +08:00
Lucas Wilkinson
07334959d8
[Wheel Size] Only build FA2 8.0+PTX (#19336) 2025-06-17 12:32:49 +09:00
David Xia
119f683949
[doc] add project flag to gcloud TPU command (#19664)
Signed-off-by: David Xia <david@davidxia.com>
2025-06-17 01:00:09 +00:00
Conroy Cheers
0860087aff
[Fix] Fall back to Gloo when NCCL backend is unavailable (#19641)
Signed-off-by: conroy-cheers <conroy@corncheese.org>
2025-06-17 08:42:14 +08:00
Dipika Sikka
6bc7b57315
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 (#19563) 2025-06-16 17:33:51 -04:00
Russell Bryant
90f9c2eb5c
[V1] Change return type on get_multimodal_embeddings() (#19446)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-06-16 13:32:15 -04:00
qscqesze
387bdf0ab9
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (#19677)
Signed-off-by: QscQ <qscqesze@gmail.com>
2025-06-16 09:47:14 -07:00
bnellnm
5e5baa91aa
[Kernels] Use empty for modular MoE workspaces (#19667)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-16 14:58:01 +00:00
Chauncey
836d4ce140
[Bugfix] fix missing 'finish_reason': null in streaming chat (#19662)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-06-16 14:10:39 +00:00
Ning Xie
c3fec47bb7
[MISC] bump huggingface_hub pkg to 0.33.0 (#19547)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-16 05:22:28 -07:00
Isotr0py
1173804dca
[Bugfix] Fix TP inference for Flex attention backend (#19657)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-06-16 11:21:37 +00:00
Shawn Tan
4d5424029b
[Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (#19652)
Signed-off-by: Shawn Tan <shawntan@ibm.com>
2025-06-16 11:14:18 +00:00
Navanit Dubey
3e7506975c
[DOC] Add reasoning capability to vLLM streamlit code (#19557) 2025-06-16 07:09:12 -04:00
Nick Hill
ee35e96ac3
[BugFix] Don't catch BaseException when dumping execute_model errors (#19626)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-16 11:01:08 +00:00
Szymon Ożóg
dec66d253b
[Kernel] GGUF MMVQ kernel for multiple input vectors (#18754)
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
2025-06-16 17:33:26 +08:00
Russell Bryant
8d120701fd
[Docs] Move multiproc doc to v1 dir (#19651)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-06-16 09:10:12 +00:00
wang.yuqi
f40f763f12
[CI] Add mteb testing for rerank models (#19344) 2025-06-16 01:36:43 -07:00
Ning Xie
26bc46ef89
[MISC] typo fix (#19672)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-06-16 07:18:49 +00:00
Chengji Yao
a77aea59fd
[TPU] support attention head dim smaller than 128 (#19620)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-06-16 06:40:53 +00:00
Ye (Charlotte) Qi
b692e9cd07
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (#19660)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-06-16 06:30:29 +00:00
Francesco Bertolotti
367871a469
[Misc][Frontend] passthrough bad_words (#19564)
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai>
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com>
2025-06-16 05:05:13 +00:00
quanliu
92183b41f3
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (#18957)
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn>
2025-06-15 21:56:37 -07:00
Lu Fang
c6703d1e0d
[MISC] Remove unused variableds in C++ (#19609)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-06-15 20:05:28 -07:00
Isotr0py
a5e7242d5f
[Misc] Remove duplicate multiproc method setting for CPU platform (#19649)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-06-16 02:26:58 +00:00
Richard Zou
91b2c17a55
[CI/Build] Fix torch nightly CI dependencies part 2 (#19589) 2025-06-15 20:01:10 +08:00
Woosuk Kwon
055915e6ce
Enable prefix caching with full cuda graphs (#19617)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-06-15 01:05:05 -07:00
Wentao Ye
3d330c4c09
[Benchmark] Refactor benchmark script for fp8 & int8 (#19627)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-15 15:15:37 +08:00
22quinn
0b73736a0d
[Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check (#19339)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-06-15 13:43:48 +08:00
Lu Fang
ee1531bc38
[Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (#19644) 2025-06-14 21:15:41 -07:00
Ilya Markov
e13945f9dd
[Perf] Further tunings for SM100 FP8 CUTLASS kernel (#19566) 2025-06-14 17:25:10 -07:00
maobaolong
08500011d3
[Fix] Convert kv_transfer_config from dict to KVTransferConfig (#19262) 2025-06-14 12:32:07 -07:00
Konrad Zawora
861a0a0a39
[Bugfix] Don't attempt to use triton if no driver is active (#19561) 2025-06-14 12:30:54 -07:00
Huy Do
bc956b38d0
Only build CUTLASS MoE kernels on Hopper (#19648) 2025-06-14 11:44:15 -07:00
jiahanc
294fc1e2c9
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (#19500) 2025-06-14 09:34:28 -07:00
Isotr0py
2db9044ab6
[Bugfix] Fix auto dtype casting for BatchFeature (#19316)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-06-14 15:13:08 +00:00
Reid
6fa718a460
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts (#19593)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-06-14 16:54:52 +08:00
Lu Fang
06be858828
[Bugfix] Fix the speculative decoding test by setting the target dtype (#19633) 2025-06-13 20:57:32 -07:00
Saheli Bhattacharjee
d1e34cc9ac
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (#18354)
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai>
2025-06-14 11:07:36 +08:00
Nick Hill
bd517eb9fe
[BugFix] Fix DP Coordinator incorrect debug log message (#19624)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-14 00:18:03 +00:00
Concurrensee
d65668b4e8
Adding "AMD: Multi-step Tests" to amdproduction. (#19508)
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-06-13 17:08:51 -07:00
Woosuk Kwon
aafbbd981f
[torch.compile] Use custom ops when use_inductor=False (#19618) 2025-06-13 15:05:54 -07:00
Anna Pendleton
0f0874515a
[Doc] Add troubleshooting section to k8s deployment (#19377)
Signed-off-by: Anna Pendleton <pendleton@google.com>
2025-06-13 21:47:51 +00:00
Luka Govedič
3597b06a4f
[CUDA] Enable full cudagraph for FlashMLA (#18581)
Signed-off-by: luka <luka@neuralmagic.com>
2025-06-13 18:12:26 +00:00
Reid
1015296b79
[doc][mkdocs] fix the duplicate Supported features sections in GPU docs (#19606)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-06-13 16:25:08 +00:00
Wentao Ye
ce9dc02c93
[Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl (#19573)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-13 06:12:15 -07:00
qscqesze
a24cb91600
[Model] Fix minimax model cache & lm_head precision (#19592)
Signed-off-by: qingjun <qingjun@minimaxi.com>
2025-06-13 12:08:20 +00:00
Nick Hill
7e8d97dd3f
[BugFix] Honor enable_caching in connector-delayed kvcache load case (#19435)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-06-13 09:46:32 +00:00
youkaichao
d70bc7c029
[torch.compile] reorganize the cache directory to support compiling multiple models (#19064)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-06-13 15:23:25 +08:00
Boyuan Feng
ce688ad46e
use base version for version comparison (#19587)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-06-13 15:09:34 +08:00