7928 Commits

Author SHA1 Message Date
Wentao Ye
226b452a20
Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" (#21384)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-22 08:22:10 -07:00
Raushan Turganbay
f38ee34a0a
[feat] Enable mm caching for transformers backend (#21358)
Signed-off-by: raushan <raushan@huggingface.co>
2025-07-22 08:18:46 -07:00
Benjamin Bartels
b194557a6c
Adds parallel model weight loading for runai_streamer (#21330)
Signed-off-by: bbartels <benjamin@bartels.dev>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-22 08:15:53 -07:00
Wentao Ye
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant (#21083)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-22 07:27:15 -07:00
Duncan Moss
2c8db17cfd
[feat]: add SM100 support for cutlass FP8 groupGEMM (#20447)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-22 07:27:12 -07:00
Mickaël Seznec
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm (#21116)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-22 07:07:44 -07:00
Ning Xie
0df4d9b06b
[Misc] unify variable for LLM instance v2 (#21356)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-22 06:32:36 -07:00
Jialin Ouyang
ed25054577
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool (#21222)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 06:17:47 -07:00
Jialin Ouyang
10904e6d75
[benchmark] Port benchmark request sent optimization to benchmark_serving (#21209)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 05:28:00 -07:00
Jialin Ouyang
a32237665d
[Core] Optimize update checks in LogitsProcessor (#21245)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 05:27:18 -07:00
Kebe
bc8a8ce5ec
[Misc] Remove deprecated args in v0.10 (#21349)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-07-22 05:26:39 -07:00
Simon Mo
32142b3c62
[Bugfix] Fix eviction cached blocked logic (#21357)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-07-22 01:18:40 -07:00
Raghav Ravishankar
82b8027be6
Add arcee model (#21296)
Signed-off-by: alyosha-swamy <raghav@arcee.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-22 00:57:43 -07:00
rongfu.leng
3779eb8c81
[Feature][eplb] add verify ep or tp or dp (#21102)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-07-21 23:41:14 -07:00
Shu Wang
9e23ad9655
Update fp4 quantize API (#21327)
Signed-off-by: Shu Wang <shuw@nvidia.com>
2025-07-21 23:40:21 -07:00
Wentao Ye
e69a92a1ce
[Bug] DeepGemm: Fix Cuda Init Error (#21312)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-21 23:36:18 -07:00
Varun Sundar Rabindranath
8425f785ad
[Misc] DeepEPHighThroughtput - Enable Inductor pass (#21311)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-21 23:35:45 -07:00
Konrad Zawora
c17231e827
Fix kv_cache_dtype handling for out-of-tree HPU plugin (#21302)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
2025-07-21 23:35:14 -07:00
Wentao Ye
6e5b5ca580
[Refactor] Fix Compile Warning #1444-D (#21208)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-21 23:33:51 -07:00
Thomas Parnell
488d8a986a
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible (#21300)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-21 23:31:18 -07:00
Jialin Ouyang
af376ca19d
[Core] Minimize number of dict lookup in _maybe_evict_cached_block (#21281)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-21 22:37:34 -07:00
Ming Yang
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) (#21334)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-21 21:49:01 -07:00
Ratnam Parikh
90f1e55421
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU (#21338)
Signed-off-by: ratnampa <ratnam.parikh@intel.com>
2025-07-21 21:48:27 -07:00
Li, Jiang
5e70dcd6e6
[Doc] Fix CPU doc format (#21316)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-21 21:47:49 -07:00
Chaojun Zhang
25d585ab7b
[XPU] Enable external_launcher to serve as an executor via torchrun (#21021)
Signed-off-by: chzhang <chaojun.zhang@intel.com>
2025-07-21 21:47:35 -07:00
Lu Fang
8d0a01a5f2
[v1][sampler] Inplace logprobs comparison to get the token rank (#21283)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-07-21 13:47:47 -07:00
Himanshu Jaju
0ec82edda5
[perf] Speed up align sum kernels (#21079)
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
2025-07-21 11:19:23 -07:00
Michael Goin
005ae9be6c
Fix bad lm-eval fork (#21318) 2025-07-21 10:47:51 -07:00
Robert Shaw
29d1ffc5b4
[DP] Fix Prometheus Logging (#21257)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-07-21 09:11:35 -07:00
Lucas Wilkinson
304dce7ec0
[Attention] Clean up iRoPE in V1 (#21188)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-07-21 09:10:30 -07:00
Ming Yang
6ece16c4fe
[Misc] Add dummy maverick test (#21199)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-21 09:08:09 -07:00
simpx
a0e827e07c
[BugFix] make utils.current_stream thread-safety (#21252) (#21253)
Signed-off-by: simpx <simpxx@gmail.com>
2025-07-21 09:07:36 -07:00
Li, Jiang
a15a50fc17
[CPU] Enable shared-memory based pipeline parallel for CPU backend (#21289)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-21 09:07:08 -07:00
Woosuk Kwon
6dda13c86b
[Misc] Add sliding window to flashinfer test (#21282)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-21 08:37:49 -07:00
Zhiyu
6b46c4b653
Add Nvidia ModelOpt config adaptation (#19815)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-07-21 10:02:58 -04:00
Ning Xie
d97841078b
[Misc] unify variable for LLM instance (#20996)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-21 12:18:33 +01:00
Harry Mellor
e6b90a2805
[Docs] Make tables more space efficient in supported_models.md (#21291)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-21 02:25:02 -07:00
Harry Mellor
be54a951a3
[Docs] Fix hardcoded links in docs (#21287)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-21 02:23:57 -07:00
Cyrus Leung
042af0c8d3
[Model][1/N] Support multiple poolers at model level (#21227)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-21 02:22:21 -07:00
Cyrus Leung
378d33c392
[Bugfix] Fix missing placeholder in logger debug (#21280)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-20 22:50:06 -07:00
Huy Do
940af1f03a
Add the instruction to run e2e validation manually before release (#21023)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-20 22:29:18 -07:00
Simon Mo
92615d7fe8
[Docs] Add RFC Meeting to Issue Template (#21279)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-07-20 21:58:07 -07:00
Kay Yan
8188196a1c
[CI] Cleanup modelscope version constraint in Dockerfile (#21243)
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
2025-07-20 20:13:02 -07:00
Jiayi Yan
7ba34b1241
[bugfix] fix syntax warning caused by backslash (#21251) 2025-07-20 17:12:10 +00:00
Raushan Turganbay
9499e26e2a
[Model] Support VLMs with transformers backend (#20543)
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-20 13:25:50 +00:00
Calvin Chen
51ba839555
[Model] use AutoWeightsLoader for bart (#18299)
Signed-off-by: calvin chen <120380290@qq.com>
2025-07-20 08:15:50 +00:00
Seiji Eicher
d1fb65bde3
Enable v1 metrics tests (#20953)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
v0.10.0rc1
2025-07-20 03:22:02 +00:00
Chengji Yao
3a1d8940ae
[TPU] support fp8 kv cache quantization (#19292)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-20 03:01:00 +00:00
Thomas Parnell
2b504eb770
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. (#21233)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-19 16:09:58 -07:00
Yuxuan Zhang
10eb24cc91
GLM-4 Update (#20736)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Lu Fang <fanglu@fb.com>
2025-07-19 22:40:31 +00:00