ccd21e1993 | 2025-04-11 18:36:37 +00:00 | Alexander Matveev
    [V1] Fix profiling.py
    Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>

4d022cbc75 | 2025-04-11 17:06:14 +00:00 | Nicolò Lucchesi
    [TPU][V1] Make --disable_chunked_mm_input mandatory for serving MM models (#16483)
    Signed-off-by: NickLucche <nlucches@redhat.com>

70de35a881 | 2025-04-11 16:24:36 +00:00 | Richard Zou
    Fix erroneous "model doesn't support compile" warning (#16486)
    Signed-off-by: rzou <zou3519@gmail.com>

34b2cf3b33 | 2025-04-11 07:38:36 -07:00 | Tomasz Zielinski
    [Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU (#12779)
    Signed-off-by: Tomasz Zielinski <tomasz.zielinski@intel.com>

9e90c9f73f | 2025-04-11 10:18:32 -04:00 | chaow-amd
    [Bugfix] Fix bugs of running Quark quantized models (#16236)
    Signed-off-by: chaow <chaow@amd.com>

e9528f6dc6 | 2025-04-11 06:50:50 -06:00 | DefTruth
    [Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173)
    Signed-off-by: DefTruth <qiustudent_r@163.com>

a26f59ccbc | 2025-04-11 01:51:20 -07:00 | Jee Jee Li
    [Misc] Raise error for V1 not supporting Long LoRA. (#16415)
    Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

aa3b3d76e0 | 2025-04-11 08:09:52 +00:00 | Michael Goin
    Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (#16447)
    Signed-off-by: mgoin <mgoin64@gmail.com>

f7030df3be | 2025-04-11 15:32:37 +08:00 | Jee Jee Li
    [Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner (#15990)
    Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

905e91e9ac | 2025-04-11 06:44:22 +00:00 | DefTruth
    Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" (#16453)

f8f9c0ba62 | 2025-04-11 14:19:40 +08:00 | Alex Brooks
    [Bugfix] Don't set an upper bound on repetition penalty (#16403)
    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
    Co-authored-by: Nick Hill <nhill@redhat.com>

99ef59cf7f | 2025-04-10 21:26:07 -07:00 | Yong Hoon Shin
    [Llama4] Enable attention temperature tuning by default for long context (>32k) (#16439)
    Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
    Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>

3cc9af88ff | 2025-04-10 17:05:44 -04:00 | Nicolò Lucchesi
    [TPU][V1] Disable per-request seed/Generator (#16172)
    Signed-off-by: NickLucche <nlucches@redhat.com>

56d4aefa33 | 2025-04-10 19:32:14 +00:00 | Cyrus Leung
    [VLM] Avoid unnecessary dummy multimodal data during processing (#16416)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

dd143ef541 | 2025-04-10 19:23:14 +00:00 | Nick Hill
    [V1] Zero-copy tensor/ndarray serialization/transmission (#13790)
    Signed-off-by: Nick Hill <nhill@redhat.com>

daefed052c | 2025-04-10 19:07:07 +00:00 | Chih-Chieh Yang
    [Model] Reduce redundant computations in mamba2 blocks for Bamba-9B (#15423)
    Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
    Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

e8224f3dca | 2025-04-10 11:21:48 -07:00 | Lily Liu
    [V1][Spec Decode] Eagle Model loading (#16035)
    Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>

9665313c39 | 2025-04-10 17:53:26 +00:00 | Russell Bryant
    [V1] Set structured output backend to auto by default (#15724)
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

0c54fc7273 | 2025-04-10 17:34:37 +00:00 | Harry Mellor
    Improve configs - ParallelConfig (#16332)
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

c1b57855ec | 2025-04-10 17:32:04 +00:00 | Nicolò Lucchesi
    [TPU][V1] Use language_model interface for getting text backbone in MM (#16410)
    Signed-off-by: NickLucche <nlucches@redhat.com>

83b824c8b4 | 2025-04-10 09:06:58 -07:00 | Cyrus Leung
    [VLM] Remove BaseProcessingInfo.get_mm_max_tokens_per_item (#16408)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

7678fcd5b6 | 2025-04-10 07:37:47 -07:00 | Lu Fang
    Fix the torch version parsing logic (#15857)

61de3ef74b | 2025-04-10 09:36:27 +00:00 | Ye (Charlotte) Qi
    [Model] Remove image mm limit for LLaMa4 (#16365)
    Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>

c70cf0fe06 | 2025-04-10 15:08:47 +08:00 | Michael Goin
    [Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models (#16038)
    Signed-off-by: mgoin <mgoin64@gmail.com>

a5d11a54dc | 2025-04-10 14:19:42 +08:00 | Cyrus Leung
    [Bugfix] Fix validation error for text-only Mllama 3.2 (#16377)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

a9bd832fc5 | 2025-04-09 23:01:00 -07:00 | Aaron Ang
    [Model] use AutoWeightsLoader for deepseek_v2, internlm2 (#16383)
    Signed-off-by: Aaron Ang <aaron.angyd@gmail.com>

baada0e737 | 2025-04-10 12:55:12 +08:00 | Michael Goin
    [Bugfix][TPU] Fix TPU validate_request (#16369)
    Signed-off-by: Michael Goin <mgoin64@gmail.com>

82eb61dd4c | 2025-04-09 21:54:54 -07:00 | Benjamin Kitor
    [misc] use tqdm.auto where appropriate (#16290)
    Signed-off-by: Benjamin Kitor <bkitor@gigaio.com>

4aed0ca6a2 | 2025-04-10 04:30:05 +00:00 | Jintao
    [bugfix] Avoid the time consumption caused by creating dummy videos. (#16371)

1621b25288 | 2025-04-10 04:06:16 +00:00 | Chengji Yao
    [TPU] Fix dummy loading OOM (#16372)
    Signed-off-by: Chengji Yao <chengjiyao@google.com>

a564797151 | 2025-04-09 20:07:40 -07:00 | Aaron Ang
    [Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral (#16325)
    Signed-off-by: Aaron Ang <aaron.angyd@gmail.com>

1da6a09274 | 2025-04-09 19:43:09 -07:00 | Guillaume Calmettes
    [Bugfix]: do not shutdown server if skip_special_use=False for MistralTokenizer (#14094)
    Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>

1e44ffc3ff | 2025-04-10 09:19:42 +08:00 | Yuxuan Zhang
    Add GLM-4-0414 support (#16338)
    Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com>
    Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Signed-off-by: yihong0618 <zouzou0208@gmail.com>
    Signed-off-by: Lu Fang <fanglu@fb.com>
    Signed-off-by: Ajay Vohra <ajayvohr@amazon.com>
    Signed-off-by: NickLucche <nlucches@redhat.com>
    Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
    Co-authored-by: Accelerator1996 <lvfei.lv@alibaba-inc.com>
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
    Co-authored-by: yihong <zouzou0208@gmail.com>
    Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
    Co-authored-by: ajayvohra2005 <ajayvohr@amazon.com>
    Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
    Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com>

a454748544 | 2025-04-09 18:51:51 -06:00 | Chengji Yao
    [TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues (#16275)
    Signed-off-by: Chengji Yao <chengjiyao@google.com>

cb391d85dc | 2025-04-09 12:50:01 -07:00 | Joe Runde
    [Hardware] add platform-specific request validation api (#16291)
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

c3b5189137 | 2025-04-09 17:33:24 +00:00 | Guillaume Calmettes
    [Bugfix] catch AssertionError in MistralTokenizer as ValueError (#16344)
    Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>

98d01d3ce2 | 2025-04-09 05:11:10 -07:00 | Guillaume Calmettes
    [Bugfix][Frontend] respect provided default guided decoding backend (#15476)
    Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>

d55244df31 | 2025-04-09 04:12:54 -07:00 | Nicolò Lucchesi
    [Model] Add SupportsMultiModal.get_language_model interface (#16007)
    Signed-off-by: NickLucche <nlucches@redhat.com>

04149cce27 | 2025-04-09 03:43:59 -07:00 | yihong
    [BugFix] fix some typos found by typos. (#16314)
    Signed-off-by: yihong0618 <zouzou0208@gmail.com>

24834f4894 | 2025-04-09 03:43:22 -07:00 | ajayvohra2005
    update neuron config (#16289)
    Signed-off-by: Ajay Vohra <ajayvohr@amazon.com>

ec7da6fcf3 | 2025-04-09 00:59:14 -07:00 | Lucia Fang
    [BugFix] llama4 qknorm should be not shared across head (#16311)
    Signed-off-by: Lu Fang <fanglu@fb.com>

819d548e8a | 2025-04-09 00:59:02 -07:00 | yihong
    [BugFix] logger is not callable (#16312)
    Signed-off-by: yihong0618 <zouzou0208@gmail.com>

e484e02857 | 2025-04-09 00:51:27 -07:00 | Cyrus Leung
    [Bugfix] Avoid transferring cached multi-modal items from P0 to P1 (#16273)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

cb84e45ac7 | 2025-04-08 19:13:22 -07:00 | Russell Bryant
    [Core] Upgrade to xgrammar 0.1.18, add cache size limit (#16283)
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

4716377fbc | 2025-04-08 19:12:51 -07:00 | rongfu.leng
    [Feature] Estimate max-model-len use available KV cache memory (#16168)
    Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>

2976dc27e9 | 2025-04-08 19:12:34 -07:00 | TJian
    [Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs (#16198)
    Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
    Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
    Co-authored-by: Hongxia Yang <hongxia.yang@amd.com>
    Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>

102bf967f0 | 2025-04-08 19:12:17 -07:00 | Chauncey
    [Model] Add smolvlm support (#16017)
    Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

1f4b09b525 | 2025-04-09 01:53:31 +00:00 | yueshen2016
    Add support to modelopt quantization of Mixtral model (#15961)
    Signed-off-by: Yue <yueshen@nvidia.com>

db10422184 | 2025-04-08 16:56:09 -04:00 | Jinzhen Lin
    [Bugfix] fix deepseek fp16 scale bug (#14809)
    Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
    Co-authored-by: mgoin <mgoin64@gmail.com>

e1a2c699dd | 2025-04-08 18:56:51 +00:00 | Lucas Wilkinson
    [BugFix] Fix Llama4 - Index Error When Single Request Near Max Context (#16209)
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>