Lucas Wilkinson
dae6896977
[Perf] Reduce MLA CPU overheads in V1 ( #14384 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-06 19:59:14 -08:00
Brayden Zhong
c34eeec58d
[Bugfix] Correctly call cudaProfilerStop in benchmarks script ( #14183 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-07 00:42:49 +00:00
Daniel Li
ad60bbb2b2
[Doc] Fix a typo ( #14385 )
2025-03-06 16:31:52 -08:00
Chengji Yao
0578e5a462
[Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue ( #14310 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-06 23:31:05 +00:00
Michael Goin
04222984f8
[Docs] Add nsight guide to profiling docs ( #14298 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:19:58 -08:00
Michael Goin
6832707e90
[V1][Bugfix] Standardize quantized kv cache rejection for attention backends ( #14221 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:18:29 -08:00
Michael Goin
6b2ef5cd17
[Bug] Fix Attention when ignored in by quant_method ( #14313 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:18:06 -08:00
Tyler Michael Smith
958adce478
[Bugfix] Fix use_direct_call condition in FusedMoE layer for ( #14382 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 14:17:21 -08:00
Tyler Michael Smith
99b0915d3b
[Kernel] Add needs_fixed_stride_order tag to most GEMMs ( #14306 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 14:17:09 -08:00
Thomas Parnell
8ca2b21c98
[CI] Disable spawn when running V1 Test ( #14345 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-03-06 21:52:46 +00:00
Michael Goin
d9292786e1
[CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa ( #13569 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 16:08:36 -05:00
Tyler Michael Smith
cc2f9b32c8
[Distributed] Add enable_expert_parallel arg ( #14305 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 18:54:45 +00:00
Himanshu Jaju
cd579352bf
[V1] Do not detokenize if sampling param detokenize is False ( #14224 )
...
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-06 10:40:24 -08:00
Ying Zhong
9f1710f1ac
Fix mla prefill context performance ( #13897 )
...
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com>
2025-03-06 09:35:49 -08:00
Thomas Parnell
e642ec962c
Add authors to license header. ( #14371 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2025-03-06 08:43:09 -08:00
Dilip Gowda Bhagavan
ada19210a3
Adding cpu inference with VXE ISA for s390x architecture ( #12613 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com>
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com>
2025-03-06 08:40:53 -08:00
Harry Mellor
bf0560bda9
Reinstate best_of for V0 ( #14356 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-06 08:34:22 -08:00
youkaichao
151b08e0fe
[RLHF] use worker_extension_cls for compatibility with V0 and V1 ( #14185 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-07 00:32:46 +08:00
Jitse Klomp
81b2f4a45f
[Doc] Fix date typo in README.md ( #14366 )
...
Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl>
2025-03-06 08:29:57 -08:00
Cyrus Leung
82551ad616
[Core] Don't use cache during multi-modal profiling ( #14336 )
2025-03-06 08:03:31 -08:00
courage17340
caac5c2e59
[Bugfix][Core] fix abort_seq_group and memory leak when n>1 ( #14326 )
...
Signed-off-by: courage17340 <courage17340@163.com>
2025-03-06 23:59:32 +08:00
Thomas Parnell
6bd1dd9d26
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend ( #14152 )
2025-03-06 07:39:16 -08:00
Irina Yuryeva
4f27044aab
[Doc] Correct beam_search using in generative_models.md ( #14363 )
2025-03-06 15:37:10 +00:00
Yanyi Liu
0ddc991f5c
[Doc] Update reasoning with stream example to use OpenAI library ( #14077 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com>
2025-03-06 13:20:37 +00:00
Nicolò Lucchesi
fa82b93853
[Frontend][Docs] Transcription API streaming ( #13301 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-06 10:39:35 +00:00
Nicolò Lucchesi
69ff99fdcd
[Core] Optimizing cross-attention QKVParallelLinear computation ( #12325 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
2025-03-06 09:37:26 +00:00
lkchen
5d802522a7
[V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 ( #14275 )
...
Signed-off-by: Linkun Chen <github@lkchen.net>
2025-03-06 08:58:41 +00:00
kYLe
1769928079
[Model] Update Paligemma multimodal processing with PromptUpdate ( #14015 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-03-06 08:31:38 +00:00
Pavani Majety
ed6ea06577
[Hardware] Update the flash attn tag to support Blackwell ( #14244 )
2025-03-05 22:01:37 -08:00
Nicolò Lucchesi
5ee10e990d
[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention ( #11301 )
2025-03-05 20:00:53 -08:00
Varun Sundar Rabindranath
3dbd2d813a
[V1] LoRA - Enable more V1 tests ( #14315 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-06 11:55:42 +08:00
Ce Gao
f5f7f00cd9
[Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 ( #14114 )
2025-03-06 03:49:20 +00:00
Rui Qiao
abcc61e0af
[misc] Mention ray list nodes command to troubleshoot ray issues ( #14318 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-06 02:00:36 +00:00
Lucas Wilkinson
f6bb18fd9a
[BugFix] MLA + V1, illegal memory access and accuracy issues ( #14253 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-05 17:10:13 -08:00
Yuan Tang
71eaf8969b
[Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation ( #13850 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-05 17:09:29 -08:00
Michael Goin
ca100c90fe
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM ( #13917 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 17:08:51 -08:00
Russell Bryant
ffad94397d
[CI/Build] Use spawn multiprocessing mode for V1 test pipeline ( #14243 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-05 17:08:02 -08:00
Lucas Wilkinson
4dacaa4a83
[BugFix] Fix prefix caching V0 MLA ( #14255 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com>
2025-03-05 17:07:42 -08:00
Tyler Michael Smith
a7ea35aa67
[Bugfix] Remove num_tokens_across_dp ( #14302 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-05 23:55:55 +00:00
pyc96
1e3e76b6cc
[Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch ( #14237 )
...
Signed-off-by: pyc96 <pychen96@gmail.com>
2025-03-05 22:22:40 +00:00
Lu Fang
53ea6ad830
[V1][Easy] Add empty allowed_token_ids in the v1 sampler test ( #14308 )
...
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 21:41:18 +00:00
Serena
1b7624bf5c
[misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env ( #14267 )
2025-03-05 21:28:50 +00:00
Nick Hill
ac60dc7fe1
[V1][BugFix] Fix for mixed top_k batch ( #14301 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
2025-03-05 20:43:04 +00:00
Vincent
a4f1ee35d6
Deprecate best_of Sampling Parameter in anticipation for vLLM V1 ( #13997 )
...
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-05 20:22:43 +00:00
Nick Hill
a32c8669ca
[V1][Minor] Remove obsolete FIXME comment ( #14304 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-05 11:59:23 -08:00
Simon Mo
ca2ca8de57
[Docs] Add Meta Slides ( #14297 )
...
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-05 08:30:23 -08:00
Isotr0py
f71b00a19e
[Bugfix] Fix broken vision language example ( #14292 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-05 15:57:10 +00:00
DaividFrank
8f808cf86e
prefix_caching.md: Fixed typo ( #14293 )
...
Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai>
2025-03-05 15:43:13 +00:00
Jee Jee Li
7bab4bb048
[Misc] Add Qwen2MoeForCausalLM moe tuning support ( #14276 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-05 23:11:29 +08:00
Isotr0py
e17e4488bd
[LoRA] Remove linear hack outside transformers backend ( #14177 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-05 15:06:28 +00:00