1788 Commits

Author SHA1 Message Date
Nick Hill
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list (#13244)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-18 12:15:33 -08:00
Murali Andoorveedu
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 (#13308)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2025-02-18 09:53:14 -08:00
Liangfu Chen
3809458456
[Bugfix] Fix invalid rotary embedding unit test (#13431)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-02-18 11:52:03 +00:00
Michael Goin
b53d79983c
Add outlines fallback when JSON schema has enum (#13449)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-18 06:49:41 +00:00
Isotr0py
67ef8f666a
[Model] Enable quantization support for transformers backend (#12960) 2025-02-17 19:52:47 -08:00
Woosuk Kwon
cd4a72a28d
[V1][Spec decode] Move drafter to model runner (#13363)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 15:40:12 -08:00
Woosuk Kwon
4c21ce9eba
[V1] Get input tokens from scheduler (#13339)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 11:01:07 -08:00
Tyler Michael Smith
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) (#9292)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-02-17 20:17:50 +08:00
shangmingc
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-16 19:32:26 -08:00
Cyrus Leung
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests (#13341)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-16 22:20:22 +08:00
youkaichao
124776ebd5
[ci] skip failed tests for flashinfer (#13352)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-16 22:09:15 +08:00
wchen61
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices (#13187)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-02-16 08:59:49 +00:00
Lily Liu
80f63a3966
[V1][Spec Decode] Ngram Spec Decode (#12193)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-15 18:05:11 -08:00
Cody Yu
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue (#13064) 2025-02-15 03:59:01 -08:00
Mark McLoughlin
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288) 2025-02-15 03:56:19 -08:00
Woosuk Kwon
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p (#13316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 17:29:51 -08:00
Aoyu
a12934d3ec
[V1][Core] min_p sampling support (#13191)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
2025-02-14 15:50:05 -08:00
Joe Runde
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills (#10235)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-14 15:36:07 -08:00
Michael Goin
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-14 12:53:42 -08:00
Lu Fang
6224a9f620
Support logit_bias in v1 Sampler (#13079) 2025-02-14 04:34:59 -08:00
Alexander Matveev
45f90bcbba
[WIP] TPU V1 Support Refactored (#13049) 2025-02-14 00:21:53 -08:00
Kero Liang
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126) 2025-02-13 22:39:20 -08:00
Varun Sundar Rabindranath
cbc40128eb
[V1] LoRA - Enable Serving Usecase (#12883)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-14 14:21:12 +08:00
Harry Mellor
f2b20fe491
Consolidate Llama model usage in tests (#13094) 2025-02-13 22:18:03 -08:00
Tyler Michael Smith
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250) 2025-02-13 20:19:25 -08:00
Tyler Michael Smith
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-14 00:01:14 +00:00
Nicolò Lucchesi
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909) 2025-02-13 07:23:45 -08:00
Vaibhav Jain
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193) 2025-02-13 06:52:22 -08:00
Cyrus Leung
1bc3b5e71b
[VLM] Separate text-only and vision variants of the same model architecture (#13157) 2025-02-13 06:19:15 -08:00
Cyrus Leung
c9d3ecf016
[VLM] Merged multi-modal processor for Molmo (#12966) 2025-02-13 04:34:00 -08:00
Rui Qiao
9605c1256e
[V1][core] Implement pipeline parallel on Ray (#12996) 2025-02-13 08:02:46 +00:00
LikeSundayLikeRain
04f50ad9d1
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (#13097) 2025-02-12 23:11:26 -08:00
Isotr0py
bc55d13070
[VLM] Implement merged multimodal processor for Mllama (#11427) 2025-02-12 20:26:21 -08:00
Kaixi Hou
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization (#12784) 2025-02-12 19:51:51 -08:00
Michael Goin
14b7899d10
[CI] Fix failing FP8 cpu offload test (#13170)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-12 19:16:06 +00:00
Qubitium-ModelCloud
36a08630e8
[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control (#7086) 2025-02-12 09:19:43 -08:00
Jee Jee Li
82cabf53a3
[Misc] Delete unused LoRA modules (#13151) 2025-02-12 08:58:24 -08:00
Rafael Vasquez
314cfade02
[Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral (#12332) 2025-02-12 08:29:56 -08:00
Lingfan Yu
e92694b6fe
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-11 21:12:37 -08:00
Christian Pinto
974dfd4971
[Model] IBM/NASA Prithvi Geospatial model (#12830) 2025-02-11 20:34:30 -08:00
Keyun Tong
3ee696a63d
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (#12518)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
2025-02-12 12:25:58 +08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
6c4dbe23eb
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (#12962)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-02-12 00:21:50 +08:00
Mark McLoughlin
75e6e14516
[V1][Metrics] Add several request timing histograms (#12644)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-11 10:14:00 -05:00
மனோஜ்குமார் பழனிச்சாமி
110f59a33e
[Bugfix] fix flaky test (#13089)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-02-11 14:41:20 +00:00
Cody Yu
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592) 2025-02-11 08:27:25 +00:00
Ce Gao
fc6485d277
[Bugfix]: Reasoning output bug according to the chat template change (#13025)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-02-11 15:49:03 +08:00
Varun Sundar Rabindranath
78a141d768
[Misc] LoRA - Refactor Punica ops tests (#12970)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-11 07:26:03 +00:00
Florian Greinacher
cb080f32e3
[Bugfix] Support missing tool parameters in mistral tokenizer (#12884)
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com>
2025-02-11 03:33:33 +00:00
Farzad Abdolhosseini
08b2d845d6
[Model] Ultravox Model: Support v0.5 Release (#12912)
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>
2025-02-10 22:02:48 +00:00
மனோஜ்குமார் பழனிச்சாமி
2ae889052c
Fix seed parameter behavior in vLLM (#13007)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-02-10 23:26:50 +08:00