Nick Hill
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list (#13244)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-18 12:15:33 -08:00

Murali Andoorveedu
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 (#13308)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2025-02-18 09:53:14 -08:00

Liangfu Chen
3809458456
[Bugfix] Fix invalid rotary embedding unit test (#13431)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-02-18 11:52:03 +00:00

Michael Goin
b53d79983c
Add outlines fallback when JSON schema has enum (#13449)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-18 06:49:41 +00:00

Isotr0py
67ef8f666a
[Model] Enable quantization support for transformers backend (#12960)
2025-02-17 19:52:47 -08:00

Woosuk Kwon
cd4a72a28d
[V1][Spec decode] Move drafter to model runner (#13363)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 15:40:12 -08:00

Woosuk Kwon
4c21ce9eba
[V1] Get input tokens from scheduler (#13339)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 11:01:07 -08:00

Tyler Michael Smith
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) (#9292)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-02-17 20:17:50 +08:00

shangmingc
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-16 19:32:26 -08:00

Cyrus Leung
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests (#13341)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-16 22:20:22 +08:00

youkaichao
124776ebd5
[ci] skip failed tests for flashinfer (#13352)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-16 22:09:15 +08:00

wchen61
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices (#13187)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-02-16 08:59:49 +00:00

Lily Liu
80f63a3966
[V1][Spec Decode] Ngram Spec Decode (#12193)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-15 18:05:11 -08:00

Cody Yu
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue (#13064)
2025-02-15 03:59:01 -08:00

Mark McLoughlin
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288)
2025-02-15 03:56:19 -08:00

Woosuk Kwon
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p (#13316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 17:29:51 -08:00

Aoyu
a12934d3ec
[V1][Core] min_p sampling support (#13191)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
2025-02-14 15:50:05 -08:00

Joe Runde
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills (#10235)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-14 15:36:07 -08:00

Michael Goin
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-14 12:53:42 -08:00

Lu Fang
6224a9f620
Support logit_bias in v1 Sampler (#13079)
2025-02-14 04:34:59 -08:00

Alexander Matveev
45f90bcbba
[WIP] TPU V1 Support Refactored (#13049)
2025-02-14 00:21:53 -08:00

Kero Liang
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126)
2025-02-13 22:39:20 -08:00

Varun Sundar Rabindranath
cbc40128eb
[V1] LoRA - Enable Serving Usecase (#12883)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-14 14:21:12 +08:00

Harry Mellor
f2b20fe491
Consolidate Llama model usage in tests (#13094)
2025-02-13 22:18:03 -08:00

Tyler Michael Smith
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250)
2025-02-13 20:19:25 -08:00

Tyler Michael Smith
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-14 00:01:14 +00:00

Nicolò Lucchesi
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909)
2025-02-13 07:23:45 -08:00

Vaibhav Jain
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193)
2025-02-13 06:52:22 -08:00

Cyrus Leung
1bc3b5e71b
[VLM] Separate text-only and vision variants of the same model architecture (#13157)
2025-02-13 06:19:15 -08:00

Cyrus Leung
c9d3ecf016
[VLM] Merged multi-modal processor for Molmo (#12966)
2025-02-13 04:34:00 -08:00

Rui Qiao
9605c1256e
[V1][core] Implement pipeline parallel on Ray (#12996)
2025-02-13 08:02:46 +00:00

LikeSundayLikeRain
04f50ad9d1
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (#13097)
2025-02-12 23:11:26 -08:00

Isotr0py
bc55d13070
[VLM] Implement merged multimodal processor for Mllama (#11427)
2025-02-12 20:26:21 -08:00

Kaixi Hou
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization (#12784)
2025-02-12 19:51:51 -08:00

Michael Goin
14b7899d10
[CI] Fix failing FP8 cpu offload test (#13170)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-12 19:16:06 +00:00

Qubitium-ModelCloud
36a08630e8
[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control (#7086)
2025-02-12 09:19:43 -08:00

Jee Jee Li
82cabf53a3
[Misc] Delete unused LoRA modules (#13151)
2025-02-12 08:58:24 -08:00

Rafael Vasquez
314cfade02
[Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral (#12332)
2025-02-12 08:29:56 -08:00

Lingfan Yu
e92694b6fe
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-11 21:12:37 -08:00

Christian Pinto
974dfd4971
[Model] IBM/NASA Prithvi Geospatial model (#12830)
2025-02-11 20:34:30 -08:00

Keyun Tong
3ee696a63d
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (#12518)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
2025-02-12 12:25:58 +08:00

ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
6c4dbe23eb
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (#12962)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-02-12 00:21:50 +08:00

Mark McLoughlin
75e6e14516
[V1][Metrics] Add several request timing histograms (#12644)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-11 10:14:00 -05:00

மனோஜ்குமார் பழனிச்சாமி
110f59a33e
[Bugfix] fix flaky test (#13089)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-02-11 14:41:20 +00:00

Cody Yu
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592)
2025-02-11 08:27:25 +00:00

Ce Gao
fc6485d277
[Bugfix]: Reasoning output bug according to the chat template change (#13025)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-02-11 15:49:03 +08:00

Varun Sundar Rabindranath
78a141d768
[Misc] LoRA - Refactor Punica ops tests (#12970)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-11 07:26:03 +00:00

Florian Greinacher
cb080f32e3
[Bugfix] Support missing tool parameters in mistral tokenizer (#12884)
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com>
2025-02-11 03:33:33 +00:00

Farzad Abdolhosseini
08b2d845d6
[Model] Ultravox Model: Support v0.5 Release (#12912)
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>
2025-02-10 22:02:48 +00:00

மனோஜ்குமார் பழனிச்சாமி
2ae889052c
Fix seed parameter behavior in vLLM (#13007)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-02-10 23:26:50 +08:00