Harry Mellor
cdc1fa12eb
Remove unused kwargs from model definitions ( #13555 )
2025-02-24 17:13:52 -08:00
afeldman-nm
befc402d34
[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) ( #10980 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-02-24 08:29:41 -08:00
Jongseok Park
781096e385
Expert Parallelism (EP) Support for DeepSeek V2 ( #12583 )
2025-02-24 07:33:20 -08:00
Kevin H. Luu
f90a375593
[ci] Add logic to change model to S3 path only when S3 CI env var is on ( #13727 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-63-253.us-west-2.compute.internal>
2025-02-24 06:32:11 +00:00
Nick Hill
cbae7af552
[V1][BugFix] Fix engine core client shutdown hangs ( #13298 )
...
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears to be necessary to do this explicitly or else it can hang in the context.term() method.
Close zmq sockets explicitly before terminating context, make shutdown of client resource more robust, shut down engine core process prior to terminating zmq context.
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-23 13:07:43 -08:00
youkaichao
eb24dc4a45
[v1] torchrun compatibility ( #13642 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-23 22:47:24 +08:00
Isotr0py
ba5106e519
[LMM] Implement merged multimodal processor for whisper ( #13278 )
2025-02-23 01:46:03 -08:00
Kevin H. Luu
2c5e637b57
[ci] Use env var to control whether to use S3 bucket in CI ( #13634 )
2025-02-22 19:19:45 -08:00
Sage Moore
558db8083c
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths ( #13095 )
2025-02-22 05:25:41 -08:00
Kaixi Hou
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm ( #13571 )
2025-02-22 05:24:05 -08:00
Keyun Tong
8db1b9d0a1
Support SSL Key Rotation in HTTP Server ( #13495 )
2025-02-22 05:17:44 -08:00
Jee Jee Li
105b8ce4c0
[Misc] Reduce LoRA-related static variable ( #13166 )
2025-02-22 00:21:30 -08:00
Mark McLoughlin
2cb8c1540e
[Metrics] Add --show-hidden-metrics-for-version CLI arg ( #13295 )
2025-02-22 00:20:45 -08:00
Mark McLoughlin
1cd981da4f
[V1][Metrics] Support vllm:cache_config_info ( #13299 )
2025-02-22 00:20:00 -08:00
Lu Fang
bb78fb318e
[v1] Support allowed_token_ids in v1 Sampler ( #13210 )
...
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-22 14:13:05 +08:00
Keyun Tong
0ffdf8ce0c
[HTTP Server] Make model param optional in request ( #13568 )
2025-02-21 21:55:50 -08:00
Lucas Wilkinson
288cc6c234
[Attention] MLA with chunked prefill ( #12639 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00
Kevin H. Luu
34ad27fe83
[ci] Fix metrics test model path ( #13635 )
2025-02-20 22:12:10 -08:00
Gabriel Marinho
1c3c975766
[FEATURE] Enables /score endpoint for embedding models ( #12846 )
2025-02-20 22:09:47 -08:00
Lingfan Yu
33170081f1
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth ( #13245 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-20 17:45:45 -08:00
Joe Runde
bfbc0b32c6
[Frontend] Add backend-specific options for guided decoding ( #13505 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-20 15:07:58 -05:00
Harry Mellor
992e5c3d34
Merge similar examples in offline_inference into single basic example ( #12737 )
2025-02-20 04:53:51 -08:00
Kevin H. Luu
a64a84433d
[2/n][ci] S3: Use full model path ( #13564 )
...
Signed-off-by: <>
2025-02-20 01:20:15 -08:00
Kevin H. Luu
aa1e62d0db
[ci] Fix spec decode test ( #13600 )
2025-02-20 16:56:00 +08:00
youkaichao
ba81163997
[core] add sleep and wake up endpoint and v1 support ( #12987 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: cennn <2523403608@qq.com>
Co-authored-by: cennn <2523403608@qq.com>
2025-02-20 12:41:17 +08:00
Jee Jee Li
512368e34a
[Misc] Qwen2.5 VL support LoRA ( #13261 )
2025-02-19 18:37:55 -08:00
Kevin H. Luu
473f51cfd9
[3/n][CI] Load Quantization test models with S3 ( #13570 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-20 10:12:30 +08:00
Cyrus Leung
377d10bd14
[VLM][Bugfix] Pass processor kwargs properly on init ( #13516 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-19 13:13:50 +00:00
Yannick Schnider
423330263b
[Feature] Pluggable platform-specific scheduler ( #13161 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
2025-02-19 17:16:38 +08:00
Nick Hill
caf7ff4456
[V1][Core] Generic mechanism for handling engine utility ( #13060 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-19 17:09:22 +08:00
Lucia Fang
f525c0be8b
[Model][Speculative Decoding] DeepSeek MTP spec decode ( #12755 )
...
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-19 17:06:23 +08:00
Alex Brooks
983a40a8bb
[Bugfix] Fix Positive Feature Layers in Llava Models ( #13514 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-02-19 08:50:07 +00:00
Kevin H. Luu
d5d214ac7f
[1/n][CI] Load models in CI from S3 instead of HF ( #13205 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 07:34:59 +00:00
Nick Hill
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list ( #13244 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-18 12:15:33 -08:00
Murali Andoorveedu
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 ( #13308 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2025-02-18 09:53:14 -08:00
Liangfu Chen
3809458456
[Bugfix] Fix invalid rotary embedding unit test ( #13431 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-02-18 11:52:03 +00:00
Michael Goin
b53d79983c
Add outlines fallback when JSON schema has enum ( #13449 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-18 06:49:41 +00:00
Isotr0py
67ef8f666a
[Model] Enable quantization support for transformers backend ( #12960 )
2025-02-17 19:52:47 -08:00
Woosuk Kwon
cd4a72a28d
[V1][Spec decode] Move drafter to model runner ( #13363 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 15:40:12 -08:00
Woosuk Kwon
4c21ce9eba
[V1] Get input tokens from scheduler ( #13339 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 11:01:07 -08:00
Tyler Michael Smith
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) ( #9292 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-02-17 20:17:50 +08:00
shangmingc
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode ( #12304 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-16 19:32:26 -08:00
Cyrus Leung
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests ( #13341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-16 22:20:22 +08:00
youkaichao
124776ebd5
[ci] skip failed tests for flashinfer ( #13352 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-16 22:09:15 +08:00
wchen61
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices ( #13187 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-02-16 08:59:49 +00:00
Lily Liu
80f63a3966
[V1][Spec Decode] Ngram Spec Decode ( #12193 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-15 18:05:11 -08:00
Cody Yu
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue ( #13064 )
2025-02-15 03:59:01 -08:00
Mark McLoughlin
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 ( #13288 )
2025-02-15 03:56:19 -08:00
Woosuk Kwon
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p ( #13316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 17:29:51 -08:00
Aoyu
a12934d3ec
[V1][Core] min_p sampling support ( #13191 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
2025-02-14 15:50:05 -08:00