4330 Commits

Author SHA1 Message Date
Harry Mellor
823ab79633
Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-27 17:23:08 -07:00
Nicolò Lucchesi
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
2025-01-27 13:38:35 -08:00
Bowen Wang
2bc3fbba0c
[FlashInfer] Upgrade to 0.2.0 (#11194)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-27 18:19:24 +00:00
Woosuk Kwon
3f1fc7425a
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 09:40:04 -08:00
Mark McLoughlin
01ba927040
[V1][Metrics] Add initial Prometheus logger (#12416)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-27 12:26:28 -05:00
Lucas Wilkinson
103bd17ac5
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-27 10:40:00 -05:00
Isotr0py
ce69f7f754
[Bugfix] Fix gpt2 GGUF inference (#12467)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-27 18:31:49 +08:00
Woosuk Kwon
624a1e4711
[V1][Minor] Minor optimizations for update_from_output (#12454)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 01:09:27 -08:00
Isotr0py
372bf0890b
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-27 07:25:30 +00:00
Cyrus Leung
5204ff5c3f
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
v0.7.0
2025-01-26 21:26:44 -08:00
Pooya Davoodi
0cc6b383d7
[Frontend] Support scores endpoint in run_batch (#12430)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2025-01-27 04:30:17 +00:00
Woosuk Kwon
28e0750847
[V1] Avoid list creation in input preparation (#12457)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-26 19:57:56 -08:00
Yuan Tang
582cf78798
[DOC] Add link to vLLM blog (#12460)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-27 03:46:19 +00:00
Kyle Mistele
0034b09ceb
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
Signed-off-by: Kyle Mistele <kyle@mistele.com>
2025-01-26 19:58:45 -07:00
Tyler Michael Smith
72bac73067
[Build/CI] Fix libcuda.so linkage (#12424)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-26 21:18:19 +00:00
Lucas Wilkinson
68f11149d8
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-26 11:09:34 -08:00
Tyler Michael Smith
72f4880425
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450) 2025-01-26 10:39:03 -08:00
Tyler Michael Smith
aa2cd2c43d
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-26 19:59:58 +08:00
Matthew Hendrey
9ddc35220b
[Frontend] generation_config.json for maximum tokens(#12242)
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-26 19:59:25 +08:00
Roger Wang
a5255270c3
[Misc] Revert FA on ViT #12355 and #12435 (#12445) 2025-01-26 03:56:34 -08:00
Roger Wang
0ee349b553
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-26 00:47:42 -08:00
Keyun Tong
fa63e710c7
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
2025-01-26 00:42:37 -08:00
Roger Wang
2a0309a646
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-26 05:00:31 +00:00
Siyuan Liu
324960a95c
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-01-25 07:23:03 +00:00
Isotr0py
f1fc0510df
[Misc] Add FA2 support to ViT MHA layer (#12355)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-25 15:07:35 +08:00
Divakar Verma
bf21481dde
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-25 12:17:19 +08:00
Cyrus Leung
fb30ee92ee
[Bugfix] Fix BLIP-2 processing (#12412)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-25 11:42:42 +08:00
ElizaWszola
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413) 2025-01-25 01:49:28 +00:00
Lucas Wilkinson
3132a933b6
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 20:20:59 +00:00
Cyrus Leung
df5dafaa5b
[Misc] Remove deprecated code (#12383)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-24 14:45:20 -05:00
Lucas Wilkinson
ab5bbf5ae3
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 15:27:59 +00:00
Junichi Sato
3bb8e2c9a2
[Misc] Enable proxy support in benchmark script (#12356)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
2025-01-24 14:58:26 +00:00
youkaichao
e784c6b998
[ci/build] sync default value for wheel size (#12398)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 17:54:29 +08:00
Mohit Deopujari
9a0f3bdbe5
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382) 2025-01-24 09:43:49 +00:00
youkaichao
c7c9851036
[ci/build] fix wheel size check (#12396)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 17:31:25 +08:00
Roger Wang
3c818bdb42
[Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-24 00:22:04 -08:00
youkaichao
6dd94dbe94
[perf] fix perf regression from #12253 (#12380)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 11:34:27 +08:00
Woosuk Kwon
0e74d797ce
[V1] Increase default batch size for H100/H200 (#12369)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-24 03:19:55 +00:00
Dipika Sikka
55ef66edf4
Update compressed-tensors version (#12367) 2025-01-24 11:19:42 +08:00
omer-dayan
5e5630a478
[Bugfix] Path join when building local path for S3 clone (#12353)
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
2025-01-24 11:06:07 +08:00
Russell Bryant
d3d6bb13fb
Set weights_only=True when using torch.load() (#12366)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-24 02:17:30 +00:00
Nick Hill
24b0205f58
[V1][Frontend] Coalesce bunched RequestOutputs (#12298)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2025-01-23 17:17:41 -08:00
Russell Bryant
c5cffcd0cd
[Docs] Update spec decode + structured output in compat matrix (#12373)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-24 01:15:52 +00:00
Woosuk Kwon
682b55bc07
[Docs] Add meetup slides (#12345)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-23 14:10:03 -08:00
Junichi Sato
9726ad676d
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
2025-01-23 17:02:13 -05:00
Dipika Sikka
eb5cb5e528
[BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order (#11528)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-23 21:40:33 +00:00
Isotr0py
2cbeedad09
[Docs] Document Phi-4 support (#12362)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 19:18:51 +00:00
Siyuan Liu
2c85529bfc
[TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-01-23 18:50:16 +00:00
Gregory Shtrasberg
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2025-01-23 18:04:03 +00:00
youkaichao
6e650f56a1
[torch.compile] decouple compile sizes and cudagraph sizes (#12243)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 02:01:30 +08:00