Michael Goin
|
fbb5bd4cef
|
[TPU] Add example for profiling TPU inference (#12531)
Signed-off-by: mgoin <mgoin@redhat.com>
|
2025-01-29 03:16:47 +00:00 |
|
fenghuizhang
|
80fcc3ed1c
|
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482)
Signed-off-by: Fenghui Zhang <fhzhang@google.com>
|
2025-01-28 22:36:44 +00:00 |
|
Mark McLoughlin
|
c386c43ca3
|
[V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-28 22:07:22 +00:00 |
|
Harry Mellor
|
f26d790718
|
Do not run suggestion pre-commit hook multiple times (#12521)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-01-28 20:05:27 +00:00 |
|
Michael Goin
|
0f657bdc52
|
Replace missed warning_once for rerank API (#12472)
Signed-off-by: mgoin <michael@neuralmagic.com>
|
2025-01-28 19:06:32 +00:00 |
|
Mark McLoughlin
|
3fd1fb63ef
|
[V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-28 16:38:38 +00:00 |
|
Jun Duan
|
925d2f1908
|
[Doc] Fix typo for x86 CPU installation (#12514)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
|
2025-01-28 16:37:10 +00:00 |
|
Cyrus Leung
|
8f58a51358
|
[VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-28 16:25:05 +00:00 |
|
Sebastian Schoennenbeck
|
2079e43bee
|
[Core] Make raw_request optional in ServingCompletion (#12503)
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
|
2025-01-28 10:56:45 +00:00 |
|
Robert Shaw
|
e29d4358ef
|
[V1] Include Engine Version in Logs (#12496)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
|
2025-01-28 08:27:41 +00:00 |
|
Roger Wang
|
8cbc424975
|
Update README.md with V1 alpha release (#12495)
|
2025-01-28 08:22:41 +00:00 |
|
Mengqing Cao
|
dd66fd2b01
|
[CI] fix pre-commit error (#12494)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
|
2025-01-28 06:11:05 +00:00 |
|
Gabriel Marinho
|
0f465ab533
|
[FEATURE] Enables offline /score for embedding models (#12021)
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
|
2025-01-28 11:30:13 +08:00 |
|
Hossein Sarshar
|
23a7cbc88b
|
[CI/Build] Fixed the xla nightly issue report in #12451 (#12453)
|
2025-01-28 11:18:07 +08:00 |
|
Michael Goin
|
426a5c3625
|
Fix bad path in prometheus example (#12481)
Signed-off-by: mgoin <michael@neuralmagic.com>
|
2025-01-27 18:56:31 -07:00 |
|
Liangfu Chen
|
ddee88d0ff
|
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Jiangfei Duan <jfduan@outlook.com>
|
2025-01-27 17:31:16 -08:00 |
|
Harry Mellor
|
823ab79633
|
Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-01-27 17:23:08 -07:00 |
|
Nicolò Lucchesi
|
6116ca8cd7
|
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
|
2025-01-27 13:38:35 -08:00 |
|
Bowen Wang
|
2bc3fbba0c
|
[FlashInfer] Upgrade to 0.2.0 (#11194)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-01-27 18:19:24 +00:00 |
|
Woosuk Kwon
|
3f1fc7425a
|
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-01-27 09:40:04 -08:00 |
|
Mark McLoughlin
|
01ba927040
|
[V1][Metrics] Add initial Prometheus logger (#12416)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-27 12:26:28 -05:00 |
|
Lucas Wilkinson
|
103bd17ac5
|
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2025-01-27 10:40:00 -05:00 |
|
Isotr0py
|
ce69f7f754
|
[Bugfix] Fix gpt2 GGUF inference (#12467)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2025-01-27 18:31:49 +08:00 |
|
Woosuk Kwon
|
624a1e4711
|
[V1][Minor] Minor optimizations for update_from_output (#12454)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-01-27 01:09:27 -08:00 |
|
Isotr0py
|
372bf0890b
|
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2025-01-27 07:25:30 +00:00 |
|
Cyrus Leung
|
5204ff5c3f
|
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Tag: v0.7.0
|
2025-01-26 21:26:44 -08:00 |
|
Pooya Davoodi
|
0cc6b383d7
|
[Frontend] Support scores endpoint in run_batch (#12430)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
|
2025-01-27 04:30:17 +00:00 |
|
Woosuk Kwon
|
28e0750847
|
[V1] Avoid list creation in input preparation (#12457)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-01-26 19:57:56 -08:00 |
|
Yuan Tang
|
582cf78798
|
[DOC] Add link to vLLM blog (#12460)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
|
2025-01-27 03:46:19 +00:00 |
|
Kyle Mistele
|
0034b09ceb
|
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
Signed-off-by: Kyle Mistele <kyle@mistele.com>
|
2025-01-26 19:58:45 -07:00 |
|
Tyler Michael Smith
|
72bac73067
|
[Build/CI] Fix libcuda.so linkage (#12424)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2025-01-26 21:18:19 +00:00 |
|
Lucas Wilkinson
|
68f11149d8
|
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2025-01-26 11:09:34 -08:00 |
|
Tyler Michael Smith
|
72f4880425
|
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
|
2025-01-26 10:39:03 -08:00 |
|
Tyler Michael Smith
|
aa2cd2c43d
|
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
|
2025-01-26 19:59:58 +08:00 |
|
Matthew Hendrey
|
9ddc35220b
|
[Frontend] generation_config.json for maximum tokens (#12242)
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2025-01-26 19:59:25 +08:00 |
|
Roger Wang
|
a5255270c3
|
[Misc] Revert FA on ViT #12355 and #12435 (#12445)
|
2025-01-26 03:56:34 -08:00 |
|
Roger Wang
|
0ee349b553
|
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
Signed-off-by: Roger Wang <ywang@roblox.com>
|
2025-01-26 00:47:42 -08:00 |
|
Keyun Tong
|
fa63e710c7
|
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
|
2025-01-26 00:42:37 -08:00 |
|
Roger Wang
|
2a0309a646
|
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
|
2025-01-26 05:00:31 +00:00 |
|
Siyuan Liu
|
324960a95c
|
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
|
2025-01-25 07:23:03 +00:00 |
|
Isotr0py
|
f1fc0510df
|
[Misc] Add FA2 support to ViT MHA layer (#12355)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2025-01-25 15:07:35 +08:00 |
|
Divakar Verma
|
bf21481dde
|
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
|
2025-01-25 12:17:19 +08:00 |
|
Cyrus Leung
|
fb30ee92ee
|
[Bugfix] Fix BLIP-2 processing (#12412)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-25 11:42:42 +08:00 |
|
ElizaWszola
|
221d388cc5
|
[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
|
2025-01-25 01:49:28 +00:00 |
|
Lucas Wilkinson
|
3132a933b6
|
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2025-01-24 20:20:59 +00:00 |
|
Cyrus Leung
|
df5dafaa5b
|
[Misc] Remove deprecated code (#12383)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-24 14:45:20 -05:00 |
|
Lucas Wilkinson
|
ab5bbf5ae3
|
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2025-01-24 15:27:59 +00:00 |
|
Junichi Sato
|
3bb8e2c9a2
|
[Misc] Enable proxy support in benchmark script (#12356)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
|
2025-01-24 14:58:26 +00:00 |
|
youkaichao
|
e784c6b998
|
[ci/build] sync default value for wheel size (#12398)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-01-24 17:54:29 +08:00 |
|
Mohit Deopujari
|
9a0f3bdbe5
|
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
|
2025-01-24 09:43:49 +00:00 |
|