4762 Commits

Author SHA1 Message Date
Nick Hill
cbae7af552
[V1][BugFix] Fix engine core client shutdown hangs (#13298)
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears to be necessary to do this explicitly or else context.term() can hang.

Close ZMQ sockets explicitly before terminating the context, make shutdown of the client resources more robust, and shut down the engine core process prior to terminating the ZMQ context (a minimal sketch of this shutdown order follows this entry).

Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-23 13:07:43 -08:00
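For readers unfamiliar with the hang described in the commit above, here is a minimal pyzmq sketch of the shutdown order it adopts. The endpoint name and message are illustrative assumptions, not vLLM's actual engine-core client code:

```python
# Minimal sketch: close every ZMQ socket explicitly, then terminate the
# context, so ctx.term() cannot block waiting on sockets that are still open.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
# Hypothetical in-process endpoint standing in for the engine-core socket.
sock.bind("inproc://engine_core_example")

try:
    # With no connected peer, a non-blocking send raises zmq.Again; fine for this sketch.
    sock.send(b"request", flags=zmq.NOBLOCK)
except zmq.Again:
    pass

# Shutdown order: LINGER=0 discards unsent messages so close() returns
# immediately, and term() is then safe because no sockets remain open.
sock.close(linger=0)
ctx.term()
```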
youkaichao
eb24dc4a45
[v1] torchrun compatibility (#13642)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-23 22:47:24 +08:00
Roger Wang
9bebc9512f
[Misc] Deprecate --dataset from benchmark_serving.py (#13708)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-23 13:32:20 +00:00
Nick Hill
5a2ba16f5c
[Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms (#13688) 2025-02-23 02:54:29 -08:00
Isotr0py
ba5106e519
[LMM] Implement merged multimodal processor for whisper (#13278) 2025-02-23 01:46:03 -08:00
Kyle Sayers
d5ca2110f1
[Quant] BaiChuan SupportsQuant (#13710) 2025-02-22 19:21:15 -08:00
Kevin H. Luu
2c5e637b57
[ci] Use env var to control whether to use S3 bucket in CI (#13634) 2025-02-22 19:19:45 -08:00
Andy Lo
322d2a27d6
[BugFix] Minor: logger import in attention backend (#13706)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-02-22 16:51:13 -08:00
Roger Wang
82e0d601fc
[CI/Build] Fix pre-commit errors from #13571 (#13709)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-22 16:50:38 -08:00
Daniele
78ac0f591d
[CI/Build] fix uv caching in Dockerfile (#13611) 2025-02-22 08:25:20 -08:00
Yan Ma
b56155e7f3
[XPU] Fix setuptools version for XPU (#13548) 2025-02-22 08:05:35 -08:00
Helena Kloosterman
382f66fb08
[Bugfix] Fix boolean conversion for OpenVINO env variable (#13615) 2025-02-22 08:04:12 -08:00
Cyrus Leung
8354f6640c
[Doc] Dockerfile instructions for optional dependencies and dev transformers (#13699) 2025-02-22 06:04:31 -08:00
Gregory Shtrasberg
c904fdddf6
[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm (#13231) 2025-02-22 05:54:38 -08:00
Sage Moore
558db8083c
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095) 2025-02-22 05:25:41 -08:00
Kaixi Hou
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm (#13571) 2025-02-22 05:24:05 -08:00
Keyun Tong
8db1b9d0a1
Support SSL Key Rotation in HTTP Server (#13495) 2025-02-22 05:17:44 -08:00
youkaichao
2382ad29d1
[ci] fix linter (#13701)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 20:28:59 +08:00
youkaichao
3e472d882a
[core] set up data parallel communication (#13591)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 19:28:59 +08:00
Cyrus Leung
7f6bae561c
[CI/Build] Fix pre-commit errors (#13696) 2025-02-22 00:31:26 -08:00
Jee Jee Li
105b8ce4c0
[Misc] Reduce LoRA-related static variable (#13166) 2025-02-22 00:21:30 -08:00
Mark McLoughlin
2cb8c1540e
[Metrics] Add --show-hidden-metrics-for-version CLI arg (#13295) 2025-02-22 00:20:45 -08:00
Mark McLoughlin
1cd981da4f
[V1][Metrics] Support vllm:cache_config_info (#13299) 2025-02-22 00:20:00 -08:00
Yu Chin Fabian Lim
fca20841c2
Correction to TP logic for Mamba Mixer 2 when Num Groups is not divisible by TP Size (#13660) 2025-02-22 00:19:10 -08:00
Jennifer Zhao
da31b5333e
[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler (#13594)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-02-22 00:08:29 -08:00
Lu Fang
bb78fb318e
[v1] Support allowed_token_ids in v1 Sampler (#13210)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-22 14:13:05 +08:00
Robin
8aca27fa11
[Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len (#13691)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-02-22 14:10:38 +08:00
Dipika Sikka
95c617e04b
[Misc] Bump compressed-tensors (#13619) 2025-02-21 22:09:04 -08:00
Shane A
9a1f1da5d1
[Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA (#13687) 2025-02-21 22:07:45 -08:00
Gordon Wong
68d630a0c7
[ROCm] Fix native attention function call (#13650) 2025-02-21 22:07:04 -08:00
Jun Duan
68d535ef44
[Misc] Capture and log the time of loading weights (#13666) 2025-02-21 22:06:34 -08:00
Robin
c6ed93860f
[Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… (#13672) 2025-02-21 22:05:28 -08:00
Keyun Tong
0ffdf8ce0c
[HTTP Server] Make model param optional in request (#13568) 2025-02-21 21:55:50 -08:00
Yuan Tang
8c0dd3d4df
docs: Add a note on full CI run in contributing guide (#13646) 2025-02-21 21:53:59 -08:00
Isotr0py
ada7c780d5
[Misc] Fix yapf linting tools etc not running on pre-commit (#13695)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-22 13:10:43 +08:00
Lucas Wilkinson
288cc6c234
[Attention] MLA with chunked prefill (#12639)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00
John Zheng
900edbfa48
Fix typo in Grafana dashboard so it uses the correct datasource (#13668)
Signed-off-by: John Zheng <john.zheng@hp.com>
2025-02-21 18:21:05 +00:00
Isotr0py
b2c3fc5d65
[Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation (#13586) 2025-02-20 22:24:17 -08:00
leoneo
839b27c6cc
[Kernel] Add streamK for block-quantized CUTLASS kernels (#12978) 2025-02-20 22:14:24 -08:00
Kevin H. Luu
34ad27fe83
[ci] Fix metrics test model path (#13635) 2025-02-20 22:12:10 -08:00
Gabriel Marinho
1c3c975766
[FEATURE] Enables /score endpoint for embedding models (#12846) 2025-02-20 22:09:47 -08:00
Szymon Ożóg
1cdc88614a
Add missing comment explaining VDR variable in GGUF kernels (#13290) 2025-02-20 22:06:54 -08:00
Nick Hill
31aa045c11
[V1][Sampler] Avoid an operation during temperature application (#13587) 2025-02-20 22:05:56 -08:00
Roger Wang
a30c093502
[Bugfix] Add mm_processor_kwargs to chat-related protocols (#13644) 2025-02-20 22:04:33 -08:00
Harry Mellor
c7b07a95a6
Use pre-commit to update requirements-test.txt (#13617) 2025-02-20 22:03:27 -08:00
Kaixi Hou
27a09dc52c
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant (#13632) 2025-02-20 22:01:48 -08:00
Edwin Hernandez
981f3c831e
[Misc] Add script to set up Ray for multi-node vLLM deployments (#12913) 2025-02-20 21:16:40 -08:00
Kante Yin
44c33f01f3
Add llmaz as another integration (#13643)
Signed-off-by: kerthcet <kerthcet@gmail.com>
2025-02-21 03:52:40 +00:00
Lingfan Yu
33170081f1
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-20 17:45:45 -08:00
Michael Goin
71face8540
[Bugfix] Fix max_num_batched_tokens for MLA (#13620)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-20 17:45:20 -08:00