Shu Wang
a3b9c17b56
Support Tensorrt-LLM MoE fp4 for low-latency ( #21331 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: XIn Li <xinli@nvidia.com>
Co-authored-by: XIn Li <xinli@nvidia.com>
2025-08-07 19:18:22 -07:00
Cyrus Leung
139d155781
[Frontend] Use engine argument to control MM cache size ( #22441 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 09:47:10 -07:00
Cyrus Leung
766bc8162c
[Core] Store only the keys for multi-modal data in P0 ( #22198 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 01:45:04 -07:00
Ming Yang
82216dc21f
[Misc] Support routing logic simulation ( #21990 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 23:06:20 -07:00
Lain
9a3835aaa9
Fix trtllm-gen attention env and add attention sink ( #22378 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Lain <fusiyuan2000@hotmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-06 18:07:41 -07:00
Yongye Zhu
31f09c615f
[gpt-oss] flashinfer mxfp4 ( #22339 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-08-06 12:37:27 -07:00
Woosuk Kwon
6e20924350
Add attention sink in attention backends ( #22320 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-05 22:37:21 -07:00
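The two commits above (#22378 and #22320) both wire "attention sinks" into vLLM's attention backends. Conceptually, a sink is an extra per-head logit that always participates in the softmax and absorbs attention mass that would otherwise be forced onto real tokens. The sketch below illustrates only that idea; it is not vLLM's FlashInfer/Triton kernel code, and all names are illustrative.

    # Conceptual sketch of softmax attention with a per-head sink logit.
    # Illustrative only: not vLLM's kernel implementation.
    import torch

    def attention_with_sink(q, k, v, sink_logit):
        # q: [heads, d]   k, v: [heads, seq, d]   sink_logit: [heads]
        scale = q.shape[-1] ** -0.5
        scores = torch.einsum("hd,hsd->hs", q, k) * scale             # [heads, seq]
        scores = torch.cat([sink_logit.unsqueeze(-1), scores], dim=-1)
        probs = torch.softmax(scores, dim=-1)
        # The sink column receives probability mass but contributes no value.
        return torch.einsum("hs,hsd->hd", probs[:, 1:], v)

    h, s, d = 8, 16, 64
    out = attention_with_sink(torch.randn(h, d), torch.randn(h, s, d),
                              torch.randn(h, s, d), torch.zeros(h))
    print(out.shape)  # torch.Size([8, 64])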
Wentao Ye
ae87ddd040
[Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING ( #22199 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-05 09:40:23 -07:00
elvischenv
83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel ( #22095 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-08-05 02:45:34 -07:00
Woosuk Kwon
9af654cc38
[Responses API] Ignore store=True and process the request by default ( #22185 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-04 05:12:48 -07:00
Woosuk Kwon
6d98843b31
[Responses API] Disable response store by default ( #22137 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-03 04:04:21 -07:00
Varun Sundar Rabindranath
a65f46be5e
[Misc] DeepGemmExperts : Avoid JIT generation in the hot-path ( #21955 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-08-01 19:42:03 -07:00
Nicolò Lucchesi
57393715e8
[Misc] VLLM_TARGET_DEVICE.lower() ( #22101 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-01 19:41:40 -07:00
Rui Qiao
d331759488
Introduce RayPPCommunicator for ray-based PP ( #21660 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-08-01 11:50:58 -07:00
Simon Mo
da31f6ad3d
Revert precompile wheel changes ( #22055 )
2025-08-01 08:26:24 +00:00
wenxindongwork
8f0d516715
[TPU] Support Pathways in vLLM ( #21417 )
...
Signed-off-by: wenxindongwork <wenxindong@google.com>
2025-07-30 10:02:12 -07:00
youkaichao
e91d3c9cda
[misc] skip p2p check by default ( #21904 )
2025-07-30 22:05:04 +08:00
Csrayz
b917da442b
Expose PyTorch profiler configuration to environment variables ( #21803 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com>
2025-07-29 19:46:31 -07:00
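The entry above (#21803) moves PyTorch-profiler options behind environment variables. A minimal sketch of the opt-in flow follows; it relies only on the long-standing VLLM_TORCH_PROFILER_DIR switch and the LLM.start_profile()/stop_profile() hooks, and deliberately does not guess the names of the new knobs this commit adds.

    # Minimal profiling sketch; only VLLM_TORCH_PROFILER_DIR is assumed here,
    # not the additional variables introduced by #21803.
    import os

    # Must be set before vLLM is imported so the engine picks it up.
    os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")
    llm.start_profile()                                    # begin a torch.profiler trace
    llm.generate(["Hello"], SamplingParams(max_tokens=8))
    llm.stop_profile()                                     # trace written to the dir above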
Doug Smith
a1873db23d
docker: docker-aware precompiled wheel support ( #21127 )
...
Signed-off-by: dougbtv <dosmith@redhat.com>
2025-07-29 14:45:19 -07:00
Lucas Wilkinson
8aa1485fcf
[Perf] Disable chunked local attention by default with llama4 ( #21761 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-28 18:49:04 -04:00
Chauncey
6da0078523
[Feat] Allow custom naming of vLLM processes ( #21445 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-24 03:15:23 -07:00
deven-labovitch
63d92abb7c
[Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding ( #21374 )
...
Signed-off-by: Deven Labovitch <deven@videa.ai>
2025-07-23 20:22:19 -07:00
Michael Goin
f3137cdd81
[Core] Freeze gc during cuda graph capture to speed up init ( #21146 )
...
Signed-off-by: Codex <codex@openai.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-23 17:20:14 -07:00
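#21146 above speeds up initialization by freezing the Python garbage collector while CUDA graphs are captured, so the collector does not repeatedly scan the large, long-lived object graph built during startup. The general stdlib pattern looks roughly like this (a sketch, not the PR's exact code):

    # General gc-freeze pattern (stdlib only); not vLLM's actual capture code.
    import contextlib
    import gc

    @contextlib.contextmanager
    def frozen_gc():
        gc.collect()     # drop garbage once up front
        gc.freeze()      # move survivors to the permanent generation (not scanned)
        try:
            yield        # e.g. CUDA graph capture happens here
        finally:
            gc.unfreeze()

    with frozen_gc():
        pass  # expensive init / graph capture goes here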
Li, Jiang
a15a50fc17
[CPU] Enable shared-memory based pipeline parallel for CPU backend ( #21289 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-21 09:07:08 -07:00
Li, Jiang
e3a0e43d7f
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code ( #21032 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-19 05:13:55 -07:00
Kaixi Hou
6d0734c562
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency ( #20645 )
...
Signed-off-by: kaixih <kaixih@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-19 02:33:01 -07:00
Shu Wang
c7d8724e78
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) ( #20037 )
...
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-17 21:32:45 -07:00
Woosuk Kwon
4de7146351
[V0 deprecation] Remove V0 HPU backend ( #21131 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-17 16:37:36 -07:00
Elfie Guo
30800b01c2
[Nvidia] Integrate SM100 cudnn prefill API to MLA prefill ( #20411 )
...
Signed-off-by: Elfie Guo <elfieg@nvidia.com>
Co-authored-by: Elfie Guo <eflieg@nvidia.com>
2025-07-15 17:56:45 -07:00
Chen LI
10be209493
[Bug Fix] get_distributed_init_method should get the ip from get_ip i… ( #20889 )
...
Signed-off-by: Chen Li <lcpingping@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-07-15 21:23:52 +00:00
Boyuan Feng
91b3d190ae
[cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir ( #20940 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-07-15 13:02:17 +08:00
Boyuan Feng
c1c8ca57ff
[cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile ( #20790 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-07-11 23:06:13 -07:00
Pavani Majety
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). ( #19825 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-11 09:23:23 +00:00
Nathan Hoos
d6902ce79f
[V0][V1][Core] Add outlines integration for V1, and update V0 integration. ( #15975 )
...
Signed-off-by: Nathan Hoos <thwackyy.y@gmail.com>
2025-07-10 15:30:26 -04:00
fxmarty-amd
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models ( #17888 )
...
Signed-off-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
2025-07-09 13:19:02 -07:00
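#17888 above brings MXFP4 (OCP Microscaling) weight quantization to MoE models: weights are stored as 4-bit E2M1 values, and each block of 32 elements shares a single power-of-two scale. The reference sketch below is for intuition only; real kernels operate on packed tensors, and the scale/rounding choice shown is one simple option, not necessarily the one the PR uses.

    # Intuition-only MXFP4 block quantization; not the vLLM/Quark kernel code.
    import math

    POS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]    # representable E2M1 magnitudes
    GRID = POS + [-x for x in POS[1:]]

    def quantize_mxfp4_block(block):                   # block: 32 floats
        amax = max(abs(x) for x in block) or 1.0
        # One simple choice: a power-of-two scale so the block max fits under 6.0.
        scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
        q = [min(GRID, key=lambda g: abs(x / scale - g)) for x in block]
        return scale, q                                # dequantize: scale * q[i]

    scale, q = quantize_mxfp4_block([0.01 * i for i in range(32)])
    print(scale, q[:4])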
Nicolò Lucchesi
71d1d75b7a
[PD][Nixl] Remote consumer READ timeout for clearing request blocks ( #20139 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-07-08 08:56:40 +01:00
Wentao Ye
9dae7d46bf
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON ( #20334 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:43 -07:00
Li, Jiang
6cc1e7d96d
[CPU] Update custom ops for the CPU backend ( #20255 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-01 07:25:03 +00:00
li haoyang
0740e29b66
[Feature] add quick all reduce ( #19744 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com>
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Co-authored-by: ilmarkov <imarkov@redhat.com>
2025-06-26 20:54:24 -07:00
Chenyaaang
2d7620c3eb
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN ( #19919 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-06-25 15:51:02 -07:00
Dipika Sikka
02c97d9a92
[Quantization] Add compressed-tensors emulations support for NVFP4 ( #19879 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-06-25 14:28:19 -04:00
bnellnm
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. ( #19717 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-24 23:22:58 -07:00
jinqinn
f39ab2d4bd
[Misc] Configurable timeout for execute_model RPC calls via env var ( #19544 )
...
Signed-off-by: jinqinn <goodqinjin@163.com>
2025-06-22 20:36:26 -07:00
Ye (Charlotte) Qi
33d51f599e
[BugFix] Add an env to disable moe chunking to work around compile incompatibility ( #19642 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-06-22 15:17:49 -07:00
Vlad Tiberiu Mihailescu
2e3e3c86dc
Export NaNs in logits to scheduler_stats if output is corrupted ( #18777 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com>
2025-06-20 22:47:16 +08:00
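#18777 above makes corrupted outputs observable: when a request's logits contain NaNs, the count is exported through scheduler_stats instead of failing silently. The detection itself reduces to a torch one-liner; the sketch below shows only that check, not the stats plumbing.

    # Detection sketch only; vLLM reports the count via scheduler_stats.
    import torch

    def count_nan_requests(logits: torch.Tensor) -> int:
        """logits: [num_requests, vocab_size]; number of rows containing any NaN."""
        return int(torch.isnan(logits).any(dim=-1).sum().item())

    logits = torch.randn(4, 8)
    logits[2, 3] = float("nan")
    print(count_nan_requests(logits))  # 1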
Zzz9990
8b6e1d639c
[Hardware][AMD] integrate aiter chunked prefill into vllm ( #18596 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: charlifu <charlifu@amd.com>
Co-authored-by: fsx950223 <fsx950223@outlook.com>
Co-authored-by: charlifu <charlifu@amd.com>
2025-06-18 08:46:51 -07:00
Nicolò Lucchesi
4c8f64faa7
[V1][Kernel] Flashinfer HND KV cache layout ( #19280 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-06-17 09:09:22 -04:00
Varun Sundar Rabindranath
9d880f594d
[Misc] Turn MOE_DP_CHUNK_SIZE into an env var ( #19506 )
2025-06-12 18:01:16 +00:00
Luka Govedič
f98548b9da
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass ( #16756 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2025-06-12 08:31:04 -07:00
Louie Tsai
9368cc90b2
Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. ( #17930 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
2025-06-10 06:22:05 +00:00
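#17930 above teaches the CPU backend to bind each rank's OpenMP threads to the CPU ids of a single NUMA node automatically. Explicit pinning remains available through the documented VLLM_CPU_OMP_THREADS_BIND variable; the sketch below assumes the per-rank "ranges separated by |" value format from the CPU-backend docs and a CPU build of vLLM.

    # Explicit CPU thread pinning sketch; with #17930 the backend chooses a
    # NUMA-local core set on its own when this variable is left unset.
    import os

    # Two ranks, 16 cores each; per-rank ranges separated by "|".
    os.environ["VLLM_CPU_OMP_THREADS_BIND"] = "0-15|16-31"

    from vllm import LLM

    llm = LLM(model="facebook/opt-125m")  # assumes a CPU build of vLLM
    print(llm.generate(["Hello"])[0].outputs[0].text)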