Woosuk Kwon | 73001445fb | 2025-01-01 21:56:46 +09:00
  [V1] Implement Cascade Attention (#11635)
  Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

youkaichao | b12e87f942 | 2024-12-30 20:24:45 +08:00
  [platforms] enable platform plugins (#11602)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

Michael Goin | 2072924d14 | 2024-12-26 15:33:30 -08:00
  [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523)
  Signed-off-by: mgoin <michael@neuralmagic.com>
  Signed-off-by: simon-mo <simon.mo@hey.com>
  Signed-off-by: simon-mo <xmo@berkeley.edu>
  Co-authored-by: simon-mo <simon.mo@hey.com>
  Co-authored-by: simon-mo <xmo@berkeley.edu>
  Co-authored-by: HandH1998 <1335248067@qq.com>

Tyler Michael Smith | 5a9da2e6e9 | 2024-12-19 02:43:30 +00:00
  [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
  Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

Dipika Sikka | 60508ffda9 | 2024-12-18 09:57:16 -05:00
  [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
  Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
  Co-authored-by: ilmarkov <markovilya197@gmail.com>
  Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
  Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

Luka Govedič | 30870b4f66 | 2024-12-13 03:19:23 +00:00
  [torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
  Signed-off-by: luka <luka@neuralmagic.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

zhou fan | 78029b34ed | 2024-12-08 01:21:18 +08:00
  [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928)
  Signed-off-by: xffxff <1247714429@qq.com>

Woosuk Kwon | 073a4bd1c0 | 2024-12-01 17:55:39 -08:00
  [Kernel] Use out arg in flash_attn_varlen_func (#10811)
  Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Wallas Henrique | c27df94e1f | 2024-11-25 12:23:32 -05:00
  [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
  Signed-off-by: Wallas Santos <wallashss@ibm.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>

youkaichao | 05d1f8c9c6 | 2024-11-25 09:27:30 +00:00
  [misc] move functions to config.py (#10624)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

youkaichao | eebad39f26 | 2024-11-22 14:04:42 -08:00
  [torch.compile] support all attention backends (#10558)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

Lucas Wilkinson | d200972e7f | 2024-11-19 19:40:33 -08:00
  [Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
  Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

ElizaWszola | b00b33d77e | 2024-11-19 13:31:12 -08:00
  [Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
  Signed-off-by: ElizaWszola <eliza@neuralmagic.com>

Mengqing Cao | 8c1fb50705 | 2024-11-19 11:22:26 +08:00
  [Platform][Refactor] Extract func get_default_attn_backend to Platform (#10358)
  Signed-off-by: Mengqing Cao <cmq0113@163.com>

Lucas Wilkinson | 96d999fbe8 | 2024-11-18 12:59:29 -07:00
  [Kernel] Initial Machete W4A8 support + Refactors (#9855)
  Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

ElizaWszola | 79ee45b428 | 2024-11-15 16:31:18 +00:00
  [Misc] Bump up test_fused_moe tolerance (#10364)
  Signed-off-by: ElizaWszola <eliza@neuralmagic.com>

Luka Govedič | bf2ddc6610 | 2024-11-15 09:35:11 +08:00
  [bugfix] Fix static asymmetric quantization case (#10334)
  Signed-off-by: Daniël de Kok <me@danieldk.eu>
  Signed-off-by: luka <luka@neuralmagic.com>
  Co-authored-by: Daniël de Kok <me@danieldk.eu>

rasmith | 127c07480e | 2024-11-08 19:59:22 -05:00
  [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
  Signed-off-by: Randall Smith <Randall.Smith@amd.com>

Luka Govedič | 4f93dfe952 | 2024-11-08 21:20:08 +00:00
  [torch.compile] Fuse RMSNorm with quant (#9138)
  Signed-off-by: luka <luka@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@126.com>

Joe Runde | d58268c56a | 2024-11-06 11:57:35 -08:00
  [V1] Make v1 more testable (#9888)
  Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

Aaron Pham | 21063c11c7 | 2024-11-06 07:11:55 +00:00
  [CI/Build] drop support for Python 3.8 EOL (#8464)
  Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

Michael Goin | 235366fe2e | 2024-11-05 16:02:32 -05:00
  [CI] Prune back the number of tests in tests/kernels/* (#9932)
  Signed-off-by: mgoin <michael@neuralmagic.com>

sroy745 | a78dd3303e | 2024-11-01 23:22:49 -07:00
  [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559)

Peter Salas | 6c0b7f548d | 2024-11-01 16:21:10 -07:00
  [Core][VLM] Add precise multi-modal placeholder tracking (#8346)
  Signed-off-by: Peter Salas <peter@fixie.ai>

Pavani Majety | 598b6d7b07 | 2024-11-01 12:15:05 -07:00
  [Bugfix/Core] Flashinfer k_scale and v_scale (#9861)

Mor Zusman | 9fb12f7848 | 2024-10-31 20:06:25 +00:00
  [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
  Signed-off-by: mzusman <mor.zusmann@gmail.com>

wangshuai09 | 622b7ab955 | 2024-10-29 14:47:44 +00:00
  [Hardware] using current_platform.seed_everything (#9785)
  Signed-off-by: wangshuai09 <391746016@qq.com>

youkaichao | 32176fee73 | 2024-10-27 21:58:04 -07:00
  [torch.compile] support moe models (#9632)
  Signed-off-by: youkaichao <youkaichao@gmail.com>

wangshuai09 | 4e2d95e372 | 2024-10-28 04:07:00 +00:00
  [Hardware][ROCM] using current_platform.is_rocm (#9642)
  Signed-off-by: wangshuai09 <391746016@qq.com>

Mengqing Cao | 5cbdccd151 | 2024-10-26 10:59:06 +00:00
  [Hardware][openvino] is_openvino --> current_platform.is_openvino (#9716)

Charlie Fu | 59449095ab | 2024-10-24 15:37:52 -07:00
  [Performance][Kernel] Fused_moe Performance Improvement (#9384)
  Signed-off-by: charlifu <charlifu@amd.com>

Jee Jee Li | 295a061fb3 | 2024-10-24 16:18:27 +08:00
  [Kernel] add kernel for FATReLU (#9610)
  Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

wangshuai09 | 3ddbe25502 | 2024-10-22 00:50:43 -07:00
  [Hardware][CPU] using current_platform.is_cpu (#9536)

Chen Zhang | 4fa3e33349 | 2024-10-20 10:57:52 -07:00
  [Kernel] Support sliding window in flash attention backend (#9403)

bnellnm | eca2c5f7c0 | 2024-10-17 19:08:34 +00:00
  [Bugfix] Fix support for dimension like integers and ScalarType (#9299)

Mor Zusman | fb60ae9b91 | 2024-10-16 12:12:43 -04:00
  [Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189)

Cyrus Leung | 7e7eae338d | 2024-10-16 13:56:17 +08:00
  [Misc] Standardize RoPE handling for Qwen2-VL (#9250)

Tyler Michael Smith | 7342a7d7f8 | 2024-10-11 15:40:06 +00:00
  [Model] Support Mamba (#6484)

Lucas Wilkinson | a64e7b9407 | 2024-10-10 14:16:17 +08:00
  [Bugfix] Machete garbage results for some models (large K dim) (#9212)

bnellnm | bd37b9fbe2 | 2024-10-08 14:28:12 -07:00
  [Bugfix] Try to handle older versions of pytorch (#9086)

ElizaWszola | 05d686432f | 2024-10-04 12:34:44 -06:00
  [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
  Co-authored-by: Dipika <dipikasikka1@gmail.com>
  Co-authored-by: Dipika Sikka <ds3822@columbia.edu>

youkaichao | 9aaf14c62e | 2024-10-03 12:09:42 -07:00
  [misc] add forward context for attention (#9029)

Mor Zusman | f13a07b1f8 | 2024-09-29 17:35:58 -04:00
  [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533)

ElizaWszola | d081da0064 | 2024-09-28 18:19:40 -07:00
  [Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
  Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

youkaichao | a9b15c606f | 2024-09-27 08:11:32 -07:00
  [torch.compile] use empty tensor instead of None for profiling (#8875)

bnellnm | 300da09177 | 2024-09-25 08:35:52 -06:00
  [Kernel] Fullgraph and opcheck tests (#8479)

Lucas Wilkinson | 86e9c8df29 | 2024-09-23 13:46:26 -04:00
  [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
  Co-authored-by: mgoin <michael@neuralmagic.com>
  Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
  Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

Charlie Fu | 9cc373f390 | 2024-09-19 17:37:57 +00:00
  [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577)

Tyler Michael Smith | db9120cded | 2024-09-18 20:05:06 +00:00
  [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039)

Cyrus Leung | 6ffa3f314c | 2024-09-18 10:38:11 +00:00
  [CI/Build] Avoid CUDA initialization (#8534)