vllm/csrc at 655a09f6538e6b09af23771dcc4fcebd72a15b23 - vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-07-10 10:47:09 +08:00

[Compile] Fix Compile Warning SM100 Cutlass MLA (#23287)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

2025-08-21 03:09:39 +00:00

attention

[Compile] Fix Compile Warning SM100 Cutlass MLA (#23287)

2025-08-21 03:09:39 +00:00

core

[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel (#22428)

2025-08-14 11:23:22 -07:00

cpu

[CPU] Refactor CPU W8A8 scaled_mm (#23071)

2025-08-21 09:34:24 +08:00

cutlass_extensions

[Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000) (#22131)

2025-08-07 19:18:28 -07:00

mamba/mamba_ssm

[v1] - Mamba1 Attention Metadata (#21249)

2025-08-06 17:03:42 -07:00

moe

[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute (#23045)

2025-08-20 10:35:26 -04:00

quantization

[Kernel/Quant] Remove the original marlin format and qqq (#23204)

2025-08-20 15:13:36 -04:00

quickreduce

[Feature] add quick all reduce (#19744)

2025-06-26 20:54:24 -07:00

rocm

[ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel (#22097)

2025-08-08 23:15:06 -07:00

sparse/cutlass

[feat]: CUTLASS block scaled group gemm for SM100 (#19757)

2025-07-04 12:58:04 -06:00

activation_kernels.cu

[Kernel] Add cuda kernel for gpt_oss activation (#22951)

2025-08-17 05:03:24 +00:00

cache_kernels.cu

[Perf] Optimize reshape_and_cache_flash CUDA Kernel (#22036)

2025-08-01 19:18:51 -04:00

cache.h

[Attention] MLA with chunked prefill (#12639)

2025-02-21 15:30:12 -08:00

cuda_compat.h

[Bugfix][ROCm] Fix for warp_size uses on host (#21205)

2025-07-24 00:37:19 -07:00

cuda_utils_kernels.cu

[NVIDIA] Support nvfp4 quantization (#12784)

2025-02-12 19:51:51 -08:00

cuda_utils.h

[Attention] MLA with chunked prefill (#12639)

2025-02-21 15:30:12 -08:00

cuda_view.cu

[V1] Fully Transparent Implementation of CPU Offloading (#15354)

2025-03-31 20:22:34 +08:00

cumem_allocator.cpp

[core] improve error handling when wake up from sleep mode (#12981)

2025-02-10 09:38:57 +08:00

custom_all_reduce_test.cu

[Distributed] Add custom allreduce support for ROCM (#14125)

2025-03-31 22:49:12 -07:00

custom_all_reduce.cu

[Distributed] Add custom allreduce support for ROCM (#14125)

2025-03-31 22:49:12 -07:00

custom_all_reduce.cuh

fix: spelling (#16466)

2025-04-11 23:24:22 -07:00

custom_quickreduce.cu

[Feature] add quick all reduce (#19744)

2025-06-26 20:54:24 -07:00

dispatch_utils.h

Modularize fused experts and integrate PPLX kernels (#15956)

2025-05-14 13:11:54 -07:00

layernorm_kernels.cu

[perf] Add fused MLA QKV + strided layernorm (#21116)

2025-07-22 07:07:44 -07:00

layernorm_quant_kernels.cu

[perf] Add fused MLA QKV + strided layernorm (#21116)

2025-07-22 07:07:44 -07:00

ops.h

[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute (#23045)

2025-08-20 10:35:26 -04:00

permute_cols.cu

[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)

2024-09-23 13:46:26 -04:00

pos_encoding_kernels.cu

[Kernel] Have rotary embeddings support tensors (#18046)

2025-05-14 15:43:55 -07:00

sampler.cu

[BUG] Fix #20484. Support empty sequence in cuda penalty kernel (#20491)

2025-07-05 19:38:02 -07:00

torch_bindings.cpp

[Kernel/Quant] Remove the original marlin format and qqq (#23204)

2025-08-20 15:13:36 -04:00

type_convert.cuh

[torch.compile] Fuse RMSNorm with quant (#9138)

2024-11-08 21:20:08 +00:00