vllm/csrc at 68a72a5cc1e29198730d1b2471e23675d9b964dd - vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-07-21 14:17:18 +08:00

[Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439)

Signed-off-by: Pavani Majety <pmajety@nvidia.com>

2025-11-07 04:18:39 -08:00

attention

[Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (#28181)

2025-11-06 17:52:13 +08:00

core

[small][batch invariance] Rename the env and internal flags to simplify usage (#26855)

2025-10-16 21:40:25 +00:00

cpu

[cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (#27472)

2025-10-24 15:57:48 +00:00

cutlass_extensions

Update Optional[x] -> x | None and Union[x, y] to x | y (#26633)

2025-10-12 09:51:31 -07:00

mamba/mamba_ssm

[V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377)

2025-11-02 04:16:23 -08:00

moe

[CPU]Improve dynamic 4bit moe performance (#27240)

2025-11-04 06:33:23 +00:00

quantization

[Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439)

2025-11-07 04:18:39 -08:00

quickreduce

[Bugfix][Rocm] fix qr error when different inp shape (#25892)

2025-10-13 10:04:21 -07:00

rocm

[Refactor] Refactor FP8 & INT8 Quant Folder inside w8a8 (#25293)

2025-10-08 10:20:48 -04:00

sparse/cutlass

[feat]: CUTLASS block scaled group gemm for SM100 (#19757)

2025-07-04 12:58:04 -06:00

activation_kernels.cu

[Kernel] Add cuda kernel for gpt_oss activation (#22951)

2025-08-17 05:03:24 +00:00

cache_kernels.cu

[Refactor] Refactor FP8 & INT8 Quant Folder inside w8a8 (#25293)

2025-10-08 10:20:48 -04:00

cache.h

Add gather_indexer_k_quant_cache kernel (#25931)

2025-10-08 04:58:57 +00:00

cub_helpers.h

[Refactor] Refactor FP8 & INT8 Quant Folder inside w8a8 (#25293)

2025-10-08 10:20:48 -04:00

cuda_compat.h

[Bugfix][ROCm] Fix for warp_size uses on host (#21205)

2025-07-24 00:37:19 -07:00

cuda_utils_kernels.cu

[NVIDIA] Support nvfp4 quantization (#12784)

2025-02-12 19:51:51 -08:00

cuda_utils.h

[Attention] MLA with chunked prefill (#12639)

2025-02-21 15:30:12 -08:00

cuda_view.cu

[V1] Fully Transparent Implementation of CPU Offloading (#15354)

2025-03-31 20:22:34 +08:00

cumem_allocator.cpp

[core] improve error handling when wake up from sleep mode (#12981)

2025-02-10 09:38:57 +08:00

custom_all_reduce_test.cu

[Distributed] Add custom allreduce support for ROCM (#14125)

2025-03-31 22:49:12 -07:00

custom_all_reduce.cu

[Distributed] Add custom allreduce support for ROCM (#14125)

2025-03-31 22:49:12 -07:00

custom_all_reduce.cuh

[Kernels] Enable Torch Symmetric Memory All-Reduce By Default (#24111)

2025-09-11 09:45:31 -07:00

custom_quickreduce.cu

[Feature] add quick all reduce (#19744)

2025-06-26 20:54:24 -07:00

dispatch_utils.h

[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files (#23727)

2025-09-04 14:25:45 -07:00

launch_bounds_utils.h

Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (#25843)

2025-09-30 19:18:19 -07:00

layernorm_kernels.cu

[Chore] Remove unused PolyNorm layer (#27110)

2025-10-17 19:03:43 +00:00

layernorm_quant_kernels.cu

[torch.compile] Enable attention and allreduce fusion without custom ops enabled (#24604)

2025-10-17 08:10:23 -06:00

ops.h

[V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377)

2025-11-02 04:16:23 -08:00

permute_cols.cu

[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)

2024-09-23 13:46:26 -04:00

pos_encoding_kernels.cu

[Chore] Remove unused batched RoPE op & kernel (#24789)

2025-09-13 00:08:20 -07:00

sampler.cu

[Deepseek v3.2] Remove extra logics in indexer (#26465)

2025-10-21 23:34:03 +00:00

torch_bindings.cpp

[V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377)

2025-11-02 04:16:23 -08:00

type_convert.cuh

[torch.compile] Fuse RMSNorm with quant (#9138)

2024-11-08 21:20:08 +00:00