xinyun/vllm - vllm - 丝路新云 Code Repository

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-05-29 02:07:04 +08:00

Author	SHA1	Message	Date
Woosuk Kwon	cb40b3ab6b	[Kernel] Add MoE Triton kernel configs for A100 40GB (#3700 )	2024-03-28 15:26:24 -07:00
Roger Wang	ce567a2926	[Kernel] DBRX Triton MoE kernel H100 (#3692 )	2024-03-28 10:05:34 -07:00
Woosuk Kwon	8267b06c30	[Kernel] Add Triton MoE kernel configs for DBRX on A100 (#3679 )	2024-03-27 22:22:25 -07:00
Antoni Baum	3a243095e5	Optimize `_get_ranks` in Sampler (#3623 )	2024-03-25 16:03:02 -07:00
Travis Johnson	c13ad1b7bd	feat: implement the min_tokens sampling parameter (#3124 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-03-25 10:14:26 -07:00
Swapnil Parekh	819924e749	[Core] Adding token ranks along with logprobs (#3516 ) Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>	2024-03-25 10:13:10 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00
Woosuk Kwon	925f3332ca	[Core] Refactor Attention Take 2 (#3462 )	2024-03-25 04:39:33 +00:00
Kunshang Ji	6d93d35308	[BugFix] tensor.get_device() -> tensor.device (#3604 )	2024-03-24 19:01:13 -07:00
Zhuohan Li	e90fc21f2e	[Hardware][Neuron] Refactor neuron support (#3471 )	2024-03-22 01:22:17 +00:00
SangBin Cho	3bbff9e5ab	Fix 1D query issue from `_prune_hidden_states` (#3539 )	2024-03-21 08:49:06 +00:00
Roy	f1c0fc3919	Migrate `logits` computation and gather to `model_runner` (#3233 )	2024-03-20 23:25:01 +00:00
SangBin Cho	6e435de766	[1/n][Chunked Prefill] Refactor input query shapes (#3236 )	2024-03-20 14:46:05 -07:00
Antoni Baum	426ec4ec67	[1/n] Triton sampling kernel (#3186 ) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-03-20 14:45:08 -07:00
Enrique Shockwave	b983ba35bd	fix marlin config repr (#3414 )	2024-03-14 16:26:19 -07:00
youkaichao	8fe8386591	[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389 )	2024-03-14 08:11:48 +00:00
Antoni Baum	c33afd89f5	Fix lint (#3388 )	2024-03-13 13:56:49 -07:00
Terry	7e9bd08f60	Add batched RoPE kernel (#3095 )	2024-03-13 13:45:26 -07:00
Hui Liu	ba8dc958a3	[Minor] Fix bias in if to remove ambiguity (#3259 )	2024-03-13 09:16:55 -07:00
Woosuk Kwon	602358f8a8	Add kernel for GeGLU with approximate GELU (#3337 )	2024-03-12 22:06:17 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305 )	2024-03-10 19:49:14 -07:00
Cade Daniel	8437bae6ef	[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103 )	2024-03-08 23:32:46 -08:00
Zhuohan Li	f48c6791b7	[FIX] Fix prefix test error on main (#3286 )	2024-03-08 17:16:14 -08:00
Woosuk Kwon	1cb0cc2975	[FIX] Make `flash_attn` optional (#3269 )	2024-03-08 10:52:20 -08:00
Woosuk Kwon	2daf23ab0c	Separate attention backends (#3005 )	2024-03-07 01:45:50 -08:00
Nick Hill	8999ec3c16	Store `eos_token_id` in `Sequence` for easy access (#3166 )	2024-03-05 15:35:43 -08:00
Antoni Baum	22de45235c	Push logprob generation to LLMEngine (#3065 ) Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-03-04 19:54:06 +00:00
Robert Shaw	c0c2335ce0	Integrate Marlin Kernels for Int4 GPTQ inference (#2497 ) Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com> Co-authored-by: alexm <alexm@neuralmagic.com>	2024-03-01 12:47:51 -08:00
CHU Tianxiang	01a5d18a53	Add Support for 2/3/8-bit GPTQ Quantization Models (#2330 )	2024-02-28 21:52:23 -08:00
Liangfu Chen	3b7178cfa4	[Neuron] Support inference with transformers-neuronx (#2569 )	2024-02-28 09:34:34 -08:00
Tao He	71bcaf99e2	Enable GQA support in the prefix prefill kernels (#3007 ) Signed-off-by: Tao He <sighingnow@gmail.com>	2024-02-27 01:14:31 -08:00
Woosuk Kwon	4bd18ec0c7	[Minor] Fix type annotation in fused moe (#3045 )	2024-02-26 19:44:29 -08:00
Philipp Moritz	cfc15a1031	Optimize Triton MoE Kernel (#2979 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-02-26 13:48:56 -08:00
Woosuk Kwon	f7c1234990	[Fix] Fix assertion on YaRN model len (#2984 )	2024-02-23 12:57:48 -08:00
44670	c530e2cfe3	[FIX] Fix a bug in initializing Yarn RoPE (#2983 )	2024-02-22 01:40:05 -08:00
Woosuk Kwon	fd5dcc5c81	Optimize GeGLU layer in Gemma (#2975 )	2024-02-21 20:17:52 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820 )	2024-02-21 18:56:01 -08:00
Nick Hill	7d2dcce175	Support per-request seed (#2514 )	2024-02-21 11:47:00 -08:00
shiyi.c_98	64da65b322	Prefix Caching- fix t4 triton error (#2517 )	2024-02-16 14:17:55 -08:00
Philipp Moritz	ea356004d4	Revert "Refactor llama family models (#2637 )" (#2851 ) This reverts commit 5c976a7e1a1bec875bf6474824b7dff39e38de18.	2024-02-13 09:24:59 -08:00
Roy	5c976a7e1a	Refactor llama family models (#2637 )	2024-02-13 00:09:23 -08:00
Rex	563836496a	Refactor 2 awq gemm kernels into m16nXk32 (#2723 ) Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>	2024-02-12 11:02:17 -08:00
Hongxia Yang	0580aab02f	[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768 )	2024-02-10 23:14:37 -08:00
Woosuk Kwon	f0d4e14557	Add fused top-K softmax kernel for MoE (#2769 )	2024-02-05 17:38:02 -08:00
Kunshang Ji	96b6f475dd	Remove hardcoded `device="cuda"` to support more devices (#2503 ) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>	2024-02-01 15:46:39 -08:00
Philipp Moritz	d0d93b92b1	Add unit test for Mixtral MoE layer (#2677 )	2024-01-31 14:34:17 -08:00
wangding zeng	5d60def02c	DeepseekMoE support with Fused MoE kernel (#2453 ) Co-authored-by: roy <jasonailu87@gmail.com>	2024-01-29 21:19:48 -08:00
zhaoyang-star	9090bf02e7	Support FP8-E5M2 KV Cache (#2279 ) Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-01-28 16:43:54 -08:00
Casper	beb89f68b4	AWQ: Up to 2.66x higher throughput (#2566 )	2024-01-26 23:53:17 -08:00
Antoni Baum	9b945daaf1	[Experimental] Add multi-LoRA support (#1804 ) Co-authored-by: Chen Shen <scv119@gmail.com> Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com> Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-01-23 15:26:37 -08:00

1274 Commits