Woosuk Kwon
|
cb40b3ab6b
|
[Kernel] Add MoE Triton kernel configs for A100 40GB (#3700)
|
2024-03-28 15:26:24 -07:00 |
|
Roger Wang
|
ce567a2926
|
[Kernel] DBRX Triton MoE kernel H100 (#3692)
|
2024-03-28 10:05:34 -07:00 |
|
Woosuk Kwon
|
8267b06c30
|
[Kernel] Add Triton MoE kernel configs for DBRX on A100 (#3679)
|
2024-03-27 22:22:25 -07:00 |
|
Antoni Baum
|
3a243095e5
|
Optimize _get_ranks in Sampler (#3623)
|
2024-03-25 16:03:02 -07:00 |
|
Travis Johnson
|
c13ad1b7bd
|
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-03-25 10:14:26 -07:00 |
|
Swapnil Parekh
|
819924e749
|
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
|
2024-03-25 10:13:10 -07:00 |
|
SangBin Cho
|
01bfb22b41
|
[CI] Try introducing isort. (#3495)
|
2024-03-25 07:59:47 -07:00 |
|
Woosuk Kwon
|
925f3332ca
|
[Core] Refactor Attention Take 2 (#3462)
|
2024-03-25 04:39:33 +00:00 |
|
Kunshang Ji
|
6d93d35308
|
[BugFix] tensor.get_device() -> tensor.device (#3604)
|
2024-03-24 19:01:13 -07:00 |
|
Zhuohan Li
|
e90fc21f2e
|
[Hardware][Neuron] Refactor neuron support (#3471)
|
2024-03-22 01:22:17 +00:00 |
|
SangBin Cho
|
3bbff9e5ab
|
Fix 1D query issue from _prune_hidden_states (#3539)
|
2024-03-21 08:49:06 +00:00 |
|
Roy
|
f1c0fc3919
|
Migrate logits computation and gather to model_runner (#3233)
|
2024-03-20 23:25:01 +00:00 |
|
SangBin Cho
|
6e435de766
|
[1/n][Chunked Prefill] Refactor input query shapes (#3236)
|
2024-03-20 14:46:05 -07:00 |
|
Antoni Baum
|
426ec4ec67
|
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
|
2024-03-20 14:45:08 -07:00 |
|
Enrique Shockwave
|
b983ba35bd
|
fix marlin config repr (#3414)
|
2024-03-14 16:26:19 -07:00 |
|
youkaichao
|
8fe8386591
|
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389)
|
2024-03-14 08:11:48 +00:00 |
|
Antoni Baum
|
c33afd89f5
|
Fix lint (#3388)
|
2024-03-13 13:56:49 -07:00 |
|
Terry
|
7e9bd08f60
|
Add batched RoPE kernel (#3095)
|
2024-03-13 13:45:26 -07:00 |
|
Hui Liu
|
ba8dc958a3
|
[Minor] Fix bias in if to remove ambiguity (#3259)
|
2024-03-13 09:16:55 -07:00 |
|
Woosuk Kwon
|
602358f8a8
|
Add kernel for GeGLU with approximate GELU (#3337)
|
2024-03-12 22:06:17 -07:00 |
|
Zhuohan Li
|
2f8844ba08
|
Re-enable the 80 char line width limit (#3305)
|
2024-03-10 19:49:14 -07:00 |
|
Cade Daniel
|
8437bae6ef
|
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103)
|
2024-03-08 23:32:46 -08:00 |
|
Zhuohan Li
|
f48c6791b7
|
[FIX] Fix prefix test error on main (#3286)
|
2024-03-08 17:16:14 -08:00 |
|
Woosuk Kwon
|
1cb0cc2975
|
[FIX] Make flash_attn optional (#3269)
|
2024-03-08 10:52:20 -08:00 |
|
Woosuk Kwon
|
2daf23ab0c
|
Separate attention backends (#3005)
|
2024-03-07 01:45:50 -08:00 |
|
Nick Hill
|
8999ec3c16
|
Store eos_token_id in Sequence for easy access (#3166)
|
2024-03-05 15:35:43 -08:00 |
|
Antoni Baum
|
22de45235c
|
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
|
2024-03-04 19:54:06 +00:00 |
|
Robert Shaw
|
c0c2335ce0
|
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
|
2024-03-01 12:47:51 -08:00 |
|
CHU Tianxiang
|
01a5d18a53
|
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330)
|
2024-02-28 21:52:23 -08:00 |
|
Liangfu Chen
|
3b7178cfa4
|
[Neuron] Support inference with transformers-neuronx (#2569)
|
2024-02-28 09:34:34 -08:00 |
|
Tao He
|
71bcaf99e2
|
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
|
2024-02-27 01:14:31 -08:00 |
|
Woosuk Kwon
|
4bd18ec0c7
|
[Minor] Fix type annotation in fused moe (#3045)
|
2024-02-26 19:44:29 -08:00 |
|
Philipp Moritz
|
cfc15a1031
|
Optimize Triton MoE Kernel (#2979)
Co-authored-by: Cade Daniel <edacih@gmail.com>
|
2024-02-26 13:48:56 -08:00 |
|
Woosuk Kwon
|
f7c1234990
|
[Fix] Fix assertion on YaRN model len (#2984)
|
2024-02-23 12:57:48 -08:00 |
|
44670
|
c530e2cfe3
|
[FIX] Fix a bug in initializing Yarn RoPE (#2983)
|
2024-02-22 01:40:05 -08:00 |
|
Woosuk Kwon
|
fd5dcc5c81
|
Optimize GeGLU layer in Gemma (#2975)
|
2024-02-21 20:17:52 -08:00 |
|
Massimiliano Pronesti
|
93dc5a2870
|
chore(vllm): codespell for spell checking (#2820)
|
2024-02-21 18:56:01 -08:00 |
|
Nick Hill
|
7d2dcce175
|
Support per-request seed (#2514)
|
2024-02-21 11:47:00 -08:00 |
|
shiyi.c_98
|
64da65b322
|
Prefix Caching- fix t4 triton error (#2517)
|
2024-02-16 14:17:55 -08:00 |
|
Philipp Moritz
|
ea356004d4
|
Revert "Refactor llama family models (#2637)" (#2851)
This reverts commit 5c976a7e1a1bec875bf6474824b7dff39e38de18.
|
2024-02-13 09:24:59 -08:00 |
|
Roy
|
5c976a7e1a
|
Refactor llama family models (#2637)
|
2024-02-13 00:09:23 -08:00 |
|
Rex
|
563836496a
|
Refactor 2 awq gemm kernels into m16nXk32 (#2723)
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
|
2024-02-12 11:02:17 -08:00 |
|
Hongxia Yang
|
0580aab02f
|
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention (#2768)
|
2024-02-10 23:14:37 -08:00 |
|
Woosuk Kwon
|
f0d4e14557
|
Add fused top-K softmax kernel for MoE (#2769)
|
2024-02-05 17:38:02 -08:00 |
|
Kunshang Ji
|
96b6f475dd
|
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2024-02-01 15:46:39 -08:00 |
|
Philipp Moritz
|
d0d93b92b1
|
Add unit test for Mixtral MoE layer (#2677)
|
2024-01-31 14:34:17 -08:00 |
|
wangding zeng
|
5d60def02c
|
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
|
2024-01-29 21:19:48 -08:00 |
|
zhaoyang-star
|
9090bf02e7
|
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-28 16:43:54 -08:00 |
|
Casper
|
beb89f68b4
|
AWQ: Up to 2.66x higher throughput (#2566)
|
2024-01-26 23:53:17 -08:00 |
|
Antoni Baum
|
9b945daaf1
|
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
|
2024-01-23 15:26:37 -08:00 |
|