Woosuk Kwon
|
f0d4e14557
|
Add fused top-K softmax kernel for MoE (#2769)
|
2024-02-05 17:38:02 -08:00 |
|
Kunshang Ji
|
96b6f475dd
|
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2024-02-01 15:46:39 -08:00 |
|
Pernekhan Utemuratov
|
c410f5d020
|
Use revision when downloading the quantization config file (#2697)
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
|
2024-02-01 15:41:58 -08:00 |
|
Fengzhe Zhou
|
cd9e60c76c
|
Add Internlm2 (#2666)
|
2024-02-01 09:27:40 -08:00 |
|
Philipp Moritz
|
d0d93b92b1
|
Add unit test for Mixtral MoE layer (#2677)
|
2024-01-31 14:34:17 -08:00 |
|
Woosuk Kwon
|
3dad944485
|
Add quantized mixtral support (#2673)
|
2024-01-30 16:34:10 -08:00 |
|
Woosuk Kwon
|
105a40f53a
|
[Minor] Fix false warning when TP=1 (#2674)
|
2024-01-30 14:39:40 -08:00 |
|
Philipp Moritz
|
bbe9bd9684
|
[Minor] Fix a small typo (#2672)
|
2024-01-30 13:40:37 -08:00 |
|
Philipp Moritz
|
ab40644669
|
Fused MOE for Mixtral (#2542)
Co-authored-by: chen shen <scv119@gmail.com>
|
2024-01-29 22:43:37 -08:00 |
|
wangding zeng
|
5d60def02c
|
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
|
2024-01-29 21:19:48 -08:00 |
|
zhaoyang-star
|
9090bf02e7
|
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-28 16:43:54 -08:00 |
|
Hanzhi Zhou
|
380170038e
|
Implement custom all reduce kernels (#2192)
|
2024-01-27 12:46:35 -08:00 |
|
Casper
|
beb89f68b4
|
AWQ: Up to 2.66x higher throughput (#2566)
|
2024-01-26 23:53:17 -08:00 |
|
dakotamahan-stability
|
3a0e1fc070
|
Support for Stable LM 2 (#2598)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-26 12:45:19 -08:00 |
|
Junyang Lin
|
2832e7b9f9
|
fix names and license for Qwen2 (#2589)
|
2024-01-24 22:37:51 -08:00 |
|
Antoni Baum
|
9b945daaf1
|
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
|
2024-01-23 15:26:37 -08:00 |
|
Junyang Lin
|
94b5edeb53
|
Add qwen2 (#2495)
|
2024-01-22 14:34:21 -08:00 |
|
Cade Daniel
|
18bfcdd05c
|
[Speculative decoding 2/9] Multi-step worker for draft model (#2424)
|
2024-01-21 16:31:47 -08:00 |
|
Junda Chen
|
5b23c3f26f
|
Add group as an argument in broadcast ops (#2522)
|
2024-01-20 16:00:26 -08:00 |
|
Roy
|
91a61da9b1
|
[Bugfix] fix load local safetensors model (#2512)
|
2024-01-19 16:26:16 -08:00 |
|
Zhuohan Li
|
ef9b636e2d
|
Simplify broadcast logic for control messages (#2501)
|
2024-01-19 11:23:30 -08:00 |
|
Simon Mo
|
dd7e8f5f64
|
refactor completion api for readability (#2499)
|
2024-01-18 16:45:14 -08:00 |
|
Nikola Borisov
|
7e1081139d
|
Don't download both safetensor and bin files. (#2480)
|
2024-01-18 11:05:53 -08:00 |
|
YingchaoX
|
8a25d3a71a
|
fix stablelm.py tensor-parallel-size bug (#2482)
|
2024-01-18 09:39:46 -08:00 |
|
shiyi.c_98
|
d10f8e1d43
|
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-17 16:32:10 -08:00 |
|
Hyunsung Lee
|
e1957c6ebd
|
Add StableLM3B model (#2372)
|
2024-01-16 20:32:40 -08:00 |
|
Chenhui Zhang
|
f780504d12
|
fix weight loading for GQA with TP (#2379)
|
2024-01-15 15:43:59 -08:00 |
|
陈序
|
218dc2ccda
|
Aligning top_p and top_k Sampling (#1885)
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
|
2024-01-12 22:51:03 +01:00 |
|
Gary Hui
|
7878958c0d
|
Address Phi modeling update 2 (#2428)
|
2024-01-12 12:16:49 -08:00 |
|
Woosuk Kwon
|
50376faa7b
|
Rename phi_1_5 -> phi (#2385)
|
2024-01-11 16:23:43 -08:00 |
|
Cade Daniel
|
79d64c4954
|
[Speculative decoding 1/9] Optimized rejection sampler (#2336)
|
2024-01-09 15:38:41 -08:00 |
|
Woosuk Kwon
|
28c3f12104
|
[Minor] Remove unused code in attention (#2384)
|
2024-01-08 13:13:08 -08:00 |
|
Zhuohan Li
|
fd4ea8ef5c
|
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221)
|
2024-01-03 11:30:22 -08:00 |
|
Roy
|
9140561059
|
[Minor] Fix typo and remove unused code (#2305)
|
2024-01-02 19:23:15 -08:00 |
|
Jong-hun Shin
|
4934d49274
|
Support GPT-NeoX Models without attention biases (#2301)
|
2023-12-30 11:42:04 -05:00 |
|
Antoni Baum
|
bd29cf3d3a
|
Remove Sampler copy stream (#2209)
|
2023-12-20 00:04:33 -08:00 |
|
Woosuk Kwon
|
ba4f826738
|
[BugFix] Fix weight loading for Mixtral with TP (#2208)
|
2023-12-19 16:16:11 -08:00 |
|
avideci
|
de60a3fb93
|
Added DeciLM-7b and DeciLM-7b-instruct (#2062)
|
2023-12-19 02:29:33 -08:00 |
|
Woosuk Kwon
|
2c9b638065
|
[Minor] Fix a typo in .pt weight support (#2160)
|
2023-12-17 10:12:44 -08:00 |
|
Antoni Baum
|
a7347d9a6d
|
Make sampler less blocking (#1889)
|
2023-12-17 23:03:49 +08:00 |
|
Woosuk Kwon
|
c3372e87be
|
Remove dependency on CuPy (#2152)
|
2023-12-17 01:49:07 -08:00 |
|
Woosuk Kwon
|
37ca558103
|
Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
|
2023-12-16 21:12:08 -08:00 |
|
Roy
|
eed74a558f
|
Simplify weight loading logic (#2133)
|
2023-12-16 12:41:23 -08:00 |
|
CHU Tianxiang
|
0fbfc4b81b
|
Add GPTQ support (#916)
|
2023-12-15 03:04:22 -08:00 |
|
Antoni Baum
|
21d93c140d
|
Optimize Mixtral with expert parallelism (#2090)
|
2023-12-13 23:55:07 -08:00 |
|
Woosuk Kwon
|
518369d78c
|
Implement lazy model loader (#2044)
|
2023-12-12 22:21:45 -08:00 |
|
Megha Agarwal
|
6428f1d051
|
Support MPT with GQA (#1938)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2023-12-12 10:16:05 -08:00 |
|
Woosuk Kwon
|
cb3f30c600
|
Upgrade transformers version to 4.36.0 (#2046)
|
2023-12-11 18:39:14 -08:00 |
|
Woosuk Kwon
|
31d2ab4aff
|
Remove python 3.10 requirement (#2040)
|
2023-12-11 12:26:42 -08:00 |
|
Woosuk Kwon
|
6120e5aaea
|
Fix import error msg for megablocks (#2038)
|
2023-12-11 11:40:56 -08:00 |
|