| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Woosuk Kwon | 5ee14494e4 | [Misc] Remove cache stream and cache events (#3461) | 2024-03-20 00:38:53 -07:00 |
| Cade Daniel | 8437bae6ef | [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) | 2024-03-08 23:32:46 -08:00 |
| Hongxia Yang | 05af6da8d9 | [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123) (Co-authored-by: lcskrishna <lollachaitanya@gmail.com>) | 2024-03-04 18:14:53 -08:00 |
| Zhuohan Li | 537c9755a7 | [Minor] Small fix to make distributed init logic in worker looks cleaner (#2905) | 2024-02-18 14:39:00 -08:00 |
| Woosuk Kwon | 25e86b6a61 | Don't use cupy NCCL for AMD backends (#2855) | 2024-02-14 12:30:44 -08:00 |
| Woosuk Kwon | 7e45107f51 | [Fix] Fix memory profiling when GPU is used by multiple processes (#2863) | 2024-02-13 19:52:34 -08:00 |
| Woosuk Kwon | a463c333dd | Use CuPy for CUDA graphs (#2811) | 2024-02-13 11:32:06 -08:00 |
| Kunshang Ji | 96b6f475dd | Remove hardcoded device="cuda" to support more devices (#2503) (Co-authored-by: Jiang Li <jiang1.li@intel.com>, Kunshang Ji <kunshang.ji@intel.com>) | 2024-02-01 15:46:39 -08:00 |
| zhaoyang-star | 9090bf02e7 | Support FP8-E5M2 KV Cache (#2279) (Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>, Zhuohan Li <zhuohan123@gmail.com>) | 2024-01-28 16:43:54 -08:00 |
| Hanzhi Zhou | 380170038e | Implement custom all reduce kernels (#2192) | 2024-01-27 12:46:35 -08:00 |
| Antoni Baum | 9b945daaf1 | [Experimental] Add multi-LoRA support (#1804) (Co-authored-by: Chen Shen <scv119@gmail.com>, Shreyas Krishnaswamy <shrekris@anyscale.com>, Avnish Narayan <avnish@anyscale.com>) | 2024-01-23 15:26:37 -08:00 |
| Cade Daniel | 18bfcdd05c | [Speculative decoding 2/9] Multi-step worker for draft model (#2424) | 2024-01-21 16:31:47 -08:00 |
| Zhuohan Li | ef9b636e2d | Simplify broadcast logic for control messages (#2501) | 2024-01-19 11:23:30 -08:00 |
| Woosuk Kwon | 35c4bc20d9 | [Minor] Fix err msg (#2431) | 2024-01-12 14:02:52 -08:00 |
| Ben | cb7a1c1cbf | Suggest using dtype=half when OOM. | 2024-01-12 12:33:29 -08:00 |
| Jiaxiang | 6549aef245 | [DOC] Add additional comments for LLMEngine and AsyncLLMEngine (#1011) | 2024-01-11 19:26:49 -08:00 |
| Zhuohan Li | fd4ea8ef5c | Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) | 2024-01-03 11:30:22 -08:00 |
| Woosuk Kwon | c3372e87be | Remove dependency on CuPy (#2152) | 2023-12-17 01:49:07 -08:00 |
| Woosuk Kwon | e1d5402238 | Fix all-reduce memory usage (#2151) | 2023-12-17 01:44:45 -08:00 |
| Woosuk Kwon | 37ca558103 | Optimize model execution with CUDA graph (#1926) (Co-authored-by: Chen Shen <scv119@gmail.com>, Antoni Baum <antoni.baum@protonmail.com>) | 2023-12-16 21:12:08 -08:00 |
| Woosuk Kwon | 30bad5c492 | Fix peak memory profiling (#2031) | 2023-12-12 22:01:53 -08:00 |
| Woosuk Kwon | 27feead2f8 | Refactor Worker & InputMetadata (#1843) | 2023-11-29 22:16:37 -08:00 |
| boydfd | 4bb6b67188 | fix RAM OOM when load large models in tensor parallel mode. (#1395) (Co-authored-by: ran_lin <rlin@thoughtworks.com>) | 2023-11-20 19:02:42 -08:00 |
| Simon Mo | 5ffc0d13a2 | Migrate linter from pylint to ruff (#1665) | 2023-11-20 11:58:01 -08:00 |
| Yanming W | 8efe23f150 | Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) | 2023-11-08 14:19:12 -08:00 |
| Antoni Baum | 9738b84a08 | Force paged attention v2 for long contexts (#1510) | 2023-11-01 16:24:32 -07:00 |
| Woosuk Kwon | 0ce8647dc5 | Fix integer overflows in attention & cache ops (#1514) | 2023-10-31 15:19:30 -07:00 |
| Antoni Baum | 15f5632365 | Delay GPU->CPU sync in sampling (#1337) | 2023-10-30 09:01:34 -07:00 |
| Woosuk Kwon | c1376e0f82 | Change scheduler & input tensor shape (#1381) | 2023-10-16 17:48:42 -07:00 |
| Antoni Baum | ee92b58b3a | Move bfloat16 check to worker (#1259) | 2023-10-07 22:10:44 -07:00 |
| Woosuk Kwon | 2e8e49fce3 | [Fix] Remove false assertion (#1222) | 2023-09-28 10:52:38 -07:00 |
| Woosuk Kwon | a8e98aee0c | Fix Mistral model (#1220) | 2023-09-28 10:44:05 -07:00 |
| Chris Bamford | bb1ba58f06 | [Mistral] Mistral-7B-v0.1 support (#1196) (Co-authored-by: timlacroix <t@mistral.ai>) | 2023-09-28 10:41:03 -07:00 |
| Antoni Baum | cf5cb1e33e | Allocate more shared memory to attention kernel (#1154) | 2023-09-26 22:27:13 -07:00 |
| Woosuk Kwon | 2ac4d5e2bf | Replace DtypeTensor (#1123) | 2023-09-21 00:51:47 -07:00 |
| Zhuohan Li | 002800f081 | Align vLLM's beam search implementation with HF generate (#857) | 2023-09-04 17:29:42 -07:00 |
| Zhuohan Li | 1b0bd0fe8a | Add Falcon support (new) (#592) | 2023-08-02 14:04:39 -07:00 |
| Antoni Baum | 9925c17940 | Ray placement group support (#397) | 2023-07-19 22:49:31 -07:00 |
| Zhuohan Li | d6fa1be3a8 | [Quality] Add code formatter and linter (#326) | 2023-07-03 11:31:55 -07:00 |
| Zhuohan Li | 0b7db411b5 | [Bug] Fix the OOM condition for CPU cache (#260) | 2023-06-26 11:16:13 -07:00 |
| Woosuk Kwon | 0b98ba15c7 | Change the name to vLLM (#150) | 2023-06-17 03:07:40 -07:00 |