Thomas Parnell | cf2f084d56 | Dynamic scheduler delay to improve ITL performance (#3279) Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> | 2024-03-22 12:28:14 -07:00
Hanzhi Zhou | f721096d48 | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00
Zhuohan Li | e90fc21f2e | [Hardware][Neuron] Refactor neuron support (#3471) | 2024-03-22 01:22:17 +00:00
Roy | ea5f14e6ff | [Bugfix][Model] Fix Qwen2 (#3554) | 2024-03-22 00:18:58 +00:00
Taemin Lee | b7050ca7df | [BugFix] gemma loading after quantization or LoRA. (#3553) | 2024-03-21 13:16:57 -07:00
Woosuk Kwon | c188ecb080 | [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (#3551) Co-authored-by: Roy <jasonailu87@gmail.com> Co-authored-by: Roger Meier <r.meier@siemens.com> | 2024-03-21 07:58:12 -07:00
Roy | 865732342b | [Misc][Log] Add log for tokenizer length not equal to vocabulary size (#3500) | 2024-03-21 18:07:48 +08:00
Lalit Pradhan | 4c07dd28c0 | [🚀 Ready to be merged] Added support for Jais models (#3183) | 2024-03-21 09:45:24 +00:00
SangBin Cho | 3bbff9e5ab | Fix 1D query issue from _prune_hidden_states (#3539) | 2024-03-21 08:49:06 +00:00
ElizaWszola | 6ebd02bdef | [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (#3431) Co-authored-by: rsnm2 <rshaw@neuralmagic.com> Co-authored-by: Luka <luka@paperspace> | 2024-03-20 23:20:04 -07:00
Roy | f1c0fc3919 | Migrate logits computation and gather to model_runner (#3233) | 2024-03-20 23:25:01 +00:00
SangBin Cho | 6e435de766 | [1/n][Chunked Prefill] Refactor input query shapes (#3236) | 2024-03-20 14:46:05 -07:00
Antoni Baum | 426ec4ec67 | [1/n] Triton sampling kernel (#3186) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> | 2024-03-20 14:45:08 -07:00
Woosuk Kwon | 5ee14494e4 | [Misc] Remove cache stream and cache events (#3461) | 2024-03-20 00:38:53 -07:00
Nick Hill | 4ad521d8b5 | [Core] Add generic typing to LRUCache (#3511) | 2024-03-20 00:36:09 -07:00
ElizaWszola | 9474e89ba4 | [PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> | 2024-03-20 00:11:11 -07:00
Simon Mo | 20478c4d3a | Use lru_cache for some environment detection utils (#3508) | 2024-03-19 21:34:15 +00:00
Simon Mo | cc63d03fbb | Revert "[Core] Cache some utils" (#3507) | 2024-03-19 13:22:58 -07:00
Nick Hill | 7341c77d69 | [BugFix] Avoid initializing CUDA too early (#3487) | 2024-03-18 23:05:20 -07:00
Simon Mo | ef65dcfa6f | [Doc] Add docs about OpenAI compatible server (#3288) | 2024-03-18 22:05:34 -07:00
youkaichao | 6a9c583e73 | [Core] print error before deadlock (#3459) | 2024-03-19 04:06:23 +00:00
Antoni Baum | b37cdce2b1 | [Core] Cache some utils (#3474) | 2024-03-18 17:14:26 -07:00
Antoni Baum | 49eedea373 | [Core] Zero-copy asdict for InputMetadata (#3475) | 2024-03-18 22:56:40 +00:00
Woosuk Kwon | abfc4f3387 | [Misc] Use dataclass for InputMetadata (#3452) Co-authored-by: youkaichao <youkaichao@126.com> | 2024-03-17 10:02:46 +00:00
Simon Mo | 120157fd2a | Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) | 2024-03-16 13:35:27 -07:00
Robert Shaw | 10585e035e | Removed Extraneous Print Message From OAI Server (#3440) | 2024-03-16 00:35:36 +00:00
Antoni Baum | fb96c1e98c | Asynchronous tokenization (#2879) | 2024-03-15 23:37:01 +00:00
Tao He | 14b8ae02e7 | Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) Signed-off-by: Tao He <sighingnow@gmail.com> Co-authored-by: simon-mo <simon.mo@hey.com> | 2024-03-15 18:25:43 +00:00
Dan Clark | 03d37f2441 | [Fix] Add args for mTLS support (#3430) Co-authored-by: declark1 <daniel.clark@ibm.com> | 2024-03-15 09:56:13 -07:00
Yang Fan | a7c871680e | Fix tie_word_embeddings for Qwen2. (#3344) | 2024-03-15 09:36:53 -07:00
Junda Chen | 429284dc37 | Fix dist.broadcast stall without group argument (#3408) | 2024-03-14 23:25:05 -07:00
youkaichao | b522c4476f | [Misc] add HOST_IP env var (#3419) Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-03-14 21:32:52 -07:00
Enrique Shockwave | b983ba35bd | fix marlin config repr (#3414) | 2024-03-14 16:26:19 -07:00
陈序 | 54be8a0be2 | Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373) Co-authored-by: Cade Daniel <edacih@gmail.com> | 2024-03-14 13:56:57 -07:00
Dan Clark | c17ca8ef18 | Add args for mTLS support (#3410) Co-authored-by: Daniel Clark <daniel.clark@ibm.com> | 2024-03-14 13:11:45 -07:00
youkaichao | 8fe8386591 | [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) | 2024-03-14 08:11:48 +00:00
Zhuohan Li | eeab52a4ff | [FIX] Simpler fix for async engine running on ray (#3371) | 2024-03-13 14:18:40 -07:00
Antoni Baum | c33afd89f5 | Fix lint (#3388) | 2024-03-13 13:56:49 -07:00
Terry | 7e9bd08f60 | Add batched RoPE kernel (#3095) | 2024-03-13 13:45:26 -07:00
Hui Liu | ba8dc958a3 | [Minor] Fix bias in if to remove ambiguity (#3259) | 2024-03-13 09:16:55 -07:00
Bo-Wen Wang | b167109ba1 | [Fix] Fix quantization="gptq" when using Marlin (#3319) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-03-12 22:51:42 -07:00
Woosuk Kwon | 602358f8a8 | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 22:06:17 -07:00
Breno Faria | 49a3c8662b | Fixes #1556 double free (#3347) | 2024-03-13 00:30:08 +00:00
DAIZHENWEI | 654865e21d | Support Mistral Model Inference with transformers-neuronx (#3153) | 2024-03-11 13:19:51 -07:00
Zhuohan Li | 4c922709b6 | Add distributed model executor abstraction (#3191) | 2024-03-11 11:03:45 -07:00
Zhuohan Li | 2f8844ba08 | Re-enable the 80 char line width limit (#3305) | 2024-03-10 19:49:14 -07:00
Nick Hill | 4b59f00e91 | [Fix] Fix best_of behavior when n=1 (#3298) | 2024-03-10 19:17:46 -07:00
Roy | 9e8744a545 | [BugFix] Fix get tokenizer when using ray (#3301) | 2024-03-10 19:17:16 -07:00
Cade Daniel | 8437bae6ef | [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) | 2024-03-08 23:32:46 -08:00
Zhuohan Li | f48c6791b7 | [FIX] Fix prefix test error on main (#3286) | 2024-03-08 17:16:14 -08:00