| Author | Commit | Message | Date |
| Jae-Won Chung | 89c1c6a196 | [Bugfix] Fix vocab_size field access in llava_next.py (#6624) | 2024-07-22 05:02:51 +00:00 |
| Woosuk Kwon | 42de2cefcb | [Misc] Add a wrapper for torch.inference_mode (#6618) | 2024-07-21 18:43:11 -07:00 |
| Roger Wang | c9eef37f32 | [Model] Initial Support for Chameleon (#5770) | 2024-07-21 17:37:51 -07:00 |
| Alexander Matveev | 396d92d5e0 | [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) | 2024-07-21 19:41:42 -04:00 |
| Isotr0py | 25e778aa16 | [Model] Refactor and decouple phi3v image embedding (#6621) | 2024-07-21 16:07:58 -07:00 |
| Woosuk Kwon | b6df37f943 | [Misc] Remove abused noqa (#6619) | 2024-07-21 23:47:04 +08:00 |
| sroy745 | 14f91fe67c | [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) | 2024-07-20 23:58:58 -07:00 |
| Cyrus Leung | d7f4178dd9 | [Frontend] Move chat utils (#6602) (Co-authored-by: Roger Wang <ywang@roblox.com>) | 2024-07-21 08:38:17 +08:00 |
| Robert Shaw | 082ecd80d5 | [ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) | 2024-07-20 17:25:56 -06:00 |
| Michael Goin | f952bbc8ff | [Misc] Fix input_scale typing in w8a8_utils.py (#6579) | 2024-07-20 23:11:13 +00:00 |
| Robert Shaw | 9364f74eee | [ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) | 2024-07-20 18:50:10 +00:00 |
| Matt Wong | 06d6c5fe9f | [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) | 2024-07-20 09:39:07 -07:00 |
| Robert Shaw | 683e3cb9c4 | [ Misc ] fbgemm checkpoints (#6559) | 2024-07-20 09:36:57 -07:00 |
| Cyrus Leung | 9042d68362 | [Misc] Consolidate and optimize logic for building padded tensors (#6541) | 2024-07-20 04:17:24 +00:00 |
| Travis Johnson | 3f8d42c81f | Pipeline Parallel: Guard for KeyErrors at request abort (#6587) (Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>) | 2024-07-19 19:18:19 -07:00 |
| Antoni Baum | 7bd82002ae | [Core] Allow specifying custom Executor (#6557) | 2024-07-20 01:25:06 +00:00 |
| Varun Sundar Rabindranath | 2e26564259 | [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) (Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>) | 2024-07-19 18:15:26 -07:00 |
| youkaichao | e81522e879 | [build] add ib in image for out-of-the-box infiniband support (#6599) | 2024-07-19 17:16:57 -07:00 |
| Murali Andoorveedu | 45ceb85a0c | [Docs] Update PP docs (#6598) | 2024-07-19 16:38:21 -07:00 |
| Robert Shaw | 4cc24f01b1 | [ Kernel ] Enable Dynamic Per Token fp8 (#6547) | 2024-07-19 23:08:15 +00:00 |
| youkaichao | 07eb6f19f3 | [bugfix][distributed] fix multi-node bug for shared memory (#6597) | 2024-07-19 15:34:34 -07:00 |
| Thomas Parnell | f0bbfaf917 | [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) | 2024-07-19 14:01:03 -07:00 |
| Simon Mo | 30efe41532 | [Docs] Update docs for wheel location (#6580) | 2024-07-19 12:14:11 -07:00 |
| Antoni Baum | 9ed82e7074 | [Misc] Small perf improvements (#6520) | 2024-07-19 12:10:56 -07:00 |
| Daniele | 51f8aa90ad | [Bugfix][Frontend] remove duplicate init logger (#6581) | 2024-07-19 10:16:27 -07:00 |
| Thomas Parnell | a5314e8698 | [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) (Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>) | 2024-07-19 07:15:22 -06:00 |
| Woo-Yeon Lee | a921e86392 | [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) | 2024-07-19 06:01:09 -07:00 |
| Cyrus Leung | 6366efc67b | [Bugfix][Frontend] Fix missing /metrics endpoint (#6463) | 2024-07-19 03:55:13 +00:00 |
| Robert Shaw | dbe5588554 | [ Misc ] non-uniform quantization via compressed-tensors for Llama (#6515) | 2024-07-18 22:39:18 -04:00 |
| Thomas Parnell | d4201e06d5 | [Bugfix] Make spec. decode respect per-request seed. (#6034) (Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>; Co-authored-by: Nick Hill <nickhill@us.ibm.com>) | 2024-07-18 19:22:08 -07:00 |
| Nick Hill | b5672a112c | [Core] Multiprocessing Pipeline Parallel support (#6130) (Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>) | 2024-07-18 19:15:52 -07:00 |
| Simon Mo | c5df56f88b | Add support for a rope extension method (#6553) | 2024-07-19 01:53:03 +00:00 |
| Tyler Michael Smith | 1689219ebf | [CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517) | 2024-07-18 17:29:25 -07:00 |
| Tyler Michael Smith | 4ffffccb7e | [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) | 2024-07-18 23:52:22 +00:00 |
| youkaichao | f53b8f0d05 | [ci][test] add correctness test for cpu offloading (#6549) | 2024-07-18 23:41:06 +00:00 |
| Kevin H. Luu | 2d4733ba2d | Fix PR comment bot (#6554) (Signed-off-by: kevin <kevin@anyscale.com>) | 2024-07-18 14:48:29 -07:00 |
| Michael Goin | 15c6a079b1 | [Model] Support Mistral-Nemo (#6548) | 2024-07-18 20:31:50 +00:00 |
| Kevin H. Luu | ecdb462c24 | [ci] Reword Github bot comment (#6534) | 2024-07-18 08:01:45 -07:00 |
| Robert Shaw | 58ca663224 | [ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) | 2024-07-18 14:39:12 +00:00 |
| Woosuk Kwon | 4634c8728b | [TPU] Refactor TPU worker & model runner (#6506) | 2024-07-18 01:34:16 -07:00 |
| Noam Gat | c8a7d51c49 | [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) | 2024-07-18 07:47:13 +00:00 |
| Nick Hill | e2fbaee725 | [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227) (Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>) | 2024-07-18 15:13:30 +08:00 |
| Cody Yu | 8a74c68bd1 | [Misc] Minor patch for draft model runner (#6523) | 2024-07-18 06:06:21 +00:00 |
| Rui Qiao | 61e592747c | [Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032) (Signed-off-by: Rui Qiao <ruisearch42@gmail.com>; Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>) | 2024-07-17 22:27:09 -07:00 |
| Nick Hill | d25877dd9b | [BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) | 2024-07-17 22:24:43 -07:00 |
| youkaichao | 1c27d25fb5 | [core][model] yet another cpu offload implementation (#6496) (Co-authored-by: Michael Goin <michael@neuralmagic.com>) | 2024-07-17 20:54:35 -07:00 |
| Robert Shaw | 18fecc3559 | [ Kernel ] Fp8 Channelwise Weight Support (#6487) | 2024-07-18 03:18:13 +00:00 |
| Cody Yu | b5af8c223c | [Model] Pipeline parallel support for Mixtral (#6516) | 2024-07-17 19:26:04 -07:00 |
| Varun Sundar Rabindranath | b5241e41d9 | [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) (Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>) | 2024-07-18 01:38:35 +00:00 |
| Alexander Matveev | e76466dde2 | [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) | 2024-07-17 14:30:28 -07:00 |