| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Alexander Matveev | 0310029a2f | [Bugfix] Fix awq_marlin and gptq_marlin flags (#6745) | 2024-07-24 22:34:11 -07:00 |
| Alphi | 9e169a4c61 | [Model] Adding support for MiniCPM-V (#4087) | 2024-07-24 20:59:30 -07:00 |
| liuyhwangyh | f4f8a9d892 | [Bugfix]fix modelscope compatible issue (#6730) | 2024-07-24 05:04:46 -07:00 |
| Roger Wang | 0a740a11ba | [Bugfix] Fix token padding for chameleon (#6724) | 2024-07-24 01:05:09 -07:00 |
| dongmao zhang | 87525fab92 | [bitsandbytes]: support read bnb pre-quantized model (#5753) (Co-authored-by: Michael Goin \<michael@neuralmagic.com\>) | 2024-07-23 23:45:09 +00:00 |
| Roger Wang | 1bedf210e3 | Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) | 2024-07-23 13:47:48 -07:00 |
| Travis Johnson | 507ef787d8 | [Model] Pipeline Parallel Support for DeepSeek v2 (#6519) (Signed-off-by: Travis Johnson \<tsjohnso@us.ibm.com\>) | 2024-07-23 12:22:09 -07:00 |
| Michael Goin | 0eb0757bef | [Misc] Add ignored layers for fp8 quantization (#6657) | 2024-07-23 14:04:04 -04:00 |
| Woosuk Kwon | a112a84aad | [BugFix] Fix RoPE error in Llama 3.1 (#6693) | 2024-07-23 09:46:05 -07:00 |
| Simon Mo | 3eda4ec780 | support ignore patterns in model loader (#6673) | 2024-07-22 23:59:42 -07:00 |
| Roger Wang | 22fa2e35cb | [VLM][Model] Support image input for Chameleon (#6633) | 2024-07-22 23:50:48 -07:00 |
| youkaichao | c5201240a4 | [misc] only tqdm for first rank (#6672) | 2024-07-22 21:57:27 -07:00 |
| Michael Goin | 9e0b558a09 | [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) | 2024-07-23 04:11:50 +00:00 |
| zhaotyer | e519ae097a | add tqdm when loading checkpoint shards (#6569) (Co-authored-by: tianyi.zhao \<tianyi.zhao@transwarp.io\>, youkaichao \<youkaichao@126.com\>) | 2024-07-22 20:48:01 -07:00 |
| Cheng Li | c5e8330997 | [Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization (#6665) | 2024-07-22 19:25:05 -07:00 |
| Jae-Won Chung | 89c1c6a196 | [Bugfix] Fix vocab_size field access in llava_next.py (#6624) | 2024-07-22 05:02:51 +00:00 |
| Roger Wang | c9eef37f32 | [Model] Initial Support for Chameleon (#5770) | 2024-07-21 17:37:51 -07:00 |
| Alexander Matveev | 396d92d5e0 | [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) | 2024-07-21 19:41:42 -04:00 |
| Isotr0py | 25e778aa16 | [Model] Refactor and decouple phi3v image embedding (#6621) | 2024-07-21 16:07:58 -07:00 |
| Robert Shaw | 082ecd80d5 | [ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) | 2024-07-20 17:25:56 -06:00 |
| Michael Goin | f952bbc8ff | [Misc] Fix input_scale typing in w8a8_utils.py (#6579) | 2024-07-20 23:11:13 +00:00 |
| Robert Shaw | 9364f74eee | [ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) | 2024-07-20 18:50:10 +00:00 |
| Matt Wong | 06d6c5fe9f | [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) | 2024-07-20 09:39:07 -07:00 |
| Robert Shaw | 683e3cb9c4 | [ Misc ] fbgemm checkpoints (#6559) | 2024-07-20 09:36:57 -07:00 |
| Cyrus Leung | 9042d68362 | [Misc] Consolidate and optimize logic for building padded tensors (#6541) | 2024-07-20 04:17:24 +00:00 |
| Robert Shaw | 4cc24f01b1 | [ Kernel ] Enable Dynamic Per Token fp8 (#6547) | 2024-07-19 23:08:15 +00:00 |
| Antoni Baum | 9ed82e7074 | [Misc] Small perf improvements (#6520) | 2024-07-19 12:10:56 -07:00 |
| Thomas Parnell | a5314e8698 | [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) (Signed-off-by: Thomas Parnell \<tpa@zurich.ibm.com\>) | 2024-07-19 07:15:22 -06:00 |
| Robert Shaw | dbe5588554 | [ Misc ] non-uniform quantization via compressed-tensors for Llama (#6515) | 2024-07-18 22:39:18 -04:00 |
| Thomas Parnell | d4201e06d5 | [Bugfix] Make spec. decode respect per-request seed. (#6034) (Signed-off-by: Thomas Parnell \<tpa@zurich.ibm.com\>; Co-authored-by: Nick Hill \<nickhill@us.ibm.com\>) | 2024-07-18 19:22:08 -07:00 |
| Simon Mo | c5df56f88b | Add support for a rope extension method (#6553) | 2024-07-19 01:53:03 +00:00 |
| Tyler Michael Smith | 4ffffccb7e | [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) | 2024-07-18 23:52:22 +00:00 |
| Michael Goin | 15c6a079b1 | [Model] Support Mistral-Nemo (#6548) | 2024-07-18 20:31:50 +00:00 |
| Robert Shaw | 58ca663224 | [ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) | 2024-07-18 14:39:12 +00:00 |
| youkaichao | 1c27d25fb5 | [core][model] yet another cpu offload implementation (#6496) (Co-authored-by: Michael Goin \<michael@neuralmagic.com\>) | 2024-07-17 20:54:35 -07:00 |
| Robert Shaw | 18fecc3559 | [ Kernel ] Fp8 Channelwise Weight Support (#6487) | 2024-07-18 03:18:13 +00:00 |
| Cody Yu | b5af8c223c | [Model] Pipeline parallel support for Mixtral (#6516) | 2024-07-17 19:26:04 -07:00 |
| Alexander Matveev | e76466dde2 | [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) | 2024-07-17 14:30:28 -07:00 |
| Wushi Dong | 1d094fd7c0 | [Distributed][PP] only create embedding & lm head when necessary (#6455) (original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization) | 2024-07-16 19:20:26 -07:00 |
| youkaichao | ce37be7ba0 | [misc][distributed] add seed to dummy weights (#6491) | 2024-07-16 19:16:34 -07:00 |
| Michael Goin | 978aed5300 | [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) | 2024-07-16 15:31:32 -07:00 |
| Woosuk Kwon | c467dff24f | [Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) | 2024-07-16 09:56:28 -07:00 |
| Peng Guanwen | 2bb0489cb3 | [Core] Use numpy to speed up padded token processing (#6442) | 2024-07-16 08:13:25 -07:00 |
| Mor Zusman | 9ad32dacd9 | [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) (Co-authored-by: Mor Zusman \<morz@ai21.com\>) | 2024-07-16 01:32:55 +00:00 |
| Woosuk Kwon | ec9933f4a5 | [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) | 2024-07-15 19:02:14 +00:00 |
| youkaichao | 4cf256ae7f | [misc][distributed] fix pp missing layer condition (#6446) | 2024-07-15 10:32:35 -07:00 |
| Tyler Michael Smith | c8fd97f26d | [Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270) | 2024-07-15 13:05:52 -04:00 |
| Roger Wang | 6ae1597ddf | [VLM] Minor space optimization for ClipVisionModel (#6436) | 2024-07-15 17:29:51 +08:00 |
| youkaichao | 69672f116c | [core][distributed] simplify code to support pipeline parallel (#6406) | 2024-07-14 21:20:51 -07:00 |
| Robert Shaw | 73030b7dae | [ Misc ] Enable Quantizing All Layers of DeekSeekv2 (#6423) | 2024-07-14 21:38:42 +00:00 |