557 Commits

Author SHA1 Message Date
Alexander Matveev
0310029a2f
[Bugfix] Fix awq_marlin and gptq_marlin flags (#6745) 2024-07-24 22:34:11 -07:00
Alphi
9e169a4c61
[Model] Adding support for MiniCPM-V (#4087) 2024-07-24 20:59:30 -07:00
liuyhwangyh
f4f8a9d892
[Bugfix]fix modelscope compatible issue (#6730) 2024-07-24 05:04:46 -07:00
Roger Wang
0a740a11ba
[Bugfix] Fix token padding for chameleon (#6724) 2024-07-24 01:05:09 -07:00
dongmao zhang
87525fab92
[bitsandbytes]: support read bnb pre-quantized model (#5753)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Roger Wang
1bedf210e3
Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) 2024-07-23 13:47:48 -07:00
Travis Johnson
507ef787d8
[Model] Pipeline Parallel Support for DeepSeek v2 (#6519)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-23 12:22:09 -07:00
Michael Goin
0eb0757bef
[Misc] Add ignored layers for fp8 quantization (#6657) 2024-07-23 14:04:04 -04:00
Woosuk Kwon
a112a84aad
[BugFix] Fix RoPE error in Llama 3.1 (#6693) 2024-07-23 09:46:05 -07:00
Simon Mo
3eda4ec780
support ignore patterns in model loader (#6673) 2024-07-22 23:59:42 -07:00
Roger Wang
22fa2e35cb
[VLM][Model] Support image input for Chameleon (#6633) 2024-07-22 23:50:48 -07:00
youkaichao
c5201240a4
[misc] only tqdm for first rank (#6672) 2024-07-22 21:57:27 -07:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors (#6528) 2024-07-23 04:11:50 +00:00
zhaotyer
e519ae097a
add tqdm when loading checkpoint shards (#6569)
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-22 20:48:01 -07:00
Cheng Li
c5e8330997
[Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization (#6665) 2024-07-22 19:25:05 -07:00
Jae-Won Chung
89c1c6a196
[Bugfix] Fix vocab_size field access in llava_next.py (#6624) 2024-07-22 05:02:51 +00:00
Roger Wang
c9eef37f32
[Model] Initial Support for Chameleon (#5770) 2024-07-21 17:37:51 -07:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
Isotr0py
25e778aa16
[Model] Refactor and decouple phi3v image embedding (#6621) 2024-07-21 16:07:58 -07:00
Robert Shaw
082ecd80d5
[ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) 2024-07-20 17:25:56 -06:00
Michael Goin
f952bbc8ff
[Misc] Fix input_scale typing in w8a8_utils.py (#6579) 2024-07-20 23:11:13 +00:00
Robert Shaw
9364f74eee
[ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) 2024-07-20 18:50:10 +00:00
Matt Wong
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) 2024-07-20 09:39:07 -07:00
Robert Shaw
683e3cb9c4
[ Misc ] fbgemm checkpoints (#6559) 2024-07-20 09:36:57 -07:00
Cyrus Leung
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors (#6541) 2024-07-20 04:17:24 +00:00
Robert Shaw
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 (#6547) 2024-07-19 23:08:15 +00:00
Antoni Baum
9ed82e7074
[Misc] Small perf improvements (#6520) 2024-07-19 12:10:56 -07:00
Thomas Parnell
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply (#6327)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Robert Shaw
dbe5588554
[ Misc ] non-uniform quantization via compressed-tensors for Llama (#6515) 2024-07-18 22:39:18 -04:00
Thomas Parnell
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. (#6034)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-18 19:22:08 -07:00
Simon Mo
c5df56f88b
Add support for a rope extension method (#6553) 2024-07-19 01:53:03 +00:00
Tyler Michael Smith
4ffffccb7e
[Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) 2024-07-18 23:52:22 +00:00
Michael Goin
15c6a079b1
[Model] Support Mistral-Nemo (#6548) 2024-07-18 20:31:50 +00:00
Robert Shaw
58ca663224
[ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) 2024-07-18 14:39:12 +00:00
youkaichao
1c27d25fb5
[core][model] yet another cpu offload implementation (#6496)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-17 20:54:35 -07:00
Robert Shaw
18fecc3559
[ Kernel ] Fp8 Channelwise Weight Support (#6487) 2024-07-18 03:18:13 +00:00
Cody Yu
b5af8c223c
[Model] Pipeline parallel support for Mixtral (#6516) 2024-07-17 19:26:04 -07:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 2024-07-17 14:30:28 -07:00
Wushi Dong
1d094fd7c0
[Distributed][PP] only create embedding & lm head when necessary (#6455)
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
2024-07-16 19:20:26 -07:00
youkaichao
ce37be7ba0
[misc][distributed] add seed to dummy weights (#6491) 2024-07-16 19:16:34 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00
Woosuk Kwon
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) 2024-07-16 09:56:28 -07:00
Peng Guanwen
2bb0489cb3
[Core] Use numpy to speed up padded token processing (#6442) 2024-07-16 08:13:25 -07:00
Mor Zusman
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425)
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-07-16 01:32:55 +00:00
Woosuk Kwon
ec9933f4a5
[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) 2024-07-15 19:02:14 +00:00
youkaichao
4cf256ae7f
[misc][distributed] fix pp missing layer condition (#6446) 2024-07-15 10:32:35 -07:00
Tyler Michael Smith
c8fd97f26d
[Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270) 2024-07-15 13:05:52 -04:00
Roger Wang
6ae1597ddf
[VLM] Minor space optimization for ClipVisionModel (#6436) 2024-07-15 17:29:51 +08:00
youkaichao
69672f116c
[core][distributed] simplify code to support pipeline parallel (#6406) 2024-07-14 21:20:51 -07:00
Robert Shaw
73030b7dae
[ Misc ] Enable Quantizing All Layers of DeekSeekv2 (#6423) 2024-07-14 21:38:42 +00:00