32eb0da808  yancong  (2025-01-18 16:13:16 -08:00)
    [Misc] Support register quantization method out-of-tree (#11969)

b5b57e301e  Gregory Shtrasberg  (2025-01-17 17:12:26 +00:00)
    [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
    Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

d4e6194570  Li, Jiang  (2025-01-17 19:39:52 +08:00)
    [CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
    Signed-off-by: jiang1.li <jiang1.li@intel.com>

bf53e0c70b  youkaichao  (2025-01-16 19:58:53 +08:00)
    Support torchrun and SPMD-style offline inference (#12071)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

9aa1519f08  Michael Goin  (2025-01-16 09:59:06 +00:00)
    Various cosmetic/comment fixes (#12089)
    Signed-off-by: mgoin <michael@neuralmagic.com>

fa0050db08  Elfie Guo  (2025-01-16 04:31:27 +00:00)
    [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)
    Signed-off-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: Michael Goin <mgoin@redhat.com>
    Co-authored-by: mgoin <michael@neuralmagic.com>

de0526f668  kewang-xlnx  (2025-01-15 11:05:15 -05:00)
    [Misc][Quark] Upstream Quark format to VLLM (#10765)
    Signed-off-by: kewang-xlnx <kewang@xilinx.com>
    Signed-off-by: kewang2 <kewang2@amd.com>
    Co-authored-by: kewang2 <kewang2@amd.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

cbe94391eb  Rahul Tuli  (2025-01-15 17:41:24 +08:00)
    Fix: cases with empty sparsity config (#12057)
    Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

42f5e7c52a  Jee Jee Li  (2025-01-15 02:29:53 +00:00)
    [Kernel] Support MulAndSilu (#11624)
    Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

f35ec461fc  Steve Luo  (2025-01-13 13:43:51 -07:00)
    [Bugfix] Fix deepseekv3 gate bias error (#12002)
    Signed-off-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: mgoin <michael@neuralmagic.com>

d14e98d924  Isotr0py  (2025-01-13 00:13:44 +00:00)
    [Model] Support GGUF models newly added in transformers 4.46.0 (#9685)
    Signed-off-by: Isotr0py <2037008807@qq.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

263a870ee1  Avshalom Manevich  (2025-01-12 10:53:51 -05:00)
    [Hardware][TPU] workaround fix for MoE on TPU (#11764)

c32a7c7c0c  shaochangxu  (2025-01-11 13:49:39 +08:00)
    [Bugfix] fused_experts_impl wrong compute type for float32 (#11921)
    Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
    Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>

aa1e77a19c  Li, Jiang  (2025-01-10 11:07:58 -05:00)
    [Hardware][CPU] Support MOE models on x86 CPU (#11831)
    Signed-off-by: jiang1.li <jiang1.li@intel.com>

20410b2fda  wangxiyuan  (2025-01-10 23:46:51 +08:00)
    [platform] support custom torch.compile backend key (#11318)
    Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
    Signed-off-by: youkaichao <youkaichao@gmail.com>
    Co-authored-by: youkaichao <youkaichao@gmail.com>

d907be7dc7  cennn  (2025-01-10 17:18:25 +08:00)
    [misc] remove python function call for custom activation op (#11885)
    Co-authored-by: youkaichao <youkaichao@gmail.com>

d848800e88  Cyrus Leung  (2025-01-09 12:48:12 +08:00)
    [Misc] Move print_*_once from utils to logger (#11298)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
    Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>

526de822d5  rasmith  (2025-01-08 20:23:15 +00:00)
    [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (#11698)
    Signed-off-by: Randall Smith <Randall.Smith@amd.com>

56fe4c297c  Robert Shaw  (2025-01-08 19:33:29 +00:00)
    [TPU][Quantization] TPU W8A8 (#11785)
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

78f4590b60  Yan Ma  (2025-01-09 00:11:50 +08:00)
    [Bugfix][XPU] fix silu_and_mul (#11823)
    Signed-off-by: yan ma <yan.ma@intel.com>

2f7024987e  Li, Jiang  (2025-01-08 15:18:28 +00:00)
    [CI/Build][Bugfix] Fix CPU CI image clean up (#11836)
    Signed-off-by: jiang1.li <jiang1.li@intel.com>

869579a702  youkaichao  (2025-01-07 17:04:28 +00:00)
    [optimization] remove python function call for custom op (#11750)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

9c749713f6  Lucas Tucker  (2025-01-06 07:59:36 +00:00)
    [mypy] Forward pass function type hints in lora (#11740)
    Signed-off-by: lucast2021 <lucast2021@headroyce.org>
    Co-authored-by: lucast2021 <lucast2021@headroyce.org>

65c08928c2  Cyrus Leung  (2025-01-04 23:46:21 +08:00)
    [Model] Remove unnecessary weight initialization logic (#11736)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Signed-off-by: Isotr0py <2037008807@qq.com>
    Co-authored-by: Isotr0py <2037008807@qq.com>

07064cb1d4  Lu Fang  (2025-01-02 16:58:56 -08:00)
    [Bugfix] Check chain_speculative_sampling before calling it (#11673)
    Signed-off-by: Lu Fang <lufang@fb.com>

6d70198b17  Kazuhiro Serizawa  (2025-01-01 08:10:10 +00:00)
    [Doc] Fix typo (#11666)
    Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com>

5dbf854553  Li, Jiang  (2024-12-30 10:17:04 +00:00)
    [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618)
    Signed-off-by: jiang1.li <jiang1.li@intel.com>

0aa38d16f5  Michael Goin  (2024-12-29 20:16:46 +00:00)
    Remove print statement in DeepseekScalingRotaryEmbedding (#11604)

dba4d9dec6  youkaichao  (2024-12-29 09:03:49 +00:00)
    [v1][bugfix] fix cudagraph with inplace buffer assignment (#11596)
    Signed-off-by: youkaichao <youkaichao@gmail.com>

ac79799403  Selali  (2024-12-27 20:12:11 +00:00)
    [Bugfix] Fix for ROCM compressed tensor support (#11561)

55509c2114  ErezSC42  (2024-12-27 17:58:21 +00:00)
    [MODEL] LoRA support for Jamba model (#11209)
    Signed-off-by: Erez Schwartz <erezs@ai21.com>

2339d59f92  Robert Shaw  (2024-12-26 22:23:29 -08:00)
    [BugFix] Fix quantization for all other methods (#11547)

f49777ba62  Simon Mo  (2024-12-26 16:09:44 -08:00)
    Deepseek v3 (#11502)
    Signed-off-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com>

2072924d14  Michael Goin  (2024-12-26 15:33:30 -08:00)
    [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523)
    Signed-off-by: mgoin <michael@neuralmagic.com>
    Signed-off-by: simon-mo <simon.mo@hey.com>
    Signed-off-by: simon-mo <xmo@berkeley.edu>
    Co-authored-by: simon-mo <simon.mo@hey.com>
    Co-authored-by: simon-mo <xmo@berkeley.edu>
    Co-authored-by: HandH1998 <1335248067@qq.com>

dcb1a944d4  sroy745  (2024-12-26 19:02:58 +09:00)
    [V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (#10681)
    Signed-off-by: Sourashis Roy <sroy@roblox.com>
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

b866cdbd05  Dipika Sikka  (2024-12-24 02:23:38 +08:00)
    [Misc] Add assertion and helpful message for marlin24 compressed models (#11388)

51ff216d85  George  (2024-12-21 06:36:23 +00:00)
    [Bugfix] update should_ignore_layer (#11354)
    Signed-off-by: George Ohashi <george@neuralmagic.com>

86c2d8fd1c  Wallas Henrique  (2024-12-20 05:15:31 +00:00)
    [Bugfix] Fix spec decoding when seed is none in a batch (#10863)
    Signed-off-by: Wallas Santos <wallashss@ibm.com>

276738ce0f  Isotr0py  (2024-12-19 17:37:31 +00:00)
    [Bugfix] Fix broken CPU compressed-tensors test (#11338)
    Signed-off-by: Isotr0py <2037008807@qq.com>

5a9da2e6e9  Tyler Michael Smith  (2024-12-19 02:43:30 +00:00)
    [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

60508ffda9  Dipika Sikka  (2024-12-18 09:57:16 -05:00)
    [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
    Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
    Co-authored-by: ilmarkov <markovilya197@gmail.com>
    Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
    Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

15859f2357  Jee Jee Li  (2024-12-15 03:03:06 +00:00)
    [Misc] Upgrade bitsandbytes to the latest version 0.45.0 (#11201)

eeec9e3390  Cyrus Leung  (2024-12-13 10:40:07 +00:00)
    [Frontend] Separate pooling APIs in offline inference (#11129)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

28b3a1c7e5  Tyler Michael Smith  (2024-12-10 06:28:14 +00:00)
    [V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

d1f6d1c8af  Isotr0py  (2024-12-10 10:23:07 +08:00)
    [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba (#10739)
    Signed-off-by: Isotr0py <2037008807@qq.com>

133707123e  Cyrus Leung  (2024-12-01 08:02:54 +08:00)
    [Model] Replace embedding models with pooling adapter (#10769)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

fa6ecb9aa7  Cyrus Leung  (2024-11-29 04:47:06 +00:00)
    [Model] Clean up MiniCPMV (#10751)
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

b98c62ba49  Isotr0py  (2024-11-27 10:43:17 -08:00)
    [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
    Signed-off-by: Isotr0py <2037008807@qq.com>

cfb3bf25fb  yansh97  (2024-11-27 13:55:23 +08:00)
    [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig (#10657)

0a71900bc9  Chendi.Xue  (2024-11-26 17:57:11 -08:00)
    Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
    Signed-off-by: Chendi Xue <chendi.xue@intel.com>