527 Commits

Author SHA1 Message Date
Cyrus Leung
e1a3f5e831
[CI/Build] Update models tests & examples (#8874)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-28 09:54:35 -07:00
Lucas Wilkinson
c5d55356f9
[Bugfix] fix for deepseek w4a16 (#8906)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-09-27 13:12:34 -06:00
Luka Govedič
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271) 2024-09-27 14:25:10 -04:00
Michael Goin
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors (#8588) 2024-09-25 09:43:36 -07:00
bnellnm
300da09177
[Kernel] Fullgraph and opcheck tests (#8479) 2024-09-25 08:35:52 -06:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Jee Jee Li
13f9f7a3d0
[[Misc]Upgrade bitsandbytes to the latest version 0.44.0 (#8768) 2024-09-24 17:08:55 -07:00
Lucas Wilkinson
72fc97a0f1
[Bugfix] Fix torch dynamo fixes caused by replace_parameters (#8748) 2024-09-24 14:33:21 -04:00
jiqing-feng
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids (#8562) 2024-09-23 12:32:27 -07:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Lily Liu
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. (#8701) 2024-09-22 12:34:14 -07:00
rasmith
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (#8646) 2024-09-21 09:20:54 +00:00
盏一
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p (#8619) 2024-09-19 18:28:25 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
Gregory Shtrasberg
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 10:41:08 -07:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) 2024-09-17 08:09:12 -07:00
Simon Mo
546034b466
[refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
ElizaWszola
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00
Kunshang Ji
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
Co-authored-by: Yan Ma <yan.ma@intel.com>
2024-09-13 16:54:34 -07:00
Dipika Sikka
8f44a92d85
[BugFix] fix group_topk (#8430) 2024-09-13 09:23:42 +08:00
Woosuk Kwon
42ffba11ad
[Misc] Use RoPE cache for MRoPE (#8396) 2024-09-11 23:13:14 -07:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) 2024-09-11 09:46:46 -07:00
Yang Fan
3b7fea770f
[Model][VLM] Add Qwen2-VL model support (#7905)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-11 09:31:19 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) 2024-09-11 00:38:40 -04:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ (#8217) 2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering (#8135) 2024-09-09 16:27:26 -04:00
Dipika Sikka
23f322297f
[Misc] Remove SqueezeLLM (#8220) 2024-09-06 16:29:03 -06:00
rasmith
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248) 2024-09-06 16:26:09 -06:00
Michael Goin
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165) 2024-09-05 11:09:46 -04:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
Woosuk Kwon
4624d98dbd
[Misc] Clean up RoPE forward_native (#8076) 2024-09-04 20:31:48 -07:00
Harsha vardhan manoj Bikki
008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… (#8062)
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-09-04 16:33:43 -07:00
Dipika Sikka
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters (#7972)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-03 20:12:41 -06:00
Dipika Sikka
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters (#7976) 2024-09-03 17:21:44 -04:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) 2024-09-01 21:23:29 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step (#7652) 2024-08-29 19:19:08 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models (#7445) 2024-08-29 19:09:08 -04:00
Dipika Sikka
86a677de42
[misc] update tpu int8 to use new vLLM Parameters (#7973) 2024-08-29 16:46:55 -04:00
Isotr0py
d78789ac16
[Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism (#7954) 2024-08-29 15:54:49 -04:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) 2024-08-28 15:06:52 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) 2024-08-28 15:37:47 -04:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
Dipika Sikka
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights (#7825) 2024-08-26 18:09:34 -06:00
Dipika Sikka
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters (#7762)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-26 17:44:54 -04:00