Mor Zusman
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
Jee Jee Li
3d49776bbb
[Model][LoRA]LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
Zilin Zhu
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
ElizaWszola
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-28 18:19:40 -07:00
Cyrus Leung
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-28 09:54:35 -07:00
Lucas Wilkinson
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-09-27 13:12:34 -06:00
Luka Govedič
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
Isotr0py
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
Roger Wang
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00
Michael Goin
7193774b1f
[Misc] Support quantization of MllamaForCausalLM ( #8822 )
2024-09-25 14:46:22 -07:00
Chen Zhang
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Chang Su <chang.s.su@oracle.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-25 13:29:32 -07:00
Michael Goin
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
DefTruth
0c4d2ad5e6
[VLM][Bugfix] internvl with num_scheduler_steps > 1 ( #8614 )
2024-09-25 09:35:53 -07:00
bnellnm
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
sohamparikh
3e073e66f1
[Bugfix] load fc bias from config for eagle ( #8790 )
2024-09-24 23:16:30 -07:00
Isotr0py
c23953675f
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend ( #8770 )
2024-09-24 23:16:11 -07:00
zifeitong
e3dd0692fa
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv ( #8250 )
2024-09-25 05:53:43 +00:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Jee Jee Li
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
Lucas Wilkinson
72fc97a0f1
[Bugfix] Fix torch dynamo issues caused by replace_parameters ( #8748 )
2024-09-24 14:33:21 -04:00
Alex Brooks
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-24 07:36:46 +00:00
Peter Salas
3f06bae907
[Core][Model] Support loading weights by ID within models ( #7931 )
2024-09-24 07:14:15 +00:00
jiqing-feng
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Jani Monoses
f2bd246c17
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45: use config.text_config.vocab_size ( #8707 )
2024-09-23 14:43:09 +00:00
Yanyi Liu
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
Lily Liu
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
litianjian
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-22 10:51:44 -07:00
Cyrus Leung
06ed2815e2
[Model] Refactor BLIP/BLIP-2 to support composite model loading ( #8407 )
2024-09-22 12:24:21 +00:00
Isotr0py
13d88d4137
[Bugfix] Refactor composite weight loading logic ( #8656 )
2024-09-22 04:33:27 +00:00
Divakar Verma
9dc7c6c7f3
[dbrx] refactor dbrx experts to extend FusedMoe class ( #8518 )
2024-09-21 15:09:39 -06:00
rasmith
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 ( #8646 )
2024-09-21 09:20:54 +00:00
Cyrus Leung
5e85f4f82a
[VLM] Use SequenceData.from_token_counts to create dummy data ( #8687 )
2024-09-20 23:28:56 -07:00
zyddnys
0f961b3ce9
[Bugfix] Fix incorrect llava next feature size calculation ( #8496 )
2024-09-20 22:48:32 +00:00
Niklas Muennighoff
3b63de9353
[Model] Add OLMoE ( #7922 )
2024-09-20 09:31:41 -07:00
Amit Garg
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference ( #8571 )
2024-09-20 08:54:02 +08:00
盏一
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p ( #8619 )
2024-09-19 18:28:25 +00:00
Roger Wang
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" ( #8593 )
2024-09-19 04:14:28 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
Gregory Shtrasberg
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call ( #8380 )
...
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 10:41:08 -07:00
Geun, Lim
e18749ff09
[Model] Support Solar Model ( #8386 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 11:04:00 -06:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
Joe Runde
98f9713399
[Bugfix] Fix TP > 1 for new granite ( #8544 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-17 23:17:08 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
Roger Wang
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer ( #8521 )
2024-09-16 22:22:45 -07:00
Simon Mo
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00