Woosuk Kwon
e72ae80b06
[Bugfix] Support 2D input shape in MoE layer ( #6287 )
2024-07-10 09:03:16 -04:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
Baoyuan Qi
d3a245138a
[Bugfix] fix and needs_scalar_to_array logic check ( #6238 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-07-09 23:43:24 +00:00
tomeras91
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
Eric
185ad31f37
[Bugfix] use diskcache in outlines _get_guide #5436 ( #6203 )
2024-07-08 11:23:24 -07:00
Avshalom Manevich
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Roger Wang
6206dcb29e
[Model] Add PaliGemma ( #5189 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
Cyrus Leung
ea4b570483
[VLM] Cleanup validation and update docs ( #6149 )
2024-07-05 05:49:38 +00:00
Roger Wang
a41357e941
[VLM] Improve consistency between feature size calculation and dummy data for profiling ( #6146 )
2024-07-05 09:29:47 +08:00
Cyrus Leung
ae96ef8fbd
[VLM] Calculate maximum number of multi-modal tokens by model ( #6121 )
2024-07-04 16:37:23 -07:00
Lily Liu
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer ( #6051 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Roger Wang
7cd2ebb025
[Bugfix] Fix compute_logits in Jamba ( #6093 )
2024-07-03 00:32:35 -07:00
Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Mor Zusman
9d6a8daa87
[Model] Jamba support ( #4115 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 23:11:29 +00:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
Robert Shaw
7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral ( #5970 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-02 21:54:35 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
xwjiang2010
98d6682cd1
[VLM] Remove image_input_type from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
Thomas Parnell
54600709b6
[Model] Changes to MLPSpeculator to support tie_weights and input_scale ( #5965 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com>
2024-07-01 16:40:02 -07:00
Avshalom Manevich
12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs ( #6029 )
2024-07-01 21:08:29 +00:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidental cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
7836fdcc11
[Misc] Fix get_min_capability ( #5971 )
2024-06-30 20:15:16 +00:00
Cyrus Leung
99397da534
[CI/Build] Add TP test for vision models ( #5892 )
2024-06-29 15:45:54 +00:00
Robert Shaw
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
Cody Yu
f7dac83d95
[Kernel] Raise an exception in MoE kernel if the batch size is larger than 65k ( #5939 )
2024-06-29 21:04:20 +08:00
Woosuk Kwon
580353da93
[Bugfix] Fix precisions in Gemma 1 ( #5913 )
2024-06-29 03:10:21 +00:00
Tyler Michael Smith
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
wangding zeng
be0b3af9e0
Support Deepseek-V2 ( #4650 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-28 13:24:57 -07:00
Robert Shaw
2cd402e169
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 ( #5921 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 18:43:49 +00:00
Robert Shaw
b185230744
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) ( #5928 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 13:49:57 -04:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
Tyler Michael Smith
5932634409
Unmark fused_moe config json file as executable ( #5960 )
2024-06-28 06:36:12 -07:00
Cyrus Leung
5cbe8d155c
[Core] Registry for processing model inputs ( #5214 )
...
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Divakar Verma
c3dde367f1
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X ( #5932 )
2024-06-27 13:41:08 -07:00
Woosuk Kwon
79c92c7c8a
[Model] Add Gemma 2 ( #5908 )
2024-06-27 13:33:56 -07:00
Nick Hill
691e29ecf3
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens ( #5876 )
2024-06-27 10:59:33 -07:00
Cyrus Leung
98cf2ed678
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision ( #5896 )
2024-06-27 09:08:10 -07:00
Roger Wang
2061f0b8a7
[Bugfix] Fix img_sizes Parsing in Phi3-Vision ( #5888 )
2024-06-27 08:29:24 +00:00
Cyrus Leung
96354d6a29
[Model] Add base class for LoRA-supported models ( #5018 )
2024-06-27 16:03:04 +08:00
Woosuk Kwon
6806998bf9
[Bugfix] Fix embedding to support 2D inputs ( #5829 )
2024-06-26 00:15:22 -07:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Isotr0py
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00