606 Commits

Lucas Wilkinson  311f743831  2024-08-07 20:05:37 +00:00
    [Bugfix] Fix gptq failure on T4s (#7264)

Michael Goin  5223199e03  2024-08-07 11:23:12 -07:00
    [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219)

Isotr0py  b764547616  2024-08-07 09:32:07 -07:00
    [Bugfix] Fix input processor for InternVL2 model (#7164)
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Dipika Sikka  0f7052bc7e  2024-08-07 09:17:58 -07:00
    [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874)

Michael Goin  f9a5600649  2024-08-06 18:34:26 -07:00
    [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225)

afeldman-nm  fd95e026e0  2024-08-06 16:51:47 -04:00
    [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
    Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>

Lily Liu  5c60c8c423  2024-08-06 10:40:32 -07:00
    [SpecDecode] [Minor] Fix spec decode sampler tests (#7183)

Cyrus Leung  1f26efbb3a  2024-08-06 16:55:31 +08:00
    [Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

Isotr0py  360bd67cf0  2024-08-05 17:54:23 -06:00
    [Core] Support loading GGUF model (#5191)
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

Thomas Parnell  789937af2e  2024-08-05 23:29:43 +00:00
    [Doc] [SpecDecode] Update MLPSpeculator documentation (#7100)
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Jungho Christopher Cho  c0d8f1636c  2024-08-05 06:22:12 +00:00
    [Model] SiglipVisionModel ported from transformers (#6942)
    Co-authored-by: Roger Wang <ywang@roblox.com>

Alphi  7b86e7c9cd  2024-08-05 09:23:17 +08:00
    [Model] Add multi-image support for minicpmv (#7122)
    Co-authored-by: hezhihui <hzh7269@modelbest.cn>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Jee Jee Li  179a6a36f2  2024-08-04 08:12:41 +00:00
    [Model]Refactor MiniCPMV (#7020)
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Yihuan Bu  654bc5ca49  2024-08-04 03:12:09 +00:00
    Support for guided decoding for offline LLM (#6878)
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Isotr0py  0c25435daa  2024-08-02 22:36:14 -07:00
    [Model] Refactor and decouple weight loading logic for InternVL2 model (#7067)

Robert Shaw  ed812a73fa  2024-08-02 18:27:28 -07:00
    [ Frontend ] Multiprocessing for OpenAI Server with zeromq (#6883)
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
    Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
    Co-authored-by: Joe Runde <joe@joerun.de>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
    Co-authored-by: Simon Mo <simon.mo@hey.com>

Lucas Wilkinson  a8d604ca2a  2024-08-02 13:51:58 -07:00
    [Misc] Disambiguate quantized types via a new ScalarType (#6396)

Peng Guanwen  db35186391  2024-08-02 00:58:26 -07:00
    [Core] Comment out unused code in sampler (#7023)

Woosuk Kwon  805a8a75f2  2024-08-01 13:14:37 -07:00
    [Misc] Support attention logits soft-capping with flash-attn (#7022)

Murali Andoorveedu  fc912e0886  2024-08-01 12:40:43 -07:00
    [Models] Support Qwen model with PP (#6974)
    Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>

Michael Goin  f4fd390f5d  2024-08-01 12:01:07 -07:00
    [Bugfix] Lower gemma's unloaded_params exception to warning (#7002)

Isotr0py  2dd34371a6  2024-08-01 12:00:28 -07:00
    [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm (#6992)

Sage Moore  7e0861bd0b  2024-08-01 11:11:24 -07:00
    [CI/Build] Update PyTorch to 2.4.0 (#6951)
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

Travis Johnson  630dd9e0ae  2024-07-31 19:49:11 -07:00
    [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings (#6758)
    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

Woosuk Kwon  23993a7997  2024-07-31 18:50:28 -07:00
    [Bugfix][TPU] Do not use torch.Generator for TPUs (#6981)

xuyi  1d2e7fb73f  2024-07-31 18:49:51 -07:00
    [Model] Pipeline parallel support for Qwen2 (#6924)

Michael Goin  460c1884e3  2024-07-31 12:47:46 -07:00
    [Bugfix] Support cpu offloading with fp8 quantization (#6960)

Avshalom Manevich  2ee8d3ba55  2024-07-31 12:00:24 -07:00
    [Model] use FusedMoE layer in Jamba (#6935)

Cyrus Leung  daed30c4a9  2024-07-31 23:46:17 +08:00
    [Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982)

Alphi  2f4e108f75  2024-07-31 14:39:19 +00:00
    [Bugfix] Clean up MiniCPM-V (#6939)
    Co-authored-by: hezhihui <hzh7269@modelbest.cn>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

HandH1998  6512937de1  2024-07-31 07:55:21 -06:00
    Support W4A8 quantization for vllm (#5218)

Tyler Michael Smith  d7a299edaa  2024-07-30 16:37:01 -04:00
    [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842)

Nick Hill  5cf9254a9c  2024-07-30 10:40:08 -07:00
    [BugFix] Fix use of per-request seed with pipeline parallel (#6698)

Roger Wang  c66c7f86ac  2024-07-30 02:20:57 -07:00
    [Bugfix] Fix PaliGemma MMP (#6930)

Thomas Parnell  9a7e2d0534  2024-07-29 14:51:27 -07:00
    [Bugfix] Allow vllm to still work if triton is not installed. (#6786)
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Peng Guanwen  db9e5708a9  2024-07-29 16:47:31 +00:00
    [Core] Reduce unnecessary compute when logprobs=None (#6532)

Isotr0py  7cbd9ec7a9  2024-07-29 10:16:30 +00:00
    [Model] Initialize support for InternVL2 series models (#6514)
    Co-authored-by: Roger Wang <ywang@roblox.com>

Elsa Granger  3eeb148f46  2024-07-28 11:13:49 -04:00
    [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (#6871)

Alexander Matveev  75acdaa4b6  2024-07-27 17:52:33 -04:00
    [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795)

Cyrus Leung  1ad86acf17  2024-07-27 11:53:07 +00:00
    [Model] Initial support for BLIP-2 (#5920)
    Co-authored-by: ywang96 <ywang@roblox.com>

tomeras91  ed94e4f427  2024-07-26 20:45:31 -07:00
    [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba (#6784)

Woosuk Kwon  d09b94ca58  2024-07-27 01:45:57 +00:00
    [TPU] Support collective communications in XLA devices (#6813)

Michael Goin  281977bd6e  2024-07-26 17:32:44 -04:00
    [Doc] Add Nemotron to supported model docs (#6843)

Michael Goin  07278c37dd  2024-07-26 14:33:42 -04:00
    [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)

Peng Guanwen  89a84b0bb7  2024-07-25 21:31:31 -07:00
    [Core] Use array to speedup padding (#6779)

QQSong  062a1d0fab  2024-07-25 19:24:58 -07:00
    Fix ReplicatedLinear weight loading (#6793)

Lucas Wilkinson  cd7edc4e87  2024-07-25 15:05:09 -07:00
    [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors (#6798)

Michael Goin  65b1f121c8  2024-07-25 09:46:15 -07:00
    [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761)

Robert Shaw  889da130e7  2024-07-25 09:46:04 -07:00
    [ Misc ] fp8-marlin channelwise via compressed-tensors (#6524)
    Co-authored-by: mgoin <michael@neuralmagic.com>

Alexander Matveev  0310029a2f  2024-07-24 22:34:11 -07:00
    [Bugfix] Fix awq_marlin and gptq_marlin flags (#6745)