4658 Commits

Author SHA1 Message Date
Lucas Wilkinson
c786e757fa
[Attention] Use FA3 for MLA on Hopper (#12807)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-06 11:43:12 +00:00
Simon Mo
cefd56ee35
[Docs] Add Google Cloud Slides (#12814) 2025-02-06 01:02:38 -08:00
Dipika Sikka
7ca9934fe7
[Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) 2025-02-06 01:02:14 -08:00
youkaichao
0408efc6d0
[Misc] Improve error message for incorrect pynvml (#12809)
Signed-off-by: youkaichao <youkaichao@gmail.com>
v0.7.2
2025-02-06 15:23:50 +08:00
Michael Goin
449d1bce02
[Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793) 2025-02-05 23:16:20 -08:00
Harry Mellor
1a6fcad4c9
Improve TransformersModel UX (#12785) 2025-02-05 22:24:57 -08:00
Lu Fang
56534cd577
[Bugfix] Fix the test_ultravox.py's license (#12806)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-06 13:25:54 +08:00
Sumit Vij
d88506dda4
[Model] LoRA Support for Ultravox model (#11253) 2025-02-05 19:54:13 -08:00
Lu Fang
9cdea30b4f
[Misc][Easy] Remove the space from the file name 2025-02-05 19:23:35 -08:00
Lucas Wilkinson
76abd0c881
[Bugfix] Better FP8 supported defaults 2025-02-05 19:22:19 -08:00
Gregory Shtrasberg
5b19b93082
[ROCm][Kernel] Using the correct warp_size value 2025-02-05 19:15:08 -08:00
Cyrus Leung
75404d041b
[VLM] Update compatibility with transformers 4.49 2025-02-05 19:09:45 -08:00
Roger Wang
bf3b79efb8
[VLM] Qwen2.5-VL 2025-02-05 13:31:38 -08:00
Russell Bryant
9a5b1554b4
[Docs] Drop duplicate [source] links 2025-02-05 13:30:50 -08:00
Cyrus Leung
a4ce74c14a
[VLM] Use shared field to pass token ids to model 2025-02-05 13:30:46 -08:00
Rahul Tuli
3b2005e1db
Add: Support for Sparse24Bitmask Compressed Models 2025-02-05 13:30:43 -08:00
Sanju C Sudhakaran
af8486de49
[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) 2025-02-05 13:29:45 -08:00
Chen Zhang
4c3aac51e1
Merging PR #12536
Merged via CLI script
2025-02-05 13:24:26 -08:00
youkaichao
bc1bdecebf
[core][distributed] exact ray placement control (#12732)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-06 02:03:19 +08:00
Akash kaothalkar
022bcc701a
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546) 2025-02-04 23:11:02 -08:00
Michael Goin
c53dc466b1
[Doc] Remove performance warning for auto_awq.md (#12743) 2025-02-04 22:43:11 -08:00
Nick Hill
3d09e592a8
[V1][Misc] Shorten FinishReason enum and use constant strings (#12760) 2025-02-04 22:43:02 -08:00
Harry Mellor
fcf2e3d7fc
[Bugfix] Fix OpenVINO model runner (#12750) 2025-02-04 22:42:46 -08:00
Michael Goin
58b218d7ae
[Doc] Update PR Reminder with link to Developer Slack (#12748) 2025-02-04 22:42:09 -08:00
Kyle Sayers
7ff7a638b6
[Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-02-05 05:32:06 +00:00
Dipika Sikka
686006a220
[Misc] Bump the compressed-tensors version (#12736) 2025-02-04 20:44:48 -08:00
Isotr0py
98fd089fc9
[VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) 2025-02-04 20:44:26 -08:00
Harry Mellor
249824c3bf
Refactor Linear handling in TransformersModel (#12727)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-05 04:31:12 +00:00
Aleksandr Malyshev
64862d106e
[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-02-05 03:58:22 +00:00
Aviv Keshet
b3a0d01e45
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS (#12368)
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
2025-02-04 18:46:26 -08:00
Lucas Wilkinson
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-02-04 18:22:24 -08:00
Mark McLoughlin
233df6f5c4
[V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-04 19:46:54 -05:00
Cyrus Leung
18016a5e62
[Bugfix] Fix CI failures for InternVL and Mantis models (#12728)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-04 23:54:23 +08:00
Sophie du Couédic
649550f27e
[Build] update requirements of no-device for plugin usage (#12630)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
2025-02-04 21:19:12 +08:00
Kero Liang
62467a834a
Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722)
Signed-off-by: imkero <kerorek@outlook.com>
2025-02-04 21:03:19 +08:00
Michael Greenbaum
6469038b14
[Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689)
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com>
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>
2025-02-04 20:58:48 +08:00
Isotr0py
815079de8e
[VLM] merged multimodal processor and V1 support for idefics3 (#12660)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-02-04 20:00:51 +08:00
Woosuk Kwon
18a88fcccc
[V1] Remove scheduling constraint on partial requests (#12674)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-04 02:43:58 -08:00
Cyrus Leung
d1ca7df84d
[VLM] Merged multi-modal processor for InternVL-based models (#12553)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-02-04 16:44:52 +08:00
Jee Jee Li
96b23621c1
[Misc] Add BNB quantization for Whisper (#12381)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-04 16:27:36 +08:00
Hongxia Yang
c36ac98d01
[AMD][ROCm] Enable DeepSeek model on ROCm (#12662)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
2025-02-04 08:24:11 +00:00
Kyle Sayers
4896d0c2dd
[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) 2025-02-03 23:27:11 -08:00
Thomas Parnell
bb392af434
[Doc] Replace ibm-fms with ibm-ai-platform (#12709)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-02-04 07:05:04 +00:00
Michael Goin
5d98d56089
Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-02-04 11:55:46 +08:00
Russell Bryant
73b35cca7f
[Core] Improve hash collision avoidance in prefix caching (#12621)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-03 16:28:20 -08:00
Cody Yu
5095e96606
[V1] Revert uncache_blocks and support recaching full blocks (#12415)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-03 15:04:53 -08:00
Cody Yu
cf58b9c4ca
[MISC] Remove model input dumping when exception (#12582)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-03 13:34:16 -08:00
kushanam
4797dad3ec
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) 2025-02-03 13:30:39 -08:00
Kyle Sayers
6dd5e52823
Squelch MLA warning for Compressed-Tensors Models (#12704)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-02-03 13:29:56 -08:00
Tyler Michael Smith
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-03 13:04:59 -08:00