Author | Commit | Subject | Date
Lucas Wilkinson | c786e757fa | [Attention] Use FA3 for MLA on Hopper (#12807) | 2025-02-06 11:43:12 +00:00
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Simon Mo | cefd56ee35 | [Docs] Add Google Cloud Slides (#12814) | 2025-02-06 01:02:38 -08:00
Dipika Sikka | 7ca9934fe7 | [Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) | 2025-02-06 01:02:14 -08:00
youkaichao | 0408efc6d0 (tag: v0.7.2) | [Misc] Improve error message for incorrect pynvml (#12809) | 2025-02-06 15:23:50 +08:00
    Signed-off-by: youkaichao <youkaichao@gmail.com>
Michael Goin | 449d1bce02 | [Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793) | 2025-02-05 23:16:20 -08:00
Harry Mellor | 1a6fcad4c9 | Improve TransformersModel UX (#12785) | 2025-02-05 22:24:57 -08:00
Lu Fang | 56534cd577 | [Bugfix] Fix the test_ultravox.py's license (#12806) | 2025-02-06 13:25:54 +08:00
    Signed-off-by: Lu Fang <lufang@fb.com>
Sumit Vij | d88506dda4 | [Model] LoRA Support for Ultravox model (#11253) | 2025-02-05 19:54:13 -08:00
Lu Fang | 9cdea30b4f | [Misc][Easy] Remove the space from the file name | 2025-02-05 19:23:35 -08:00
Lucas Wilkinson | 76abd0c881 | [Bugfix] Better FP8 supported defaults | 2025-02-05 19:22:19 -08:00
Gregory Shtrasberg | 5b19b93082 | [ROCm][Kernel] Using the correct warp_size value | 2025-02-05 19:15:08 -08:00
Cyrus Leung | 75404d041b | [VLM] Update compatibility with transformers 4.49 | 2025-02-05 19:09:45 -08:00
Roger Wang | bf3b79efb8 | [VLM] Qwen2.5-VL | 2025-02-05 13:31:38 -08:00
Russell Bryant | 9a5b1554b4 | [Docs] Drop duplicate [source] links | 2025-02-05 13:30:50 -08:00
Cyrus Leung | a4ce74c14a | [VLM] Use shared field to pass token ids to model | 2025-02-05 13:30:46 -08:00
Rahul Tuli | 3b2005e1db | Add: Support for Sparse24Bitmask Compressed Models | 2025-02-05 13:30:43 -08:00
Sanju C Sudhakaran | af8486de49 | [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) | 2025-02-05 13:29:45 -08:00
Chen Zhang | 4c3aac51e1 | Merging PR #12536 | 2025-02-05 13:24:26 -08:00
    Merged via CLI script
youkaichao | bc1bdecebf | [core][distributed] exact ray placement control (#12732) | 2025-02-06 02:03:19 +08:00
    Signed-off-by: youkaichao <youkaichao@gmail.com>
Akash kaothalkar | 022bcc701a | [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546) | 2025-02-04 23:11:02 -08:00
Michael Goin | c53dc466b1 | [Doc] Remove performance warning for auto_awq.md (#12743) | 2025-02-04 22:43:11 -08:00
Nick Hill | 3d09e592a8 | [V1][Misc] Shorten FinishReason enum and use constant strings (#12760) | 2025-02-04 22:43:02 -08:00
Harry Mellor | fcf2e3d7fc | [Bugfix] Fix OpenVINO model runner (#12750) | 2025-02-04 22:42:46 -08:00
Michael Goin | 58b218d7ae | [Doc] Update PR Reminder with link to Developer Slack (#12748) | 2025-02-04 22:42:09 -08:00
Kyle Sayers | 7ff7a638b6 | [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634) | 2025-02-05 05:32:06 +00:00
    Signed-off-by: mgoin <michael@neuralmagic.com>
    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
    Co-authored-by: mgoin <michael@neuralmagic.com>
Dipika Sikka | 686006a220 | [Misc] Bump the compressed-tensors version (#12736) | 2025-02-04 20:44:48 -08:00
Isotr0py | 98fd089fc9 | [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) | 2025-02-04 20:44:26 -08:00
Harry Mellor | 249824c3bf | Refactor Linear handling in TransformersModel (#12727) | 2025-02-05 04:31:12 +00:00
    Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Aleksandr Malyshev | 64862d106e | [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713) | 2025-02-05 03:58:22 +00:00
    Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
    Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Aviv Keshet | b3a0d01e45 | [Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS (#12368) | 2025-02-04 18:46:26 -08:00
    Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
Lucas Wilkinson | 75e94309e8 | [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676) | 2025-02-04 18:22:24 -08:00
    Signed-off-by: simon-mo <xmo@berkeley.edu>
    Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
    Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
    Co-authored-by: simon-mo <xmo@berkeley.edu>
Mark McLoughlin | 233df6f5c4 | [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) | 2025-02-04 19:46:54 -05:00
    Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Cyrus Leung | 18016a5e62 | [Bugfix] Fix CI failures for InternVL and Mantis models (#12728) | 2025-02-04 23:54:23 +08:00
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Sophie du Couédic | 649550f27e | [Build] update requirements of no-device for plugin usage (#12630) | 2025-02-04 21:19:12 +08:00
    Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
Kero Liang | 62467a834a | Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722) | 2025-02-04 21:03:19 +08:00
    Signed-off-by: imkero <kerorek@outlook.com>
Michael Greenbaum | 6469038b14 | [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689) | 2025-02-04 20:58:48 +08:00
    Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com>
    Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>
Isotr0py | 815079de8e | [VLM] merged multimodal processor and V1 support for idefics3 (#12660) | 2025-02-04 20:00:51 +08:00
    Signed-off-by: Isotr0py <2037008807@qq.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Woosuk Kwon | 18a88fcccc | [V1] Remove scheduling constraint on partial requests (#12674) | 2025-02-04 02:43:58 -08:00
    Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Cyrus Leung | d1ca7df84d | [VLM] Merged multi-modal processor for InternVL-based models (#12553) | 2025-02-04 16:44:52 +08:00
    Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    Signed-off-by: Isotr0py <2037008807@qq.com>
    Co-authored-by: Isotr0py <2037008807@qq.com>
Jee Jee Li | 96b23621c1 | [Misc] Add BNB quantization for Whisper (#12381) | 2025-02-04 16:27:36 +08:00
    Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Hongxia Yang | c36ac98d01 | [AMD][ROCm] Enable DeepSeek model on ROCm (#12662) | 2025-02-04 08:24:11 +00:00
    Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
    Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Kyle Sayers | 4896d0c2dd | [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) | 2025-02-03 23:27:11 -08:00
Thomas Parnell | bb392af434 | [Doc] Replace ibm-fms with ibm-ai-platform (#12709) | 2025-02-04 07:05:04 +00:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Michael Goin | 5d98d56089 | Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710) | 2025-02-04 11:55:46 +08:00
    Signed-off-by: mgoin <michael@neuralmagic.com>
Russell Bryant | 73b35cca7f | [Core] Improve hash collision avoidance in prefix caching (#12621) | 2025-02-03 16:28:20 -08:00
    Signed-off-by: Russell Bryant <rbryant@redhat.com>
Cody Yu | 5095e96606 | [V1] Revert uncache_blocks and support recaching full blocks (#12415) | 2025-02-03 15:04:53 -08:00
    Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Cody Yu | cf58b9c4ca | [MISC] Remove model input dumping when exception (#12582) | 2025-02-03 13:34:16 -08:00
    Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
kushanam | 4797dad3ec | [Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) | 2025-02-03 13:30:39 -08:00
Kyle Sayers | 6dd5e52823 | Squelch MLA warning for Compressed-Tensors Models (#12704) | 2025-02-03 13:29:56 -08:00
    Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Tyler Michael Smith | c11de33dad | [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696) | 2025-02-03 13:04:59 -08:00
    Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>