SnowDist
|
a22dea54d3
|
[Model] Support MAP-NEO model (#5081)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-05-30 19:24:41 -07:00 |
|
simon-mo
|
533c217792
|
Fix cutlass sm_90a version in CMakeList
|
2024-05-31 02:13:01 +00:00 |
|
Alexander Matveev
|
6d21fa1cad
|
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)
|
2024-05-30 21:02:11 -05:00 |
|
Robert Shaw
|
b35be5403f
|
[Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120)
|
2024-05-30 17:04:37 -07:00 |
|
Simon Mo
|
45a1a69b98
|
[Build] Disable sm_90a in cu11 (#5141)
|
2024-05-30 14:37:16 -07:00 |
|
Simon Mo
|
87a658c812
|
Bump version to v0.4.3 (#5046)
|
2024-05-30 11:13:46 -07:00 |
|
Chansung Park
|
429d89720e
|
add doc about serving option on dstack (#3074)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-05-30 10:11:07 -07:00 |
|
Cyrus Leung
|
a9bcc7afb2
|
[Doc] Use intersphinx and update entrypoints docs (#5125)
|
2024-05-30 09:59:23 -07:00 |
|
Hyunsung Lee
|
d79d9eaaff
|
[Misc] remove duplicate definition of seq_lens_tensor in model_runner.py (#5129)
|
2024-05-30 06:56:19 -07:00 |
|
youkaichao
|
f758505c73
|
[CI/Build] increase wheel size limit to 200 MB (#5130)
|
2024-05-30 06:29:48 -07:00 |
|
Robert Shaw
|
d910816c73
|
[Bugfix] Automatically Detect SparseML models (#5119)
|
2024-05-30 12:58:37 +00:00 |
|
Breno Faria
|
87d41c849d
|
[BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
|
2024-05-30 02:52:14 -07:00 |
|
omkar kakarparthi
|
e07aff9e52
|
[CI/Build] Docker cleanup functionality for amd servers (#5112)
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Co-authored-by: omkarkakarparthi <okakarpa>
|
2024-05-30 03:27:39 +00:00 |
|
Alexander Matveev
|
5bf185a1c4
|
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108)
|
2024-05-30 00:30:18 +00:00 |
|
youkaichao
|
4fbcb0f27e
|
[Doc][Build] update after removing vllm-nccl (#5103)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
|
2024-05-29 23:51:18 +00:00 |
|
Itay Etelis
|
7c3604fb68
|
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031)
|
2024-05-29 16:13:22 -07:00 |
|
Cyrus Leung
|
b1c255630d
|
[Core] Avoid the need to pass None values to Sequence.inputs (#5099)
|
2024-05-29 16:05:01 -07:00 |
|
Cyrus Leung
|
eb6c50cdc2
|
[Bugfix][CI/Build] Fix codespell failing to skip files in git diff (#5097)
|
2024-05-29 16:02:54 -07:00 |
|
Cyrus Leung
|
eecd864388
|
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators (#5096)
|
2024-05-29 16:02:25 -07:00 |
|
Ronen Schaffer
|
ae495c74ea
|
[Doc]Replace deprecated flag in readme (#4526)
|
2024-05-29 22:26:33 +00:00 |
|
afeldman-nm
|
4238bc82f2
|
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837)
|
2024-05-29 16:09:13 +00:00 |
|
youkaichao
|
594392d27a
|
[Core][Distributed] improve p2p access check (#4992)
|
2024-05-29 11:29:07 +00:00 |
|
Cyrus Leung
|
18c1f16d86
|
[Bugfix] Fix arguments passed to Sequence in stop checker test (#5092)
|
2024-05-29 07:16:41 +00:00 |
|
youkaichao
|
5bd3c65072
|
[Core][Optimization] remove vllm-nccl (#5091)
|
2024-05-29 05:13:52 +00:00 |
|
Marut Pandya
|
616e600e0b
|
[Misc] add gpu_memory_utilization arg (#5079)
Signed-off-by: pandyamarut <pandyamarut@gmail.com>
|
2024-05-28 17:16:18 -07:00 |
|
Junichi Sato
|
dfba529b40
|
[Bugfix] Remove the last EOS token unless explicitly specified (#5077)
|
2024-05-28 17:15:35 -07:00 |
|
Cyrus Leung
|
5ae5ed1e60
|
[Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-05-28 13:29:31 -07:00 |
|
Simon Mo
|
290f4ada2b
|
[Docs] Add Dropbox as sponsors (#5089)
|
2024-05-28 10:29:09 -07:00 |
|
Divakar Verma
|
dd8de11f0a
|
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
This PR adds Triton kernel configs for the MoE kernel for MI300X
|
2024-05-28 16:03:23 +00:00 |
|
Robert Shaw
|
9ba415588a
|
[BugFix] Fix Embedding Models with TP>1 (#5075)
|
2024-05-28 08:32:42 -07:00 |
|
Michał Moskal
|
d4f3985907
|
[Core] Sliding window for block manager v2 (#4545)
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
|
2024-05-28 11:07:07 +09:00 |
|
Isotr0py
|
890aa93d27
|
[Model] Add support for falcon-11B (#5069)
|
2024-05-27 16:41:43 -07:00 |
|
sasha0552
|
fbdb7b3ee2
|
[Core] Allow AQLM on Pascal (#5058)
|
2024-05-27 15:26:14 -07:00 |
|
Zhuohan Li
|
1102bef219
|
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
|
2024-05-27 15:18:17 -07:00 |
|
Roger Wang
|
f17a1a8f96
|
[Misc] Make Serving Benchmark More User-friendly (#5044)
|
2024-05-25 17:28:16 +00:00 |
|
Lily Liu
|
d5a1697772
|
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000)
|
2024-05-25 10:00:14 -07:00 |
|
youkaichao
|
325c119961
|
[Misc] add logging level env var (#5045)
|
2024-05-24 23:49:49 -07:00 |
|
Eric Xihui Lin
|
8e192ff967
|
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-05-24 22:00:52 -07:00 |
|
leiwen83
|
e64fde4b01
|
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
|
2024-05-24 10:07:09 -07:00 |
|
Robert Shaw
|
919770957f
|
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-05-24 12:28:27 +00:00 |
|
youkaichao
|
6a50f4cafa
|
[Doc] add ccache guide in doc (#5012)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-05-23 23:21:54 +00:00 |
|
Elisei Smirnov
|
e3470f8753
|
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
|
2024-05-23 22:04:24 +00:00 |
|
Dipika Sikka
|
a1242324c9
|
[Kernel] Initial Activation Quantization Support (#4525)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-05-23 21:29:18 +00:00 |
|
Murali Andoorveedu
|
5eda2ea02a
|
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
|
2024-05-23 09:54:48 -07:00 |
|
Letian Li
|
2ba80bed27
|
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (#5009)
|
2024-05-23 09:08:58 -07:00 |
|
Alexander Matveev
|
6066253296
|
Marlin 24 prefill performance improvement (about 25% better on average) (#4983)
|
2024-05-23 02:39:27 -04:00 |
|
Cody Yu
|
ee3eea0a1b
|
[Misc] Take user preference in attention selector (#4960)
|
2024-05-23 07:55:56 +09:00 |
|
Philipp Moritz
|
a36de682d4
|
[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991)
|
2024-05-22 22:26:56 +00:00 |
|
Nick Hill
|
eb6d3c264d
|
[Core] Eliminate parallel worker per-step task scheduling overhead (#4894)
|
2024-05-23 06:17:27 +09:00 |
|
raywanb
|
97b030005c
|
[Model] LoRA gptbigcode implementation (#3949)
|
2024-05-22 13:58:59 -07:00 |
|