Roger Wang | 7a9cb294ae | [Frontend] Add OpenAI Vision API Support (#5237) | 2024-06-07 11:23:32 -07:00
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Dipika Sikka | ca3ea51bde | [Kernel] Dynamic Per-Token Activation Quantization (#5037) | 2024-06-07 09:36:26 -07:00
    Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
limingshu | dc49fb892c | Add missing ignored_seq_groups in _schedule_chunked_prefill (#5296) | 2024-06-07 13:35:42 +00:00
Antoni Baum | 18a277b52d | Remove Ray health check (#4693) | 2024-06-07 10:01:56 +00:00
Tyler Michael Smith | 8d75fe48ca | [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) | 2024-06-07 08:42:35 +00:00
    Switches from torch._scaled_mm to vLLM's CUTLASS fp8 kernels when supported, since we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.
    See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
youkaichao | 388596c914 | [Misc][Utils] allow get_open_port to be called multiple times (#5333) | 2024-06-06 22:15:11 -07:00
Itay Etelis | baa15a9ec3 | [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest (#5135) | 2024-06-07 03:29:24 +00:00
Jie Fu (傅杰) | 15063741e3 | [Misc] Missing error message for custom ops import (#5282) | 2024-06-06 20:17:21 -07:00
Antoni Baum | ccdc490dda | [Core] Change LoRA embedding sharding to support loading methods (#5038) | 2024-06-06 19:07:57 -07:00
Antoni Baum | a31cab7556 | [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) | 2024-06-06 18:12:00 -07:00
Matthew Goldey | 828da0d44e | [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) | 2024-06-06 15:48:13 -05:00
Philipp Moritz | abe855d637 | [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) | 2024-06-06 09:29:29 -07:00
liuyhwangyh | 4efff036f0 | Bugfix: fix broken model download from ModelScope (#5233) | 2024-06-06 09:28:10 -07:00
    Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
Cyrus Leung | 89c920785f | [CI/Build] Update vision tests (#5307) | 2024-06-06 05:17:18 -05:00
Breno Faria | 7b0a0dfb22 | [Frontend][Core] Update Outlines Integration from FSM to Guide (#4109) | 2024-06-05 16:49:12 -07:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Co-authored-by: Breno Faria <breno.faria@intrafind.com>
Woosuk Kwon | 6a7c7711a2 | [Misc] Skip for logits_scale == 1.0 (#5291) | 2024-06-05 15:19:02 -07:00
Alex Wu | 0f83ddd4d7 | [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) | 2024-06-05 15:18:12 -07:00
Michael Goin | 065aff6c16 | [Bugfix] Make EngineArgs use named arguments for config construction (#5285) | 2024-06-05 15:16:56 -07:00
Nick Hill | 3d33e372a1 | [BugFix] Fix log message about default max model length (#5284) | 2024-06-05 14:53:16 -07:00
Nick Hill | faf71bcd4b | [Speculative Decoding] Add ProposerWorkerBase abstract class (#5252) | 2024-06-05 14:53:05 -07:00
Philipp Moritz | 51a08e7d8f | [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) | 2024-06-05 10:59:14 -07:00
DriverSong | eb8fcd2666 | [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207) | 2024-06-05 10:59:02 -07:00
    Co-authored-by: qiujiawei9 <qiujiawei9@jd.com>
Cody Yu | 5563a4dea8 | [Model] Correct Mixtral FP8 checkpoint loading (#5231) | 2024-06-05 10:58:50 -07:00
tomeras91 | f0a500545f | [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) (#5278) | 2024-06-05 09:32:58 -07:00
Woosuk Kwon | c65146e75e | [Misc] Fix docstring of get_attn_backend (#5271) | 2024-06-05 09:18:59 -07:00
Woosuk Kwon | 41ca62cf03 | [Misc] Add CustomOp interface for device portability (#5255) | 2024-06-05 09:18:19 -07:00
zifeitong | 974fc9b845 | [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) | 2024-06-04 19:37:28 -07:00
zifeitong | a58f24e590 | [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) | 2024-06-03 20:55:50 -07:00
Woosuk Kwon | 3a434b07ed | [Kernel] Enhance MoE benchmarking & tuning script (#4921) | 2024-06-03 20:06:59 -07:00
Toshiki Kataoka | 06b2550cbb | [Bugfix] Support prompt_logprobs==0 (#5217) | 2024-06-03 17:59:30 -07:00
Breno Faria | f775a07e30 | [FRONTEND] OpenAI tools support named functions (#5032) | 2024-06-03 18:25:29 -05:00
Kaiyang Chen | 10c38e3e46 | [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) | 2024-06-03 13:37:11 -07:00
Tyler Michael Smith | cbb2f59cc8 | [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) | 2024-06-03 09:52:30 -07:00
Antoni Baum | 0ab278ca31 | [Core] Remove unnecessary copies in flash attn backend (#5138) | 2024-06-03 09:39:31 -07:00
Cyrus Leung | 7a64d24aad | [Core] Support image processor (#4197) | 2024-06-02 22:56:41 -07:00
Divakar Verma | a66cf40b20 | [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) | 2024-06-02 14:13:26 -07:00
    This PR enables the fused topk_softmax kernel used in the MoE layer for HIP.
Avinash Raj | f790ad3c50 | [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643) | 2024-06-02 08:06:13 +00:00
Robert Shaw | 044793d8df | [BugFix] Prevent LLM.encode for non-generation Models (#5184) | 2024-06-01 23:35:41 +00:00
    Co-authored-by: mgoin <michael@neuralmagic.com>
Zhuohan Li | 8279078e21 | [Bugfix] Remove deprecated @abstractproperty (#5174) | 2024-06-01 22:40:25 +00:00
chenqianfzh | b9c0605a8e | [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) | 2024-06-01 14:51:10 -06:00
Nadav Shmayovits | 37464a0f74 | [Bugfix] Fix call to init_logger in openai server (#4765) | 2024-06-01 17:18:50 +00:00
Ye Cao | c354072828 | [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) | 2024-06-01 17:11:22 +00:00
    Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
Tyler Michael Smith | 260d119e86 | [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) | 2024-06-01 06:45:32 +00:00
Cody Yu | e9899fb7a4 | [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) | 2024-05-31 14:29:19 -07:00
functionxu123 | a377f0bd5e | [Misc]: optimize eager mode host time (#4196) | 2024-05-31 13:14:50 +08:00
    Co-authored-by: xuhao <xuhao@cambricon.com>
SnowDist | a22dea54d3 | [Model] Support MAP-NEO model (#5081) | 2024-05-30 19:24:41 -07:00
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Robert Shaw | b35be5403f | [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) | 2024-05-30 17:04:37 -07:00
Simon Mo | 87a658c812 | Bump version to v0.4.3 (#5046) | 2024-05-30 11:13:46 -07:00
Cyrus Leung | a9bcc7afb2 | [Doc] Use intersphinx and update entrypoints docs (#5125) | 2024-05-30 09:59:23 -07:00
Hyunsung Lee | d79d9eaaff | [Misc] remove duplicate definition of seq_lens_tensor in model_runner.py (#5129) | 2024-05-30 06:56:19 -07:00