xinyun/vllm - vllm - 丝路新云 Code Repository

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-04-26 18:47:07 +08:00

Author	SHA1	Message	Date
Roger Wang	7a9cb294ae	[Frontend] Add OpenAI Vision API Support (#5237 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-07 11:23:32 -07:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00
limingshu	dc49fb892c	Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296 )	2024-06-07 13:35:42 +00:00
Antoni Baum	18a277b52d	Remove Ray health check (#4693 )	2024-06-07 10:01:56 +00:00
Tyler Michael Smith	8d75fe48ca	[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183 ) Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8 see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.	2024-06-07 08:42:35 +00:00
youkaichao	388596c914	[Misc][Utils] allow get_open_port to be called for multiple times (#5333 )	2024-06-06 22:15:11 -07:00
Itay Etelis	baa15a9ec3	[Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135 )	2024-06-07 03:29:24 +00:00
Jie Fu (傅杰)	15063741e3	[Misc] Missing error message for custom ops import (#5282 )	2024-06-06 20:17:21 -07:00
Antoni Baum	ccdc490dda	[Core] Change LoRA embedding sharding to support loading methods (#5038 )	2024-06-06 19:07:57 -07:00
Antoni Baum	a31cab7556	[Core] Avoid copying prompt/output tokens if no penalties are used (#5289 )	2024-06-06 18:12:00 -07:00
Matthew Goldey	828da0d44e	[Frontend] enable passing multiple LoRA adapters at once to generate() (#5300 )	2024-06-06 15:48:13 -05:00
Philipp Moritz	abe855d637	[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294 )	2024-06-06 09:29:29 -07:00
liuyhwangyh	4efff036f0	Bugfix: fix broken of download models from modelscope (#5233 ) Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>	2024-06-06 09:28:10 -07:00
Cyrus Leung	89c920785f	[CI/Build] Update vision tests (#5307 )	2024-06-06 05:17:18 -05:00
Breno Faria	7b0a0dfb22	[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109 ) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>	2024-06-05 16:49:12 -07:00
Woosuk Kwon	6a7c7711a2	[Misc] Skip for logits_scale == 1.0 (#5291 )	2024-06-05 15:19:02 -07:00
Alex Wu	0f83ddd4d7	[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290 )	2024-06-05 15:18:12 -07:00
Michael Goin	065aff6c16	[Bugfix] Make EngineArgs use named arguments for config construction (#5285 )	2024-06-05 15:16:56 -07:00
Nick Hill	3d33e372a1	[BugFix] Fix log message about default max model length (#5284 )	2024-06-05 14:53:16 -07:00
Nick Hill	faf71bcd4b	[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252 )	2024-06-05 14:53:05 -07:00
Philipp Moritz	51a08e7d8f	[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238 )	2024-06-05 10:59:14 -07:00
DriverSong	eb8fcd2666	[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207 ) Co-authored-by: qiujiawei9 <qiujiawei9@jd.com>	2024-06-05 10:59:02 -07:00
Cody Yu	5563a4dea8	[Model] Correct Mixtral FP8 checkpoint loading (#5231 )	2024-06-05 10:58:50 -07:00
tomeras91	f0a500545f	[Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) (#5278 )	2024-06-05 09:32:58 -07:00
Woosuk Kwon	c65146e75e	[Misc] Fix docstring of get_attn_backend (#5271 )	2024-06-05 09:18:59 -07:00
Woosuk Kwon	41ca62cf03	[Misc] Add CustomOp interface for device portability (#5255 )	2024-06-05 09:18:19 -07:00
zifeitong	974fc9b845	[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226 )	2024-06-04 19:37:28 -07:00
zifeitong	a58f24e590	[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229 )	2024-06-03 20:55:50 -07:00
Woosuk Kwon	3a434b07ed	[Kernel] Enhance MoE benchmarking & tuning script (#4921 )	2024-06-03 20:06:59 -07:00
Toshiki Kataoka	06b2550cbb	[Bugfix] Support `prompt_logprobs==0` (#5217 )	2024-06-03 17:59:30 -07:00
Breno Faria	f775a07e30	[FRONTEND] OpenAI `tools` support named functions (#5032 )	2024-06-03 18:25:29 -05:00
Kaiyang Chen	10c38e3e46	[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834 )	2024-06-03 13:37:11 -07:00
Tyler Michael Smith	cbb2f59cc8	[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159 )	2024-06-03 09:52:30 -07:00
Antoni Baum	0ab278ca31	[Core] Remove unnecessary copies in flash attn backend (#5138 )	2024-06-03 09:39:31 -07:00
Cyrus Leung	7a64d24aad	[Core] Support image processor (#4197 )	2024-06-02 22:56:41 -07:00
Divakar Verma	a66cf40b20	[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927 ) This PR enables the fused topk_softmax kernel used in moe layer for HIP	2024-06-02 14:13:26 -07:00
Avinash Raj	f790ad3c50	[Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643 )	2024-06-02 08:06:13 +00:00
Robert Shaw	044793d8df	[BugFix] Prevent `LLM.encode` for non-generation Models (#5184 ) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-06-01 23:35:41 +00:00
Zhuohan Li	8279078e21	[Bugfix] Remove deprecated @abstractproperty (#5174 )	2024-06-01 22:40:25 +00:00
chenqianfzh	b9c0605a8e	[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776 )	2024-06-01 14:51:10 -06:00
Nadav Shmayovits	37464a0f74	[Bugfix] Fix call to init_logger in openai server (#4765 )	2024-06-01 17:18:50 +00:00
Ye Cao	c354072828	[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151 ) Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>	2024-06-01 17:11:22 +00:00
Tyler Michael Smith	260d119e86	[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137 )	2024-06-01 06:45:32 +00:00
Cody Yu	e9899fb7a4	[Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039 )	2024-05-31 14:29:19 -07:00
functionxu123	a377f0bd5e	[Misc]: optimize eager mode host time (#4196 ) Co-authored-by: xuhao <xuhao@cambricon.com>	2024-05-31 13:14:50 +08:00
SnowDist	a22dea54d3	[Model] Support MAP-NEO model (#5081 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-05-30 19:24:41 -07:00
Robert Shaw	b35be5403f	[Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120 )	2024-05-30 17:04:37 -07:00
Simon Mo	87a658c812	Bump version to v0.4.3 (#5046 )	2024-05-30 11:13:46 -07:00
Cyrus Leung	a9bcc7afb2	[Doc] Use intersphinx and update entrypoints docs (#5125 )	2024-05-30 09:59:23 -07:00
Hyunsung Lee	d79d9eaaff	[Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129 )	2024-05-30 06:56:19 -07:00

940 Commits