xinyun/vllm - vllm - 丝路新云 Code Repository

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2025-12-10 11:26:15 +08:00

Author	SHA1	Message	Date
Jee Li	1096717ae9	[Core] Support LoRA on quantized models (#4012 )	2024-04-11 21:02:44 -07:00
Nick Hill	e46a60aa4c	[BugFix] Fix handling of stop strings and stop token ids (#3672 )	2024-04-11 15:34:12 -07:00
Antoni Baum	1e96c3341a	Add extra punica sizes to support bigger vocabs (#4015 )	2024-04-11 22:18:57 +00:00
Dylan Hawk	95e7d4a97c	Fix echo/logprob OpenAI completion bug (#3441 ) Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>	2024-04-11 22:15:50 +00:00
Antoni Baum	a10d3056da	[Core] Set `linear_weights` directly on the layer (#3977 )	2024-04-11 16:35:51 -04:00
Kunshang Ji	e9da5a40c6	[Misc] Add indirection layer for custom ops (#3913 )	2024-04-10 20:26:07 -07:00
SangBin Cho	e42df7227d	[Test] Add xformer and flash attn tests (#3961 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-11 03:09:50 +00:00
SangBin Cho	67b4221a61	[Core][5/N] Fully working chunked prefill e2e (#3884 )	2024-04-10 17:56:48 -07:00
youkaichao	63e7176f26	[Core][Refactor] move parallel_utils into vllm/distributed (#3950 ) [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)	2024-04-10 15:33:30 -07:00
Travis Johnson	0258b7a94b	[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-04-10 01:39:56 -07:00
胡译文	b3104b2a10	[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899 )	2024-04-10 00:09:36 -07:00
Jee Li	11dd6ebb89	[Misc] Avoid loading incorrect LoRA config (#3777 )	2024-04-09 19:47:15 -07:00
Cade Daniel	e7c7067b45	[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837 )	2024-04-09 11:44:15 -07:00
youkaichao	95baec828f	[Core] enable out-of-tree model register (#3871 )	2024-04-06 17:11:41 -07:00
SangBin Cho	18de883489	[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853 )	2024-04-05 10:17:58 -07:00
Cade Daniel	e5043a3e75	[Misc] Add pytest marker to opt-out of global test cleanup (#3863 )	2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser	aabe8f40f2	[Core] [Frontend] Make detokenization optional (#3749 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-04-03 21:52:18 -07:00
Michael Feil	537ee25f43	[Core] Enable hf_transfer by default if available (#3817 )	2024-04-04 04:02:43 +00:00
Adrian Abeyta	2ff767b513	Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290 ) Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu> Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com> Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com> Co-authored-by: guofangze <guofangze@kuaishou.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-03 14:15:55 -07:00
SangBin Cho	3dcb3e8b98	[3/N] Refactor scheduler for chunked prefill scheduling (#3550 )	2024-04-03 14:13:49 -07:00
Cade Daniel	5757d90e26	[Speculative decoding] Adding configuration object for speculative decoding (#3706 ) Co-authored-by: Lily Liu <lilyliupku@gmail.com>	2024-04-03 00:40:57 +00:00
Cade Daniel	eb69d68804	[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783 )	2024-04-02 00:49:51 +00:00
Qubitium	7d4e1b85e7	[Misc] Add support for new autogptq checkpoint_format (#3689 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-04-01 19:32:01 -04:00
Cade Daniel	93deb0b38f	[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250 )	2024-04-01 22:55:24 +00:00
Nick Hill	49782fcb76	[Misc] Some minor simplifications to detokenization logic (#3670 ) Some simplifications made for clarity. Also moves detokenization-related functions from tokenizer.py to detokenizer.py.	2024-04-01 13:22:06 -07:00
Robert Shaw	563c1d7ec5	[CI/Build] Make Marlin Tests Green (#3753 )	2024-03-30 19:18:34 -07:00
mawong-amd	b6d103542c	[Kernel] Layernorm performance optimization (#3662 )	2024-03-30 14:26:38 -07:00
Roy	f510395bbf	[BugFix][Frontend] Fix completion logprobs=0 error (#3731 )	2024-03-29 09:38:21 -07:00
Roy	6110c39dc8	[BugFix] Fix tokenizer out of vocab size (#3685 )	2024-03-29 08:18:59 -07:00
youkaichao	756b30a5f3	[Core][Test] move local_rank to the last arg with default value(#3711 ) [Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)	2024-03-28 21:19:45 -07:00
SangBin Cho	26422e477b	[Test] Make model tests run again and remove --forked from pytest (#3631 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-03-28 21:06:40 -07:00
Roy	515386ef3c	[Core] Support multi-node inference(eager and cuda graph) (#3686 )	2024-03-28 15:01:55 -07:00
SangBin Cho	b51c1cc9d2	[2/N] Chunked prefill data update (#3538 )	2024-03-28 10:06:01 -07:00
Cade Daniel	14ccd94c89	[Core][Bugfix]Refactor block manager for better testability (#3492 )	2024-03-27 23:59:28 -07:00
Roger Wang	45b6ef6513	feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277 )	2024-03-27 13:39:26 -07:00
youkaichao	8f44facddd	[Core] remove cupy dependency (#3625 )	2024-03-27 00:33:26 -07:00
Jee Li	566b57c5c4	[Kernel] support non-zero cuda devices in punica kernels (#3636 )	2024-03-27 00:37:42 +00:00
Jee Li	8af890a865	Enable more models to inference based on LoRA (#3382 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-03-25 18:09:31 -07:00
Nick Hill	dfeb2ecc3a	[Misc] Include matched stop string/token in responses (#2976 ) Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>	2024-03-25 17:31:32 -07:00
xwjiang2010	64172a976c	[Feature] Add vision language model support. (#3042 )	2024-03-25 14:16:30 -07:00
Simon Mo	f408d05c52	hotfix isort on logprobs ranks pr (#3622 )	2024-03-25 11:55:46 -07:00
Dylan Hawk	0b4997e05c	[Bugfix] API stream returning two stops (#3450 ) Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>	2024-03-25 10:14:34 -07:00
Travis Johnson	c13ad1b7bd	feat: implement the min_tokens sampling parameter (#3124 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-03-25 10:14:26 -07:00
Swapnil Parekh	819924e749	[Core] Adding token ranks along with logprobs (#3516 ) Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>	2024-03-25 10:13:10 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00
Woosuk Kwon	925f3332ca	[Core] Refactor Attention Take 2 (#3462 )	2024-03-25 04:39:33 +00:00
youkaichao	837e185142	[CI/Build] fix flaky test (#3602 )	2024-03-24 17:43:05 -07:00
youkaichao	8b268a46a7	[CI] typo fix: is_hip --> is_hip() (#3595 )	2024-03-24 16:03:06 -07:00
Nick Hill	41deac4a3d	[BugFix] 1D query fix for MoE models (#3597 )	2024-03-24 16:00:16 -07:00
Antoni Baum	bfdb1ba5c3	[Core] Improve detokenization performance for prefill (#3469 ) Co-authored-by: MeloYang <meloyang05@gmail.com>	2024-03-22 13:44:12 -07:00

845 Commits