113 Commits

Author SHA1 Message Date
Simon Mo
a134ef6f5e
Support eos_token_id from generation_config.json (#4182) 2024-04-19 04:13:36 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code (#4097) 2024-04-16 11:34:39 -07:00
Noam Gat
05434764cd
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
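
A minimal sketch of the guided decoding added in #3868, assuming a vLLM OpenAI-compatible server is already running locally; the `guided_json` and `guided_decoding_backend` extra fields are assumptions about the request schema and may differ between versions.

```python
# Hypothetical sketch: constrain output to a JSON schema via the
# lm-format-enforcer backend introduced in #3868. Extra field names
# are assumptions and may vary across vLLM versions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # whichever model the server was started with
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={
        "guided_json": schema,                             # assumed extra field
        "guided_decoding_backend": "lm-format-enforcer",   # assumed backend name
    },
)
print(completion.choices[0].message.content)
```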
Sanger Steel
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer (#3476) 2024-04-13 17:13:01 -07:00
Nick Hill
e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids (#3672) 2024-04-11 15:34:12 -07:00
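
For context on the stop handling that #3672 fixes, a small offline example using stop strings and stop token ids in `SamplingParams`; the `include_stop_str_in_output` flag is shown as an assumption about the available knobs.

```python
# Minimal sketch of the stop-string / stop-token-id paths touched by #3672.
# Model name and token ids are placeholders; adjust for your tokenizer.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

params = SamplingParams(
    max_tokens=64,
    stop=["\n\n", "###"],              # stop when any of these strings is matched
    stop_token_ids=[2],                # ...or when any of these token ids is produced
    include_stop_str_in_output=False,  # assumed flag: drop the matched stop string
)

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text, "| finish_reason:", out.outputs[0].finish_reason)
```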
SangBin Cho
67b4221a61
[Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
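
A hedged sketch of switching on the chunked-prefill path that the #3236/#3853/#3884 series builds; the `enable_chunked_prefill` and `max_num_batched_tokens` argument names are assumptions about how the feature is exposed.

```python
# Assumed flags for the chunked-prefill series: long prompts are split into
# chunks so prefill can be batched together with decode steps.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",
    enable_chunked_prefill=True,    # assumed flag from the chunked-prefill series
    max_num_batched_tokens=512,     # assumed cap on tokens scheduled per step
)

out = llm.generate(["Long prompt " * 200], SamplingParams(max_tokens=8))[0]
print(out.outputs[0].text)
```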
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) 2024-04-05 10:17:58 -07:00
Matthias Gerstgrasser
aabe8f40f2
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
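
A sketch of the optional detokenization from #3749, assuming the switch is a `detokenize` field on `SamplingParams`; useful when only token ids are needed and the incremental text decode can be skipped.

```python
# Assumed `detokenize` flag from #3749: skip incremental text decoding.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16, detokenize=False)  # assumed flag name

out = llm.generate(["Hello"], params)[0].outputs[0]
print(out.token_ids)   # token ids are always available
print(repr(out.text))  # expected to be empty when detokenization is skipped
```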
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling (#3550) 2024-04-03 14:13:49 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
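
A speculative-decoding configuration sketch around #3706. The `speculative_model`, `num_speculative_tokens`, and `use_v2_block_manager` argument names are assumptions about how the new configuration object is surfaced, and may not be fully wired up until later commits in the series.

```python
# Assumed engine arguments mapping onto the speculative-decoding config (#3706):
# a small draft model proposes tokens that the target model then verifies.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    speculative_model="meta-llama/Llama-2-7b-chat-hf",  # assumed: draft model
    num_speculative_tokens=4,                           # assumed: draft tokens per step
    use_v2_block_manager=True,                          # assumed requirement at the time
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```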
leiwen83
ad6eca408b
Fix early CUDA init via get_architecture_class_name import (#3770)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-02 11:56:26 -07:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend (#3634)
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size (#3685) 2024-03-29 08:18:59 -07:00
yhu422
d8658c8cc1
Usage Stats Collection (#2852) 2024-03-28 22:16:12 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update (#3538) 2024-03-28 10:06:01 -07:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. (#3042) 2024-03-25 14:16:30 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
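
A short example of the `min_tokens` sampling parameter from #3124: EOS and stop conditions are not applied until at least `min_tokens` tokens have been generated.

```python
# min_tokens (#3124): stop strings and EOS are ignored for the first 16 tokens.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(min_tokens=16, max_tokens=64, stop=["."])

out = llm.generate(["Write one word:"], params)[0].outputs[0]
print(len(out.token_ids), out.text)  # at least 16 tokens before any stop applies
```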
SangBin Cho
01bfb22b41
[CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
Antoni Baum
bfdb1ba5c3
[Core] Improve detokenization performance for prefill (#3469)
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy
865732342b
[Misc][Log] Add log for tokenizer length not equal to vocabulary size (#3500) 2024-03-21 18:07:48 +08:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes (#3236) 2024-03-20 14:46:05 -07:00
Antoni Baum
fb96c1e98c
Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
Tao He
14b8ae02e7
Fixes the misuse/mix-up of time.time() and time.monotonic() (#3220)
Signed-off-by: Tao He <sighingnow@gmail.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-03-15 18:25:43 +00:00
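
Why #3220 matters, in plain Python: `time.time()` can jump (NTP adjustments, clock changes), so elapsed durations should come from `time.monotonic()`, with `time.time()` reserved for wall-clock timestamps.

```python
# Wall-clock timestamps vs. monotonic durations.
import time

arrival_wall_clock = time.time()   # fine for logging an absolute timestamp
start = time.monotonic()           # use a monotonic clock for elapsed time
_ = sum(i * i for i in range(100_000))
elapsed = time.monotonic() - start

print(f"arrived at {arrival_wall_clock:.0f}, took {elapsed * 1000:.2f} ms")
```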
Zhuohan Li
4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Nick Hill
2efce05dc3
[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-06 00:17:20 +00:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access (#3166) 2024-03-05 15:35:43 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
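
A quick probe of the health check added around #3015, assuming the OpenAI-compatible server exposes a `/health` route on localhost:8000.

```python
# Probe the (assumed) /health route of a locally running vLLM API server.
import requests

resp = requests.get("http://localhost:8000/health", timeout=5)
print("healthy" if resp.status_code == 200 else f"unhealthy: {resp.status_code}")
```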
Antoni Baum
22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
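
Requesting log-probabilities, which #3065 pushes down into `LLMEngine`; the exact shape of the returned logprob objects may differ slightly by version.

```python
# Ask for top-5 logprobs on generated tokens and logprobs on prompt tokens.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, logprobs=5, prompt_logprobs=1)

out = llm.generate(["vLLM is"], params)[0]
print(out.prompt_logprobs)          # per-prompt-token logprobs (first entry is None)
print(out.outputs[0].logprobs[0])   # candidates for the first generated token
```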
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight (#3162)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Jason Cox
d65fac2738
Add vLLM version info to logs and openai API server (#3161) 2024-03-02 21:00:29 -08:00
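
The version string that #3161 starts logging is also available programmatically:

```python
# Print the installed vLLM version.
import vllm

print(vllm.__version__)
```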
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
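
A sketch of automatic prefix caching (#2762), assuming the feature is switched on with an `enable_prefix_caching` engine flag: KV blocks for a shared prompt prefix are reused across requests.

```python
# Assumed flag for automatic prefix caching (#2762).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # assumed flag name

system = "You are a terse assistant. " * 50   # long shared prefix
params = SamplingParams(max_tokens=16)

# The second call should reuse cached KV blocks for the shared prefix.
llm.generate([system + "Q: 1+1?"], params)
llm.generate([system + "Q: 2+2?"], params)
```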
Sherry
54d3544784
Fix: Output text is always truncated in some models (#3016) 2024-03-01 07:52:22 +00:00
Nick Hill
29a8d6a554
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#3099) 2024-02-29 19:20:42 +00:00
Allen.Dou
9289e577ec
add cache_config's info to prometheus metrics. (#3100) 2024-02-29 06:15:18 +00:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Roy
c1c0d00b88
Don't use cupy when enforce_eager=True (#3037) 2024-02-26 17:33:38 -08:00
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
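
The prometheus_client metrics ported in #2730 (and extended in #3100) can be scraped from the API server's `/metrics` route; this assumes a server running locally and that vLLM metric names carry a `vllm:` prefix.

```python
# Scrape and filter vLLM metrics from a locally running API server.
import requests

for line in requests.get("http://localhost:8000/metrics", timeout=5).text.splitlines():
    if line.startswith("vllm:"):   # assumed metric-name prefix
        print(line)
```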
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
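
Reading the per-request timing that #2876 attaches to `RequestOutput`; the field names on the metrics object are assumptions and may differ by version.

```python
# Inspect per-request metrics (assumed field names) populated by #2876.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))[0]

m = out.metrics
if m is not None:
    print("arrival:", m.arrival_time)          # assumed field
    print("first token:", m.first_token_time)  # assumed field
    print("finished:", m.finished_time)        # assumed field
```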
Ronen Schaffer
e433c115bc
Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Nick Hill
185b2c29e2
Defensively copy sampling_params (#2881)
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.

Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
2024-02-17 11:18:04 -08:00
Woosuk Kwon
a463c333dd
Use CuPy for CUDA graphs (#2811) 2024-02-13 11:32:06 -08:00