294 Commits

Author | SHA1 | Message | Date
SangBin Cho | 01bfb22b41 | [CI] Try introducing isort. (#3495) | 2024-03-25 07:59:47 -07:00
TianYu GUO | e67c295b0c | [Bugfix] fix automatic prefix args and add log info (#3608) | 2024-03-25 05:35:22 -07:00
Thomas Parnell | cf2f084d56 | Dynamic scheduler delay to improve ITL performance (#3279) | 2024-03-22 12:28:14 -07:00
    Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
SangBin Cho | 6e435de766 | [1/n][Chunked Prefill] Refactor input query shapes (#3236) | 2024-03-20 14:46:05 -07:00
Antoni Baum | fb96c1e98c | Asynchronous tokenization (#2879) | 2024-03-15 23:37:01 +00:00
Antoni Baum | 22de45235c | Push logprob generation to LLMEngine (#3065) | 2024-03-04 19:54:06 +00:00
    Co-authored-by: Avnish Narayan <avnish@anyscale.com>
Philipp Moritz | 17c3103c56 | Make it easy to profile workers with nsight (#3162) | 2024-03-03 16:19:13 -08:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) | 2024-03-02 00:50:01 -08:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
Liangfu Chen | 3b7178cfa4 | [Neuron] Support inference with transformers-neuronx (#2569) | 2024-02-28 09:34:34 -08:00
Nick Hill | 7d2dcce175 | Support per-request seed (#2514) | 2024-02-21 11:47:00 -08:00
Mark Mozolewski | 786b7f18a5 | Add code-revision config argument for Hugging Face Hub (#2892) | 2024-02-17 22:36:53 -08:00
Kunshang Ji | 96b6f475dd | Remove hardcoded device="cuda" to support more devices (#2503) | 2024-02-01 15:46:39 -08:00
    Co-authored-by: Jiang Li <jiang1.li@intel.com>
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
zhaoyang-star | 9090bf02e7 | Support FP8-E5M2 KV Cache (#2279) | 2024-01-28 16:43:54 -08:00
    Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Hanzhi Zhou | 380170038e | Implement custom all reduce kernels (#2192) | 2024-01-27 12:46:35 -08:00
Antoni Baum | 9b945daaf1 | [Experimental] Add multi-LoRA support (#1804) | 2024-01-23 15:26:37 -08:00
    Co-authored-by: Chen Shen <scv119@gmail.com>
    Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
    Co-authored-by: Avnish Narayan <avnish@anyscale.com>
Suhong Moon | 290e015c6c | Update Help Text for --gpu-memory-utilization Argument (#2183) | 2023-12-18 11:33:24 -08:00
JohnSaxon | bbe4466fd9 | [Minor] Fix typo (#2166) | 2023-12-17 23:28:49 -08:00
    Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
Woosuk Kwon | 30fb0956df | [Minor] Add more detailed explanation on quantization argument (#2145) | 2023-12-17 01:56:16 -08:00
Woosuk Kwon | 37ca558103 | Optimize model execution with CUDA graph (#1926) | 2023-12-16 21:12:08 -08:00
    Co-authored-by: Chen Shen <scv119@gmail.com>
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
CHU Tianxiang | 0fbfc4b81b | Add GPTQ support (#916) | 2023-12-15 03:04:22 -08:00
Woosuk Kwon | 27feead2f8 | Refactor Worker & InputMetadata (#1843) | 2023-11-29 22:16:37 -08:00
Casper | a921d8be9d | [DOCS] Add engine args documentation (#1741) | 2023-11-22 12:31:27 -08:00
boydfd | 4bb6b67188 | fix RAM OOM when load large models in tensor parallel mode. (#1395) | 2023-11-20 19:02:42 -08:00
    Co-authored-by: ran_lin <rlin@thoughtworks.com>
chooper1 | 1f24755bf8 | Support SqueezeLLM (#1326) | 2023-10-21 23:14:59 -07:00
    Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Woosuk Kwon | c1376e0f82 | Change scheduler & input tensor shape (#1381) | 2023-10-16 17:48:42 -07:00
Federico Cassano | 66d18a7fb0 | add support for tokenizer revision (#1163) | 2023-10-02 19:19:46 -07:00
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Woosuk Kwon | f936657eb6 | Provide default max model length (#1224) | 2023-09-28 14:44:02 -07:00
Chris Bamford | bb1ba58f06 | [Mistral] Mistral-7B-v0.1 support (#1196) | 2023-09-28 10:41:03 -07:00
    Co-authored-by: timlacroix <t@mistral.ai>
Woosuk Kwon | a19bc5c628 | Automatically configure max_num_batched_tokens (#1198) | 2023-09-27 16:34:00 -07:00
Woosuk Kwon | 1ac4ccf73c | Add float16 and float32 (#1115) | 2023-09-21 00:52:47 -07:00
Woosuk Kwon | e3e79e9e8a | Implement AWQ quantization support for LLaMA (#1032) | 2023-09-16 00:03:37 -07:00
    Co-authored-by: Robert Irvine <robert@seamlessml.com>
    Co-authored-by: root <rirv938@gmail.com>
    Co-authored-by: Casper <casperbh.96@gmail.com>
    Co-authored-by: julian-q <julianhquevedo@gmail.com>
Jasmond L | ab019eea75 | Add Model Revision Support (#1014) | 2023-09-13 15:20:02 -07:00
    Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Antoni Baum | 0bb1e885a0 | Make max_model_len configurable (#972) | 2023-09-12 16:29:19 -07:00
leiwen83 | d6545ad22e | add option to shorten prompt print in log (#991) | 2023-09-12 15:10:14 -07:00
    Signed-off-by: Lei Wen <wenlei03@qiyi.com>
    Co-authored-by: Lei Wen <wenlei03@qiyi.com>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Zhuohan Li | c957c741d9 | Enable safetensors loading for all models (#974) | 2023-09-07 15:49:52 -07:00
Zhuohan Li | 58a072be15 | [Fix] Add model sequence length into model config (#575) | 2023-07-25 23:46:30 -07:00
Lily Liu | b4b195b360 | fix max seq len (#489) | 2023-07-17 23:20:20 -07:00
codethazine | a945fcc2ae | Add trust-remote-code flag to handle remote tokenizers (#364) | 2023-07-07 11:04:58 -07:00
Zhuohan Li | d6fa1be3a8 | [Quality] Add code formatter and linter (#326) | 2023-07-03 11:31:55 -07:00
Lily Liu | dafd924c1f | Raise error for long prompt (#273) | 2023-06-30 18:48:49 -07:00
Woosuk Kwon | 998d9d1509 | [Tokenizer] Add tokenizer mode (#298) | 2023-06-28 14:19:22 -07:00
Woosuk Kwon | 4338cc4750 | [Tokenizer] Add an option to specify tokenizer (#284) | 2023-06-28 09:46:58 -07:00
Zhuohan Li | bf5f121c02 | Reduce GPU memory utilization to make sure OOM doesn't happen (#153) | 2023-06-18 17:33:50 +08:00
Woosuk Kwon | 0b98ba15c7 | Change the name to vLLM (#150) | 2023-06-17 03:07:40 -07:00
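
Several commits in this history added user-facing configuration and sampling options: the trust-remote-code flag (#364), tokenizer mode (#298), model/tokenizer/code revisions (#1014, #1163, #2892), configurable max_model_len (#972, #1224), the --gpu-memory-utilization help text (#2183), FP8-E5M2 KV cache (#2279), automatic prefix caching (#2762), and per-request seeds (#2514). The sketch below is illustrative only, showing how such options surface in vLLM's offline Python API; exact parameter names and accepted values vary across vLLM versions, so consult the engine-args documentation (#1741) for the release you use.

```python
# Illustrative sketch, not a definitive configuration: parameter names below
# reflect vLLM's offline API around the time of these commits and may differ
# in other releases.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",  # Mistral-7B-v0.1 support (#1196)
    trust_remote_code=True,             # trust-remote-code flag (#364)
    tokenizer_mode="auto",              # tokenizer mode (#298)
    revision="main",                    # model revision (#1014)
    tokenizer_revision="main",          # tokenizer revision (#1163)
    max_model_len=4096,                 # configurable max_model_len (#972, #1224)
    gpu_memory_utilization=0.90,        # --gpu-memory-utilization (#2183)
    # Optional extras from later commits; enable as needed for your version:
    # kv_cache_dtype="fp8_e5m2",        # FP8-E5M2 KV cache (#2279)
    # enable_prefix_caching=True,       # automatic prefix caching (#2762)
)

# Per-request seed (#2514): each request can pin its own sampling seed.
params = SamplingParams(temperature=0.8, max_tokens=64, seed=1234)
outputs = llm.generate(["Explain paged attention in one sentence."], params)
print(outputs[0].outputs[0].text)
```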