xinyun/vllm - vllm - 丝路新云-代码仓

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2025-12-17 22:35:01 +08:00

Author	SHA1	Message	Date
Jee Li	d6f4bd7cdd	[Misc]Add customized information for models (#4132 )	2024-04-30 21:18:14 -07:00
Robert Shaw	111815d482	[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-04-30 21:46:12 +00:00
SangBin Cho	df29793dc7	[mypy][5/N] Support all typing on model executor (#4427 )	2024-04-28 19:01:26 -07:00
Cody Yu	a62aaf1df5	[Misc][Refactor] Generalize linear_method to be quant_method (#4373 )	2024-04-26 16:41:14 -04:00
Robert Shaw	79a268c4ab	[BUG] fixed fp8 conflict with aqlm (#4307 ) Fixes fp8 iterface which broke in AQLM merge.	2024-04-23 18:26:33 -07:00
James Fleming	2b7949c1c2	AQLM CUDA support (#3287 ) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-23 13:59:33 -04:00
Cody Yu	a22cdea371	[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118 ) Provide an initial support to FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726 This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine. Algorithm: We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass. Initial Results: Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128: BF16: 1.47s FP8: 1.66s I'll try to use larger models and try to find more performance bottleneck. Meanwhile, you're welcome to try this code.	2024-04-20 04:28:57 +00:00
Antoni Baum	a10d3056da	[Core] Set `linear_weights` directly on the layer (#3977 )	2024-04-11 16:35:51 -04:00
youkaichao	63e7176f26	[Core][Refactor] move parallel_utils into vllm/distributed (#3950 ) [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)	2024-04-10 15:33:30 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00
Hui Liu	ba8dc958a3	[Minor] Fix bias in if to remove ambiguity (#3259 )	2024-03-13 09:16:55 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305 )	2024-03-10 19:49:14 -07:00
Robert Shaw	c0c2335ce0	Integrate Marlin Kernels for Int4 GPTQ inference (#2497 ) Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com> Co-authored-by: alexm <alexm@neuralmagic.com>	2024-03-01 12:47:51 -08:00
Kunshang Ji	96b6f475dd	Remove hardcoded `device="cuda"` to support more devices (#2503 ) Co-authored-by: Jiang Li <jiang1.li@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>	2024-02-01 15:46:39 -08:00
Chenhui Zhang	f780504d12	fix weigit loading for GQA with TP (#2379 )	2024-01-15 15:43:59 -08:00
CHU Tianxiang	0fbfc4b81b	Add GPTQ support (#916 )	2023-12-15 03:04:22 -08:00
Zhuohan Li	7076fa1c9f	TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622 ) Refactor the tensor parallelism, quantization, and weight-loading codes. Summary of the new features enabled by this PR: - All models are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580). - Model loading code became much simpler. - Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.	2023-11-15 22:50:41 -08:00

17 Commits