xinyun/vllm - vllm - 丝路新云 (Silk Road New Cloud) code repository

Mirror of https://git.datalinker.icu/vllm-project/vllm.git, last synced 2026-06-25 08:17:13 +08:00

Author	SHA1	Message	Date
Michael Goin	44f26a9466	[Model] Align nemotron config with final HF state and fix lm-eval-small (#7611)	2024-08-16 15:56:34 -07:00
bnellnm	37fd47e780	[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596)	2024-08-16 14:00:11 -07:00
Michael Goin	855866caa9	[Kernel] Add tuned triton configs for ExpertsInt8 (#7601)	2024-08-16 11:37:01 -07:00
Mor Zusman	7fc23be81c	[Kernel] W8A16 Int8 inside FusedMoE (#7415)	2024-08-16 10:06:51 -07:00
Charlie Fu	e837b624f2	[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)	2024-08-16 10:06:30 -07:00
Michael Goin	21313e09e3	[Bugfix] Fix default weight loading for scalars (#7534)	2024-08-15 13:10:22 -07:00
Kyle Sayers	f55a9aea45	[Misc] Revert `compressed-tensors` code reuse (#7521)	2024-08-14 15:07:37 -07:00
Cyrus Leung	3f674a49b5	[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)	2024-08-14 17:55:42 +00:00
Chang Su	c134a46402	Fix empty output when temp is too low (#2937) Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2024-08-14 05:31:44 +00:00
youkaichao	16422ea76f	[misc][plugin] add plugin system implementation (#7426)	2024-08-13 16:24:17 -07:00
Kyle Sayers	373538f973	[Misc] `compressed-tensors` code reuse (#7277)	2024-08-13 19:05:15 -04:00
Dipika Sikka	b1e5afc3e7	[Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` (#7422)	2024-08-13 17:08:20 -04:00
Dipika Sikka	d3bdfd3ab9	[Misc] Update Fused MoE weight loading (#7334)	2024-08-13 14:57:45 -04:00
Dipika Sikka	fb377d7e74	[Misc] Update `gptq_marlin` to use new vLLMParameters (#7281)	2024-08-13 14:30:11 -04:00
Peter Salas	00c3d68e45	[Frontend][Core] Add plumbing to support audio language models (#7446)	2024-08-13 17:39:33 +00:00
youkaichao	4d2dc5072b	[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102)	2024-08-13 00:16:42 -07:00
Cyrus Leung	7025b11d94	[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410)	2024-08-13 05:33:41 +00:00
Cyrus Leung	9ba85bc152	[mypy] Misc. typing improvements (#7417)	2024-08-13 09:20:20 +08:00
Roger Wang	e6e42e4b17	[Core][VLM] Support image embeddings as input (#6613)	2024-08-12 16:16:06 +08:00
Isotr0py	4c5d8e8ea9	[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392)	2024-08-10 16:19:33 +00:00
Dipika Sikka	5c6c54d67a	[Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models (#7376)	2024-08-09 21:23:46 +00:00
Mor Zusman	07ab160741	[Model][Jamba] Mamba cache single buffer (#6739) Co-authored-by: Mor Zusman <morz@ai21.com>	2024-08-09 10:07:06 -04:00
William Lin	57b7be0e1c	[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971)	2024-08-09 05:42:45 +00:00
Travis Johnson	99b4cf5f23	[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-08-08 22:08:46 -07:00
Alexander Matveev	e02ac55617	[Performance] Optimize e2e overheads: Reduce python allocations (#7162)	2024-08-08 21:34:28 -07:00
Cyrus Leung	7eb4a51c5f	[Core] Support serving encoder/decoder models (#7258)	2024-08-09 10:39:41 +08:00
Siyuan Liu	0fa14907da	[TPU] Add Load-time W8A16 quantization for TPU Backend (#7005)	2024-08-08 18:35:49 -07:00
Isotr0py	8334c39f37	[Bugfix] Fix new Llama3.1 GGUF model loading (#7269)	2024-08-08 13:42:44 -07:00
Jee Jee Li	757ac70a64	[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273)	2024-08-08 14:02:41 +00:00
Lucas Wilkinson	311f743831	[Bugfix] Fix gptq failure on T4s (#7264)	2024-08-07 20:05:37 +00:00
Michael Goin	5223199e03	[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219)	2024-08-07 11:23:12 -07:00
Isotr0py	b764547616	[Bugfix] Fix input processor for InternVL2 model (#7164) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-07 09:32:07 -07:00
Dipika Sikka	0f7052bc7e	[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` (#5874)	2024-08-07 09:17:58 -07:00
Michael Goin	f9a5600649	[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225)	2024-08-06 18:34:26 -07:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
Lily Liu	5c60c8c423	[SpecDecode] [Minor] Fix spec decode sampler tests (#7183)	2024-08-06 10:40:32 -07:00
Cyrus Leung	1f26efbb3a	[Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-08-06 16:55:31 +08:00
Isotr0py	360bd67cf0	[Core] Support loading GGUF model (#5191) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-05 17:54:23 -06:00
Thomas Parnell	789937af2e	[Doc] [SpecDecode] Update MLPSpeculator documentation (#7100) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-08-05 23:29:43 +00:00
Jungho Christopher Cho	c0d8f1636c	[Model] SiglipVisionModel ported from transformers (#6942) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-08-05 06:22:12 +00:00
Alphi	7b86e7c9cd	[Model] Add multi-image support for minicpmv (#7122) Co-authored-by: hezhihui <hzh7269@modelbest.cn> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-05 09:23:17 +08:00
Jee Jee Li	179a6a36f2	[Model]Refactor MiniCPMV (#7020) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-04 08:12:41 +00:00
Yihuan Bu	654bc5ca49	Support for guided decoding for offline LLM (#6878) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-08-04 03:12:09 +00:00
Isotr0py	0c25435daa	[Model] Refactor and decouple weight loading logic for InternVL2 model (#7067)	2024-08-02 22:36:14 -07:00
Robert Shaw	ed812a73fa	[ Frontend ] Multiprocessing for OpenAI Server with `zeromq` (#6883) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Joe Runde <joe@joerun.de> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-08-02 18:27:28 -07:00
Lucas Wilkinson	a8d604ca2a	[Misc] Disambiguate quantized types via a new ScalarType (#6396)	2024-08-02 13:51:58 -07:00
Peng Guanwen	db35186391	[Core] Comment out unused code in sampler (#7023)	2024-08-02 00:58:26 -07:00
Woosuk Kwon	805a8a75f2	[Misc] Support attention logits soft-capping with flash-attn (#7022)	2024-08-01 13:14:37 -07:00
Murali Andoorveedu	fc912e0886	[Models] Support Qwen model with PP (#6974) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-08-01 12:40:43 -07:00
Michael Goin	f4fd390f5d	[Bugfix] Lower gemma's unloaded_params exception to warning (#7002)	2024-08-01 12:01:07 -07:00

635 Commits