xinyun/vllm - vllm - 丝路新云-代码仓

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-05-28 09:47:05 +08:00

Author	SHA1	Message	Date
youkaichao	d1c3d7d139	[misc][distributed] fix benign error in `is_in_the_same_node` (#5512 )	2024-06-14 10:59:28 -07:00
Cyrus Leung	77490c6f2f	[Core] Remove duplicate processing in async engine (#5525 )	2024-06-14 10:04:42 -07:00
Robert Shaw	15985680e2	[ Misc ] Rs/compressed tensors cleanup (#5432 ) Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>	2024-06-14 10:01:46 -07:00
Tyler Michael Smith	703475f6c2	[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516 )	2024-06-14 09:30:15 -07:00
Simon Mo	0f0d8bc065	bump version to v0.5.0.post1 (#5522 )	2024-06-13 19:42:06 -07:00
Antoni Baum	50eed24d25	Add `cuda_device_count_stateless` (#5473 )	2024-06-13 16:06:49 -07:00
Tyler Michael Smith	e38042d4af	[Kernel] Disable CUTLASS kernels for fp8 (#5505 )	2024-06-13 13:38:05 -07:00
Antoni Baum	6b0511a57b	Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478 )	2024-06-13 11:22:50 -07:00
Cody Yu	30299a41fa	[MISC] Remove FP8 warning (#5472 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2024-06-13 11:22:30 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Cyrus Leung	0ce7b952f8	[Doc] Update LLaVA docs (#5437 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-13 11:22:07 -07:00
Cyrus Leung	03dccc886e	[Misc] Add vLLM version getter to utils (#5098 )	2024-06-13 11:21:39 -07:00
Li, Jiang	80aa7e91fc	[Hardware][Intel] Optimize CPU backend and add more performance tips (#4971 ) Co-authored-by: Jianan Gu <jianan.gu@intel.com>	2024-06-13 09:33:14 -07:00
wenyujin333	bd43973522	[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497 ) Tune Qwen2-57B-A14B configs based on #4921. Throughput performance command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2. A100 GPU benchmark (no config vs. w/ PR): tp=2: 10.53 requests/s, 11058.17 tokens/s vs. 12.47 requests/s, 13088.57 tokens/s; tp=4: 17.77 requests/s, 18662.95 tokens/s vs. 20.20 requests/s, 21212.32 tokens/s	2024-06-13 09:01:10 -07:00
Dipika Sikka	c2637a613b	[Kernel] `w4a16` support for `compressed-tensors` (#5385 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 10:19:56 -04:00
youkaichao	ea3890a5f0	[Core][Distributed] code deduplication in tp&pp with coordinator(#5293 ) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)	2024-06-12 17:27:08 -07:00
Isotr0py	2135cacb45	[Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451 )	2024-06-12 16:20:18 -07:00
Michael Goin	7d19de2e9c	[Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425 )	2024-06-12 18:42:12 -04:00
Michael Goin	94a07bbdd8	[Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470 )	2024-06-12 21:59:44 +00:00
youkaichao	622d45128c	[misc] add hint for AttributeError (#5462 )	2024-06-12 21:46:35 +00:00
Travis Johnson	51602eefd3	[Frontend] [Core] Support for sharded tensorized models (#4990 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 14:13:52 -07:00
Arthur Kim	5cc50a531f	[Bugfix] TYPE_CHECKING for MultiModalData (#5444 )	2024-06-12 14:08:52 -07:00
Li, Jiang	c3c2903e72	[Bugfix] Add device assertion to TorchSDPA (#5402 )	2024-06-12 12:58:53 -07:00
Woosuk Kwon	1a8bfd92d5	[Hardware] Initial TPU integration (#5292 )	2024-06-12 11:53:03 -07:00
Nick Hill	99dac099ab	[Core][Doc] Default to multiprocessing for single-node distributed case (#5230 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-11 11:10:41 -07:00
youkaichao	c4bd03c7c5	[Core][Distributed] add same-node detection (#5369 )	2024-06-11 10:53:59 -07:00
sasha0552	dcbf4286af	[Frontend] Customizable RoPE theta (#5197 )	2024-06-11 10:42:26 -07:00
Ali Panahi	00e6a2dc53	[Bugfix] fix lora_dtype value type in arg_utils.py (#5398 )	2024-06-11 10:40:23 -07:00
Junichi Sato	2e02311a1b	[Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 (#5254 )	2024-06-11 10:38:07 -07:00
Woosuk Kwon	8bab4959be	[Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389 )	2024-06-11 00:37:56 -07:00
Cyrus Leung	640052b069	[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026 )	2024-06-10 22:36:46 -07:00
maor-ps	351d5e7b82	[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-11 10:30:31 +08:00
Nick Hill	a008629807	[Misc] Various simplifications and typing fixes (#5368 )	2024-06-11 10:29:02 +08:00
Simon Mo	114332b88e	Bump version to v0.5.0 (#5384 )	2024-06-10 15:56:06 -07:00
Cyrus Leung	2c0d933594	[Bugfix] Fix LLaVA-NeXT (#5380 )	2024-06-10 15:38:47 +00:00
Itay Etelis	774d1035e4	[Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (#5319 )	2024-06-10 14:22:09 +00:00
Cyrus Leung	6b29d6fe70	[Model] Initial support for LLaVA-NeXT (#4199 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-10 12:47:15 +00:00
Cyrus Leung	0bfa1c4f13	[Misc] Improve error message when LoRA parsing fails (#5194 )	2024-06-10 19:38:49 +08:00
youkaichao	c81da5f56d	[misc][typo] fix typo (#5372 )	2024-06-10 09:51:02 +00:00
Roger Wang	68bc81703e	[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (#5374 )	2024-06-10 09:13:39 +00:00
Dipika Sikka	5884c2b454	[Misc] Update to comply with the new `compressed-tensors` config (#5350 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-06-10 03:49:46 +00:00
Bla_ckB	45f92c00cf	[Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164 )	2024-06-09 16:23:14 -07:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )	2024-06-09 16:23:30 -04:00
youkaichao	0373e1837e	[Core][CUDA Graph] add output buffer for cudagraph (#5074 ) [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)	2024-06-08 19:14:43 -07:00
Michael Goin	c09dade2a2	[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353 )	2024-06-08 13:54:05 -04:00
Hongxia Yang	c96fc06747	[ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965 )	2024-06-07 19:13:12 -07:00
Cheng Li	e69ded7d1c	[Bug Fix] Fix the support check for FP8 CUTLASS (#5352 ) Bug description: With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112). This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.	2024-06-08 00:42:05 +00:00
Calvinn Ng	767c727a81	fix DbrxFusedNormAttention missing cache_config (#5340 ) Co-authored-by: team <calvinn.ng@ahrefs.com>	2024-06-07 14:10:21 -07:00
Roger Wang	7a9cb294ae	[Frontend] Add OpenAI Vision API Support (#5237 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-07 11:23:32 -07:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00

988 Commits