1028 Commits

Author SHA1 Message Date
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU (#5717) 2024-06-21 06:39:40 +00:00
youkaichao
d9a252bc8e
[Core][Distributed] add shm broadcast (#5399)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
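
For context, "shm broadcast" means passing control messages between same-node processes through shared memory instead of sockets. A toy stdlib sketch of the idea (vLLM's actual implementation differs, so this is illustrative only; the block name is made up):

```python
from multiprocessing import shared_memory

# Writer: create a named shared-memory block and publish a length-prefixed message.
shm = shared_memory.SharedMemory(create=True, size=1024, name="demo_broadcast")
payload = b"metadata for all local ranks"
shm.buf[0:4] = len(payload).to_bytes(4, "little")
shm.buf[4:4 + len(payload)] = payload

# Reader (normally a different process on the same node) attaches by name.
reader = shared_memory.SharedMemory(name="demo_broadcast")
n = int.from_bytes(reader.buf[0:4], "little")
print(bytes(reader.buf[4:4 + n]))  # b'metadata for all local ranks'

reader.close()
shm.close()
shm.unlink()
```
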
Jee Li
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
youkaichao
6c5b7af152
[distributed][misc] use fork by default for mp (#5669) 2024-06-20 17:06:34 -07:00
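
In stdlib terms this amounts to preferring the "fork" start method for worker processes; a minimal illustration (not the actual vLLM code path):

```python
import multiprocessing as mp

def worker() -> None:
    print("hello from a forked worker")

if __name__ == "__main__":
    # "fork" inherits the parent's memory instead of re-importing modules,
    # which makes worker startup much cheaper than "spawn".
    mp.set_start_method("fork")
    p = mp.Process(target=worker)
    p.start()
    p.join()
```
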
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
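
A minimal sketch of the underscore/dash normalization such a parser provides (a hypothetical reimplementation for illustration; the class vLLM added may differ in detail):

```python
import argparse
import sys

class FlexibleArgumentParser(argparse.ArgumentParser):
    """Sketch: accept --foo-bar and --foo_bar interchangeably."""

    def parse_args(self, args=None, namespace=None):
        if args is None:
            args = sys.argv[1:]
        processed = []
        for arg in args:
            if arg.startswith("--"):
                # Normalize underscores to dashes in the option name,
                # leaving any "=value" part untouched.
                key, sep, value = arg.partition("=")
                processed.append(key.replace("_", "-") + sep + value)
            else:
                processed.append(arg)
        return super().parse_args(processed, namespace)

# Both spellings now resolve to the same option:
parser = FlexibleArgumentParser()
parser.add_argument("--preemption-mode")
assert parser.parse_args(["--preemption_mode=swap"]).preemption_mode == "swap"
```
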
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) 2024-06-20 18:36:10 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs (#5591) 2024-06-20 11:52:09 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) 2024-06-19 18:06:44 -04:00
zifeitong
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
Michael Goin
afed90a034
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) 2024-06-19 14:41:42 -04:00
Michael Goin
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 (#5656) 2024-06-19 09:38:26 +00:00
youkaichao
3eea74889f
[misc][distributed] use 127.0.0.1 for single-node (#5619) 2024-06-19 08:05:00 +00:00
Shukant Pal
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628) 2024-06-19 01:46:38 +00:00
Thomas Parnell
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-18 14:18:37 -07:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) 2024-06-18 12:45:05 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support (#4687)
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown guide with screenshots showing users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
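
As a generic illustration, the standard opentelemetry-sdk wiring for emitting spans in Python looks like the following; this is not necessarily how vLLM hooks the tracer in, and the span and attribute names are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Install a tracer provider that batches spans and prints them to stdout;
# a real deployment would export to an OTLP collector instead.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("vllm.example")
with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("request.model", "example-model")
```
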
Chang Su
f0cc0e68e3
[Misc] Remove import from transformers logging (#5625) 2024-06-18 12:12:19 +00:00
youkaichao
db5ec52ad7
[bugfix][distributed] improve p2p capability test (#5612)
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612)
2024-06-18 07:21:05 +00:00
youkaichao
8eadcf0b90
[misc][typo] fix typo (#5620) 2024-06-17 20:54:57 -07:00
Isotr0py
daef218b55
[Model] Initialize Phi-3-vision support (#4986) 2024-06-17 19:34:33 -07:00
sroy745
fa9e385229
[Speculative Decoding 1/2] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) 2024-06-17 21:29:09 -05:00
zifeitong
26e1188e51
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606) 2024-06-17 23:16:10 +00:00
Bruce Fontaine
a3e8a05d4c
[Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) 2024-06-17 15:26:41 -07:00
youkaichao
e441bad674
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584) 2024-06-17 22:08:05 +00:00
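
The optimization here is a classic object pool (free list): recycle token-id lists instead of allocating fresh ones per block. A toy sketch of the idea, not the actual LogicalTokenBlock code:

```python
class TokenIdsPool:
    """Toy free list: reuse token-id lists instead of reallocating them."""

    def __init__(self) -> None:
        self._free: list[list[int]] = []

    def acquire(self) -> list[int]:
        # Hand back a recycled list when one is available.
        return self._free.pop() if self._free else []

    def release(self, token_ids: list[int]) -> None:
        # Clear in place so the backing storage can be reused.
        token_ids.clear()
        self._free.append(token_ids)

pool = TokenIdsPool()
block = pool.acquire()
block.extend([1, 2, 3])
pool.release(block)
assert pool.acquire() is block  # same list object, storage reused
```
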
youkaichao
1b44aaf4e3
[bugfix][distributed] fix 16 gpus local rank arrangement (#5604) 2024-06-17 21:35:04 +00:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support (#5435) 2024-06-17 12:32:48 -04:00
Charles Riggins
9e74d9d003
Correct alignment in the seq_len diagram. (#5592)
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>
2024-06-17 12:05:33 -04:00
Amit Garg
9333fb8eb9
[Model] Rename Phi3 rope scaling type (#5595) 2024-06-17 12:04:14 -04:00
zifeitong
3ce2c050dd
[Fix] Correct OpenAI batch response format (#5554) 2024-06-15 16:57:54 -07:00
Nick Hill
1c0afa13c5
[BugFix] Don't start a Ray cluster when not using Ray (#5570) 2024-06-15 16:30:51 -07:00
SangBin Cho
e691918e3b
[misc] Do not allow using LoRA with chunked prefill. (#5538)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-06-15 14:59:36 +00:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
leiwen83
1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-06-14 17:23:56 -07:00
youkaichao
f5bb85b435
[Core][Distributed] improve p2p cache generation (#5528) 2024-06-14 14:47:45 -07:00
Woosuk Kwon
28c145eb57
[Bugfix] Fix typo in Pallas backend (#5558) 2024-06-14 14:40:09 -07:00
Thomas Parnell
e2afb03c92
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-14 20:28:11 +00:00
Sanger Steel
6e2527a7cb
[Doc] Update documentation on Tensorizer (#5471) 2024-06-14 11:27:57 -07:00
youkaichao
d1c3d7d139
[misc][distributed] fix benign error in is_in_the_same_node (#5512) 2024-06-14 10:59:28 -07:00
Cyrus Leung
77490c6f2f
[Core] Remove duplicate processing in async engine (#5525) 2024-06-14 10:04:42 -07:00
Robert Shaw
15985680e2
[Misc] Rs/compressed tensors cleanup (#5432)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
2024-06-14 10:01:46 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) 2024-06-14 09:30:15 -07:00
Simon Mo
0f0d8bc065
bump version to v0.5.0.post1 (#5522) 2024-06-13 19:42:06 -07:00
Antoni Baum
50eed24d25
Add cuda_device_count_stateless (#5473) 2024-06-13 16:06:49 -07:00
Tyler Michael Smith
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 (#5505) 2024-06-13 13:38:05 -07:00
Antoni Baum
6b0511a57b
Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) 2024-06-13 11:22:50 -07:00
Cody Yu
30299a41fa
[MISC] Remove FP8 warning (#5472)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-13 11:22:30 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels (#5391)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00