xinyun/vllm - vllm - 丝路新云-代码仓

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-04-26 12:47:08 +08:00

Author	SHA1	Message	Date
Woosuk Kwon	cbc53b6b8d	[Hardware][TPU] Support parallel sampling & Swapping (#5855 )	2024-06-26 11:07:49 -07:00
sasha0552	c54269d967	[Frontend] Add tokenize/detokenize endpoints (#5054 )	2024-06-26 16:54:22 +00:00
Luka Govedič	5bfd1bbc98	[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560 ) Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-06-26 15:16:00 +00:00
Woosuk Kwon	3439c5a8e3	[Bugfix][TPU] Fix KV cache size calculation (#5860 )	2024-06-26 00:58:23 -07:00
Woosuk Kwon	6806998bf9	[Bugfix] Fix embedding to support 2D inputs (#5829 )	2024-06-26 00:15:22 -07:00
youkaichao	515080ad2f	[bugfix][distributed] fix shm broadcast when the queue size is full (#5801 )	2024-06-25 21:56:02 -07:00
Roger Wang	3aa7b6cf66	[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832 )	2024-06-25 20:34:25 -07:00
Stephanie Wang	dda4811591	[Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408 ) Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Stephanie <swang@anyscale.com> Co-authored-by: Stephanie <swang@anyscale.com>	2024-06-25 20:30:03 -07:00
aws-patlange	82079729cc	[Bugfix] Fix assertion in NeuronExecutor (#5841 )	2024-06-25 19:52:10 -07:00
Woosuk Kwon	f178e56c68	[Hardware][TPU] Raise errors for unsupported sampling params (#5850 )	2024-06-25 16:58:23 -07:00
Matt Wong	dd793d1de5	[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422 )	2024-06-25 15:56:15 -07:00
Woosuk Kwon	bc34937d68	[Hardware][TPU] Refactor TPU backend (#5831 )	2024-06-25 15:25:52 -07:00
Dipika Sikka	dd248f7675	[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (#5794 )	2024-06-25 19:23:35 +00:00
Antoni Baum	67882dbb44	[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748 )	2024-06-25 10:15:10 -07:00
Jie Fu (傅杰)	7b99314301	[Misc] Remove useless code in cpu_worker (#5824 )	2024-06-25 09:41:36 -07:00
Woo-Yeon Lee	2ce5d6688b	[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414 )	2024-06-25 09:56:06 +00:00
Chang Su	ba991d5c84	[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (#5795 )	2024-06-24 17:01:19 -06:00
Isotr0py	edd5fe5fa2	[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772 )	2024-06-24 12:11:53 +08:00
Murali Andoorveedu	5d4d90536f	[Distributed] Add send and recv helpers (#5719 )	2024-06-23 14:42:28 -07:00
youkaichao	832ea88fcb	[core][distributed] improve shared memory broadcast (#5754 )	2024-06-22 10:00:43 -07:00
Woosuk Kwon	0cbc1d2b4f	[Bugfix] Fix pin_lora error in TPU executor (#5760 )	2024-06-21 22:25:14 -07:00
zifeitong	ff9ddbceee	[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py (#5756 )	2024-06-22 03:33:12 +00:00
Jie Fu (傅杰)	9c62db07ed	[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (#5710 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-22 02:07:08 +00:00
rohithkrn	f5dda63eb5	[LoRA] Add support for pinning lora adapters in the LRU cache (#5603 )	2024-06-21 15:42:46 -07:00
Roger Wang	bd620b01fb	[Kernel][CPU] Add Quick `gelu` to CPU (#5717 )	2024-06-21 06:39:40 +00:00
youkaichao	d9a252bc8e	[Core][Distributed] add shm broadcast (#5399 ) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-06-21 05:12:35 +00:00
Jee Li	67005a07bc	[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-21 04:46:28 +00:00
Joshua Rosenkranz	b12518d3cf	[Model] MLPSpeculator speculative decoding support (#4947 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>	2024-06-20 20:23:12 -04:00
youkaichao	6c5b7af152	[distributed][misc] use fork by default for mp (#5669 )	2024-06-20 17:06:34 -07:00
Michael Goin	8065a7e220	[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718 )	2024-06-20 17:00:13 -06:00
Tyler Michael Smith	3f3b6b2150	[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715 )	2024-06-20 18:36:10 +00:00
Roger Wang	ad137cd111	[Model] Port over CLIPVisionModel for VLMs (#5591 )	2024-06-20 11:52:09 +00:00
Dipika Sikka	4a30d7e3cc	[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650 )	2024-06-19 18:06:44 -04:00
zifeitong	78687504f7	[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654 )	2024-06-19 13:57:12 -07:00
Michael Goin	afed90a034	[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688 )	2024-06-19 14:41:42 -04:00
Michael Goin	da971ec7a5	[Model] Add FP8 kv cache for Qwen2 (#5656 )	2024-06-19 09:38:26 +00:00
youkaichao	3eea74889f	[misc][distributed] use 127.0.0.1 for single-node (#5619 )	2024-06-19 08:05:00 +00:00
Shukant Pal	59a1eb59c9	[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628 )	2024-06-19 01:46:38 +00:00
Thomas Parnell	8a173382c8	[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-06-18 14:18:37 -07:00
sergey-tinkoff	07feecde1a	[Model] LoRA support added for command-r (#5178 )	2024-06-18 11:01:21 -07:00
Dipika Sikka	95db455e7f	[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542 )	2024-06-18 12:45:05 -04:00
Ronen Schaffer	7879f24dcc	[Misc] Add OpenTelemetry support (#4687 ) This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. I've also added a Markdown guide with screenshots showing users how to use this feature.	2024-06-19 01:17:03 +09:00
Chang Su	f0cc0e68e3	[Misc] Remove import from transformers logging (#5625 )	2024-06-18 12:12:19 +00:00
youkaichao	db5ec52ad7	[bugfix][distributed] improve p2p capability test (#5612 ) [bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612)	2024-06-18 07:21:05 +00:00
youkaichao	8eadcf0b90	[misc][typo] fix typo (#5620 )	2024-06-17 20:54:57 -07:00
Isotr0py	daef218b55	[Model] Initialize Phi-3-vision support (#4986 )	2024-06-17 19:34:33 -07:00
sroy745	fa9e385229	[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131 )	2024-06-17 21:29:09 -05:00
zifeitong	26e1188e51	[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606 )	2024-06-17 23:16:10 +00:00
Bruce Fontaine	a3e8a05d4c	[Bugfix] Fix KV head calculation for MPT models when using GQA (#5142 )	2024-06-17 15:26:41 -07:00
youkaichao	e441bad674	[Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584 )	2024-06-17 22:08:05 +00:00

1052 Commits