xinyun/vllm - vllm - 丝路新云 (Silk Road New Cloud) code repository

Mirror of https://git.datalinker.icu/vllm-project/vllm.git, last synced 2026-01-25 05:44:27 +08:00

Author	SHA1	Message	Date
Woosuk Kwon	805a8a75f2	[Misc] Support attention logits soft-capping with flash-attn (#7022)	2024-08-01 13:14:37 -07:00
omkar kakarparthi	562e580abc	Update run-amd-test.sh (#7044)	2024-08-01 13:12:37 -07:00
Murali Andoorveedu	fc912e0886	[Models] Support Qwen model with PP (#6974) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-08-01 12:40:43 -07:00
Michael Goin	f4fd390f5d	[Bugfix] Lower gemma's unloaded_params exception to warning (#7002)	2024-08-01 12:01:07 -07:00
Michael Goin	fb3db61688	[CI/Build] Remove sparseml requirement from testing (#7037)	2024-08-01 12:00:51 -07:00
Isotr0py	2dd34371a6	[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm (#6992)	2024-08-01 12:00:28 -07:00
Sage Moore	7e0861bd0b	[CI/Build] Update PyTorch to 2.4.0 (#6951) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-08-01 11:11:24 -07:00
Alexei-V-Ivanov-AMD	a72a424b3e	[Build/CI] Fixing Docker Hub quota issue. (#7043)	2024-08-01 11:07:37 -07:00
youkaichao	c8a7e93273	[core][scheduler] simplify and improve scheduler (#6867)	2024-07-31 23:51:09 -07:00
zifeitong	3c10591ef2	[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954)	2024-07-31 21:13:34 -07:00
Aurick Qiao	0437492ea9	PP comm optimization: replace send with partial send + allgather (#6695) Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>	2024-07-31 20:15:42 -07:00
Travis Johnson	630dd9e0ae	[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings (#6758) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-07-31 19:49:11 -07:00
Woosuk Kwon	23993a7997	[Bugfix][TPU] Do not use torch.Generator for TPUs (#6981)	2024-07-31 18:50:28 -07:00
xuyi	1d2e7fb73f	[Model] Pipeline parallel support for Qwen2 (#6924)	2024-07-31 18:49:51 -07:00
Jee Jee Li	7ecee34321	[Kernel][RFC] Refactor the punica kernel based on Triton (#5036)	2024-07-31 17:12:24 -07:00
Simon Mo	7eb0cb4a14	Revert "[Frontend] Factor out code for running uvicorn" (#7012) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-07-31 16:34:26 -07:00
Michael Goin	a0dce9383a	[Misc] Add compressed-tensors to optimized quant list (#7006)	2024-07-31 14:40:44 -07:00
Varun Sundar Rabindranath	35e9c12bfa	[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath	93548eb37e	[Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-31 14:40:22 -07:00
Michael Goin	460c1884e3	[Bugfix] Support cpu offloading with fp8 quantization (#6960)	2024-07-31 12:47:46 -07:00
Cody Yu	bd70013407	[MISC] Introduce pipeline parallelism partition strategies (#6920) Co-authored-by: youkaichao <youkaichao@126.com>	2024-07-31 12:02:17 -07:00
Avshalom Manevich	2ee8d3ba55	[Model] use FusedMoE layer in Jamba (#6935)	2024-07-31 12:00:24 -07:00
Cyrus Leung	daed30c4a9	[Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982)	2024-07-31 23:46:17 +08:00
Alphi	2f4e108f75	[Bugfix] Clean up MiniCPM-V (#6939) Co-authored-by: hezhihui <hzh7269@modelbest.cn> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-07-31 14:39:19 +00:00
HandH1998	6512937de1	Support W4A8 quantization for vllm (#5218)	2024-07-31 07:55:21 -06:00
Fei	c0644cf9ce	[Bugfix] fix logit processor exceed vocab size issue (#6927)	2024-07-31 16:16:01 +08:00
Woosuk Kwon	533d1932d2	[Bugfix][TPU] Set readonly=True for non-root devices (#6980)	2024-07-31 00:19:28 -07:00
Cyrus Leung	9f0e69b653	[CI/Build] Fix mypy errors (#6968)	2024-07-30 19:49:48 -07:00
Cyrus Leung	f230cc2ca6	[Bugfix] Fix broadcasting logic for `multi_modal_kwargs` (#6836)	2024-07-31 10:38:45 +08:00
Cyrus Leung	da1f7cc12a	[mypy] Enable following imports for some directories (#6681)	2024-07-31 10:38:03 +08:00
Cade Daniel	c32ab8be1a	[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding (#6964)	2024-07-31 00:53:21 +00:00
Cade Daniel	fb4f530bf5	[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists (#6706)	2024-07-30 16:28:49 -07:00
Cade Daniel	79319cedfa	[Nightly benchmarking suite] Remove pkill python from run benchmark suite (#6965)	2024-07-30 16:28:05 -07:00
Simon Mo	40c27a7cbb	[Build] Temporarily Disable Kernels and LoRA tests (#6961)	2024-07-30 14:59:48 -07:00
youkaichao	6ca8031e71	[core][misc] improve free_finished_seq_groups (#6865) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-07-30 14:32:12 -07:00
Tyler Michael Smith	d7a299edaa	[Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842)	2024-07-30 16:37:01 -04:00
Sanger Steel	052b6f8ca4	[Bugfix] Fix tensorizer memory profiling bug during testing (#6881)	2024-07-30 11:48:50 -07:00
Ilya Lavrenov	5895b24677	[OpenVINO] Updated OpenVINO requirements and build docs (#6948)	2024-07-30 11:33:01 -07:00
Tyler Michael Smith	cbbc904470	[Kernel] Squash a few more warnings (#6914)	2024-07-30 13:50:42 -04:00
Nick Hill	5cf9254a9c	[BugFix] Fix use of per-request seed with pipeline parallel (#6698)	2024-07-30 10:40:08 -07:00
fzyzcjy	f058403683	[Doc] Super tiny fix doc typo (#6949)	2024-07-30 09:14:03 -07:00
Roger Wang	c66c7f86ac	[Bugfix] Fix PaliGemma MMP (#6930)	2024-07-30 02:20:57 -07:00
Woosuk Kwon	6e063ea35b	[TPU] Fix greedy decoding (#6933)	2024-07-30 02:06:29 -07:00
Varun Sundar Rabindranath	af647fb8b3	[Kernel] Tuned int8 kernels for Ada Lovelace (#6848) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-07-29 20:24:58 -06:00
Tyler Michael Smith	61a97c32f6	[Kernel] Fix marlin divide-by-zero warnings (#6904)	2024-07-30 01:26:07 +00:00
Kevin H. Luu	4fbf4aa128	[ci] GHA workflow to remove ready label upon "/notready" comment (#6921) Signed-off-by: kevin <kevin@anyscale.com>	2024-07-29 17:03:45 -07:00
Tyler Michael Smith	aae6d36f7e	[Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908)	2024-07-29 18:01:17 -06:00
Nick Hill	9f69d8245a	[Frontend] New `allowed_token_ids` decoding request parameter (#6753)	2024-07-29 23:37:27 +00:00
Thomas Parnell	9a7e2d0534	[Bugfix] Allow vllm to still work if triton is not installed. (#6786) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-07-29 14:51:27 -07:00
Earthwalker	7f8d612d24	[TPU] Support tensor parallelism in async llm engine (#6891)	2024-07-29 12:42:21 -07:00


2160 Commits