xinyun/vllm - vllm - 丝路新云-代码仓

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-06-10 12:49:08 +08:00

Author	SHA1	Message	Date
Lucas Wilkinson	75e94309e8	[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676 ) Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-02-04 18:22:24 -08:00
Mark McLoughlin	233df6f5c4	[V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-02-04 19:46:54 -05:00
Cyrus Leung	18016a5e62	[Bugfix] Fix CI failures for InternVL and Mantis models (#12728 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-02-04 23:54:23 +08:00
Sophie du Couédic	649550f27e	[Build] update requirements of no-device for plugin usage (#12630 ) Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>	2025-02-04 21:19:12 +08:00
Kero Liang	62467a834a	Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722 ) Signed-off-by: imkero <kerorek@outlook.com>	2025-02-04 21:03:19 +08:00
Michael Greenbaum	6469038b14	[Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689 ) Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com> Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>	2025-02-04 20:58:48 +08:00
Isotr0py	815079de8e	[VLM] merged multimodal processor and V1 support for idefics3 (#12660 ) Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2025-02-04 20:00:51 +08:00
Woosuk Kwon	18a88fcccc	[V1] Remove scheduling constraint on partial requests (#12674 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-04 02:43:58 -08:00
Cyrus Leung	d1ca7df84d	[VLM] Merged multi-modal processor for InternVL-based models (#12553 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>	2025-02-04 16:44:52 +08:00
Jee Jee Li	96b23621c1	[Misc] Add BNB quantization for Whisper (#12381 ) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>	2025-02-04 16:27:36 +08:00
Hongxia Yang	c36ac98d01	[AMD][ROCm] Enable DeepSeek model on ROCm (#12662 ) Signed-off-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>	2025-02-04 08:24:11 +00:00
Kyle Sayers	4896d0c2dd	[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711 )	2025-02-03 23:27:11 -08:00
Thomas Parnell	bb392af434	[Doc] Replace ibm-fms with ibm-ai-platform (#12709 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2025-02-04 07:05:04 +00:00
Michael Goin	5d98d56089	Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2025-02-04 11:55:46 +08:00
Russell Bryant	73b35cca7f	[Core] Improve hash collision avoidance in prefix caching (#12621 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-03 16:28:20 -08:00
Cody Yu	5095e96606	[V1] Revert `uncache_blocks` and support recaching full blocks (#12415 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-03 15:04:53 -08:00
Cody Yu	cf58b9c4ca	[MISC] Remove model input dumping when exception (#12582 ) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-02-03 13:34:16 -08:00
kushanam	4797dad3ec	[Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707 )	2025-02-03 13:30:39 -08:00
Kyle Sayers	6dd5e52823	Squelch MLA warning for Compressed-Tensors Models (#12704 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2025-02-03 13:29:56 -08:00
Tyler Michael Smith	c11de33dad	[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696 ) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-03 13:04:59 -08:00
Russell Bryant	33e0602e59	[Misc] Fix improper placement of SPDX header in scripts (#12694 ) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-03 11:16:59 -08:00
Arthur	a1a2aaadb9	[Model]: Add `transformers` backend support (#11330 ) # Adds support for `transformers` as a backend Following https://github.com/huggingface/transformers/pull/35235, a bunch of models should already be supported, we are ramping up support for more models. Thanks @Isotr0py for the TP support, and @hmellor for his help as well! This includes: - `trust_remote_code=True` support: any model on the hub, if it implements attention the correct way can be natively supported!! - tensor parallel support --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2025-02-03 21:30:38 +08:00
youkaichao	1298a400e8	[ci/build] fix gh200 test (#12681 ) Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-03 15:59:49 +08:00
youkaichao	ad4a9dc817	[cuda] manually import the correct pynvml module (#12679 ) fixes problems like https://github.com/vllm-project/vllm/pull/12635 and https://github.com/vllm-project/vllm/pull/12636 and https://github.com/vllm-project/vllm/pull/12565 --------- Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-03 15:58:21 +08:00
Srikanth Srinivas	b9986454fe	Fix for attention layers to remain unquantized during moe_wn16 quant (#12570 ) Fix to AWQ quant loading of the new R1 model The new optimized MoE kernels for a large number of experts `moe_wn16` uses AWQ quant which requires the attention layers to be in 16bit The current merge has broken this, and the `get_quant_method` must return None for it to work correctly again --------- Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Beim <beim2015@outlook.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Ryan N <ryan.nguyen@centml.ai> Signed-off-by: Brian Dellabetta <bdellabe@redhat.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Rahul Tuli <rahul@neuralmagic.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Beim <805908499@qq.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com> Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com> Co-authored-by: fade_away <1028552010@qq.com> Co-authored-by: weilong.yu <weilong.yu@shopee.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Shawn Du <shawnd200@outlook.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: youkaichao <youkaichao@gmail.com>	2025-02-03 13:46:19 +08:00
Eldar Kurtic	c5932e5dac	Properly check if all fused layers are in the list of targets (#12666 ) Thanks @kylesayrs for catching this!	2025-02-03 13:42:18 +08:00
youkaichao	20579c0fae	make sure mistral_common not imported for non-mistral models (#12669 ) When people use deepseek models, they find that they need to solve cv2 version conflict, see https://zhuanlan.zhihu.com/p/21064432691 . I added the check, and make all imports of `cv2` lazy. --------- Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-03 13:40:25 +08:00
Yang Chen	95460fc513	[Kernel] port sgl moe_align_block_size kernels (#12574 ) sgl_moe_align_block_size is based on: `ded9fcd09a` moe_align_block_size is based on: `ba5112ff69` Signed-off-by: Yang Chen <yangche@fb.com>	2025-02-03 13:09:50 +08:00
Zhuohan Li	326fcc8b9f	[Doc] Deprecate Discord (#12668 )	2025-02-02 19:19:56 -08:00
youkaichao	e64330910b	[doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667 ) As more and more people are trying deepseek models with multi-node inference, https://github.com/vllm-project/vllm/issues/7815 becomes more frequent. Let's give clear message to users. Signed-off-by: youkaichao <youkaichao@gmail.com>	2025-02-03 09:32:18 +08:00
Russell Bryant	e489ad7a21	[Misc] Add SPDX-License-Identifier headers to python source files (#12628 ) - Add SPDX license headers to python source files - Check for SPDX headers using pre-commit commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745 Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:18:24 2025 -0500 Add SPDX license headers to python source files This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance. The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job. More information can be found on the SPDX site: - https://spdx.dev/learn/handling-license-info/ Signed-off-by: Russell Bryant <rbryant@redhat.com> commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea Author: Russell Bryant <rbryant@redhat.com> Date: Fri Jan 31 14:36:32 2025 -0500 Check for SPDX headers using pre-commit Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-02 11:58:18 -08:00
Kunshang Ji	f256ebe4df	[Hardware][Intel GPU] add XPU bf16 support (#12392 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2025-02-02 10:17:26 +00:00
Shawn Du	f8ece6e17f	[Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608 ) As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254, this PR achieves the task: combine allocate_slots and append_slots. There should be no functionality change, except that in decode, also raise exception when num_tokens is zero (like prefill), and change the unit test case accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>	2025-02-02 16:40:58 +08:00
Woosuk Kwon	abfcdcdf27	[V1][Minor] Avoid frequently creating ConstantList (#12653 ) A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used. Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-02-01 23:43:20 -08:00
Russell Bryant	e497f33491	[Core] Silence unnecessary deprecation warnings (#12620 ) I noticed during testing that I was getting a lot of these deprecation warnings about `local_lora_path`: ``` DeprecationWarning: The 'lora_local_path' attribute is deprecated and will be removed in a future version. Please use 'lora_path' instead. ``` The check used for emitting this warning was always True, even when the parameter was not actually specified. It will always be in `__struct_fields__`. We should be checking for a non-None value, instead. Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-02-02 15:35:50 +08:00
Jinzhen Lin	baaa2b24da	[Bugfix] fix moe_wna16 get_quant_method (#12648 ) Fix https://github.com/vllm-project/vllm/issues/12647 The `get_quant_method` of `moe_wna16` always return moe method, GPTQ-based linear method or AWQ-based linear method, even when the target module is attention layer. `baeded2569/vllm/attention/layer.py (L86-L92)` Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>	2025-02-02 15:29:56 +08:00
Vicente Herrera	b4e5c03306	doc: fixing minor typo in readme.md (#12643 ) Word "evolved" was mistyped Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> --------- Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>	2025-02-01 17:17:29 +00:00
Michael Goin	3194039c0e	Apply torch.compile to fused_moe/grouped_topk (#12637 )	2025-02-01 16:16:19 +00:00
Simon Mo	4f4d427ac2	Disable chunked prefill and/or prefix caching when MLA is enabled (#12642 ) From @mgoin in https://github.com/vllm-project/vllm/pull/12638 I cannot push to that branch, therefore a new PR to unblock release. --------- Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Co-authored-by: mgoin <michael@neuralmagic.com> v0.7.1	2025-01-31 23:46:57 -08:00
Russell Bryant	1e3698393f	[CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280 ) We have `v1`, `structured-output`, and `speculative-decoding` labels on github. This adds automation for applying these labels based on the files touched by a PR. Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-01-31 23:13:10 -08:00
Lucas Wilkinson	baeded2569	[Attention] Deepseek v3 MLA support with FP8 compute (#12601 ) This PR implements the Deepseek V3 support by performing matrix absorption the fp8 weights --------- Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>	2025-01-31 21:52:51 -08:00
Rahul Tuli	3e1c76cf3a	Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517 ) This PR addresses a bug in the Cutlass integration where the `sparsity_config.ignore` list was not being respected. When only a subset of modules were configured as Sparse24, the system incorrectly selected Cutlass for non-sparse modules as well. This update ensures the correct scheme is selected for non-sparse modules, fixing this behavior. --- ### Changes - Updated logic to correctly respect `sparsity_config.ignore`. - Ensured non-sparse modules use the appropriate scheme instead of defaulting to Cutlass. --- <details> <summary>Testing Setup</summary> The fix has been tested on top of [this diff](https://github.com/vllm-project/vllm/pull/12097). #### Steps to Test: ```bash git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16 git cherry-pick ca624cddb # this branch ``` #### Additional Patch Required: ```diff diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py index a54177c1c..f916dd0c9 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py @@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs, QuantizationStrategy, QuantizationType) from pydantic import BaseModel - +from vllm.logger import init_logger from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, UnquantizedLinearMethod) @@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import ( should_ignore_layer) from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod from vllm.platforms import current_platform - +logger = init_logger(__name__) __all__ = ["CompressedTensorsLinearMethod"] SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config" ``` Apply using: ```bash git apply logging-patch.patch ``` </details> --- <details> <summary>Models Tested</summary> - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed` </details> --- <details> <summary>Example Output</summary> #### Layers 0-5 (Sparse24) ``` Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj ... ``` #### Layers 6+ (Non-Sparse, FP8) ``` Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj ... ``` </details> Note: Assumed all modules in fused layers such as `QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme. --- For related tasks using the Asana app for GitHub, refer to [[this link](https://app.asana.com/0/0/1209227810815160)](https://app.asana.com/0/0/1209227810815160). Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>	2025-02-01 13:41:59 +08:00
Tyler Michael Smith	cfa134d247	[Bugfix/CI] Fixup benchmark_moe.py (#12562 ) Fixes `is_marlin` not being passed into `get_default_config` Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size` Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-01 13:41:35 +08:00
Kevin H. Luu	35b7a05507	[ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599 )	2025-01-31 21:22:23 -08:00
Eldar Kurtic	1867c258bd	Fix target matching for fused layers with compressed-tensors (#12617 ) Without this PR --------------- Quantizing models with llm-compressor and a recipe that explicitly lists names of layers produces a model that is not loadable by vLLM (i.e. `vllm serve <model>` fails with `raise ValueError(f"Unable to find matching target for {module} in the ...`). Example recipe: ``` recipe = """ quantization_stage: run_type: oneshot quantization_modifiers: GPTQModifier: ignore: ["lm_head"] config_groups: group_0: weights: num_bits: 4 type: "int" symmetric: true strategy: "group" group_size: 128 targets: [ "model.layers.0.mlp.down_proj", "model.layers.2.mlp.down_proj", "model.layers.3.mlp.down_proj", "model.layers.4.mlp.down_proj", "model.layers.5.mlp.down_proj", "model.layers.6.mlp.down_proj", "model.layers.7.mlp.down_proj", "model.layers.8.mlp.down_proj", "model.layers.9.mlp.down_proj", "model.layers.10.mlp.down_proj", "model.layers.11.mlp.down_proj", "model.layers.12.mlp.down_proj", "model.layers.13.mlp.down_proj", "model.layers.14.mlp.down_proj", "model.layers.15.mlp.down_proj", "model.layers.16.mlp.down_proj", "model.layers.17.mlp.down_proj", "model.layers.19.mlp.down_proj", "model.layers.21.mlp.down_proj", "model.layers.22.mlp.down_proj", . . . ] """ ``` To reproduce the vLLM error: ```bash vllm serve nm-testing/eldar-test ``` With this PR ------------ Models are loaded correctly without any errors.	2025-02-01 05:07:46 +00:00
fade_away	cb3e73e4c8	[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161 ) FIX issue https://github.com/vllm-project/vllm/issues/9688 https://github.com/vllm-project/vllm/issues/11086 #12487 --------- Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: weilong.yu <weilong.yu@shopee.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2025-02-01 12:52:07 +08:00
Robert Shaw	b1340f9d55	[V1] Bugfix: Validate Model Input Length (#12600 ) SUMMARY: * avoid crashing the engine when we get an input longer than max_model_len FIX #12567(link existing issues this PR will resolve)	2025-01-31 18:32:04 -08:00
Brian Dellabetta	44bbca78d7	[Doc] int4 w4a16 example (#12585 ) Based on a request by @mgoin , with @kylesayrs we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py) FIX #n/a (no issue created) @kylesayrs and I have discussed a couple additional improvements for the quantization docs. We will revisit at a later date, possibly including: - A section for "choosing the correct quantization scheme/ compression technique" - Additional vision or audio calibration datasets --------- Signed-off-by: Brian Dellabetta <bdellabe@redhat.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-01-31 15:38:48 -08:00
Harry Mellor	60808bd4c7	[Doc] Improve installation signposting (#12575 ) - Make device tab names more explicit - Add comprehensive list of devices to https://docs.vllm.ai/en/latest/getting_started/installation/index.html - Add `attention` blocks to the intro of all devices that don't have pre-built wheels/images --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-31 15:38:35 -08:00
Ryan Nguyen	fc542144c4	[Feature] Fix guided decoding blocking bitmask memcpy (#12563 ) [Guided decoding performance optimization] Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR changes the operation to async via setting `non-blocking=True`. (Current) The CPU is blocked on a `cudaStreamSynchronize` and only pre-empts the sampling kernels after bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B. ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824) With the optimization, this is no longer the case: ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7) --------- Signed-off-by: Ryan N <ryan.nguyen@centml.ai>	2025-01-31 15:37:30 -08:00

1 2 3 4 5 ...

4428 Commits