xinyun/vllm - vllm - 丝路新云 (code repository)

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-07-06 07:47:08 +08:00

Author	SHA1	Message	Date
Michael Goin	3194039c0e	Apply torch.compile to fused_moe/grouped_topk (#12637 )	2025-02-01 16:16:19 +00:00
Simon Mo	4f4d427ac2	Disable chunked prefill and/or prefix caching when MLA is enabled (#12642 ) From @mgoin in https://github.com/vllm-project/vllm/pull/12638 I cannot push to that branch, therefore a new PR to unblock release. --------- Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Co-authored-by: mgoin <michael@neuralmagic.com> v0.7.1	2025-01-31 23:46:57 -08:00
Russell Bryant	1e3698393f	[CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280 ) We have `v1`, `structured-output`, and `speculative-decoding` labels on github. This adds automation for applying these labels based on the files touched by a PR. Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>	2025-01-31 23:13:10 -08:00
Lucas Wilkinson	baeded2569	[Attention] Deepseek v3 MLA support with FP8 compute (#12601 ) This PR implements the Deepseek V3 support by performing matrix absorption on the fp8 weights --------- Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>	2025-01-31 21:52:51 -08:00
Rahul Tuli	3e1c76cf3a	Fix: Respect `sparsity_config.ignore` in Cutlass Integration (#12517 ) This PR addresses a bug in the Cutlass integration where the `sparsity_config.ignore` list was not being respected. When only a subset of modules were configured as Sparse24, the system incorrectly selected Cutlass for non-sparse modules as well. This update ensures the correct scheme is selected for non-sparse modules, fixing this behavior. --- ### Changes - Updated logic to correctly respect `sparsity_config.ignore`. - Ensured non-sparse modules use the appropriate scheme instead of defaulting to Cutlass. --- <details> <summary>Testing Setup</summary> The fix has been tested on top of [this diff](https://github.com/vllm-project/vllm/pull/12097). #### Steps to Test: ```bash git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16 git cherry-pick ca624cddb # this branch ``` #### Additional Patch Required: ```diff diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py index a54177c1c..f916dd0c9 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py @@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs, QuantizationStrategy, QuantizationType) from pydantic import BaseModel - +from vllm.logger import init_logger from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, UnquantizedLinearMethod) @@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import ( should_ignore_layer) from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod from vllm.platforms import 
current_platform - +logger = init_logger(__name__) __all__ = ["CompressedTensorsLinearMethod"] SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config" ``` Apply using: ```bash git apply logging-patch.patch ``` </details> --- <details> <summary>Models Tested</summary> - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed` </details> --- <details> <summary>Example Output</summary> #### Layers 0-5 (Sparse24) ``` Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj ... ``` #### Layers 6+ (Non-Sparse, FP8) ``` Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj ... ``` </details> Note: Assumed all modules in fused layers such as `QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme. --- For related tasks using the Asana app for GitHub, refer to [this link](https://app.asana.com/0/0/1209227810815160). Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>	2025-02-01 13:41:59 +08:00
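The behaviour described in #12517 (modules listed in `sparsity_config.ignore` must not receive the sparse scheme) can be illustrated with a small sketch. This is not the vLLM code: `select_scheme`, its parameters, and the glob-style matching of ignore entries are assumptions made for illustration; only the two scheme names come from the example output in the commit message.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def select_scheme(
    layer_name: str,
    sparse_ignore: Sequence[str],
    sparse_scheme: str = "CompressedTensors24",
    fallback_scheme: str = "CompressedTensorsW8A8Fp8",
) -> str:
    """Pick a scheme name for a layer (hypothetical helper).

    Layers matching any entry of the ignore list are treated as non-sparse
    and receive the fallback scheme; every other layer gets the sparse one.
    """
    # Assumption: ignore entries are exact names or glob patterns.
    if any(fnmatchcase(layer_name, pattern) for pattern in sparse_ignore):
        return fallback_scheme
    return sparse_scheme


ignore = ["model.layers.6.*", "lm_head"]
for name in ("model.layers.0.mlp.down_proj", "model.layers.6.mlp.down_proj"):
    print(f"Using scheme: {select_scheme(name, ignore)} for {name}")
```

The point of the sketch is only the ordering of the checks: the ignore list is consulted before the sparse scheme is chosen, so a partially sparse model does not fall through to the sparse kernel for every layer.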
Tyler Michael Smith	cfa134d247	[Bugfix/CI] Fixup benchmark_moe.py (#12562 ) Fixes `is_marlin` not being passed into `get_default_config` Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size` Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-02-01 13:41:35 +08:00
Kevin H. Luu	35b7a05507	[ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599 )	2025-01-31 21:22:23 -08:00
Eldar Kurtic	1867c258bd	Fix target matching for fused layers with compressed-tensors (#12617 ) Without this PR --------------- Quantizing models with llm-compressor and a recipe that explicitly lists names of layers produces a model that is not loadable by vLLM (i.e. `vllm serve <model>` fails with `raise ValueError(f"Unable to find matching target for {module} in the ...`). Example recipe: ``` recipe = """ quantization_stage: run_type: oneshot quantization_modifiers: GPTQModifier: ignore: ["lm_head"] config_groups: group_0: weights: num_bits: 4 type: "int" symmetric: true strategy: "group" group_size: 128 targets: [ "model.layers.0.mlp.down_proj", "model.layers.2.mlp.down_proj", "model.layers.3.mlp.down_proj", "model.layers.4.mlp.down_proj", "model.layers.5.mlp.down_proj", "model.layers.6.mlp.down_proj", "model.layers.7.mlp.down_proj", "model.layers.8.mlp.down_proj", "model.layers.9.mlp.down_proj", "model.layers.10.mlp.down_proj", "model.layers.11.mlp.down_proj", "model.layers.12.mlp.down_proj", "model.layers.13.mlp.down_proj", "model.layers.14.mlp.down_proj", "model.layers.15.mlp.down_proj", "model.layers.16.mlp.down_proj", "model.layers.17.mlp.down_proj", "model.layers.19.mlp.down_proj", "model.layers.21.mlp.down_proj", "model.layers.22.mlp.down_proj", . . . ] """ ``` To reproduce the vLLM error: ```bash vllm serve nm-testing/eldar-test ``` With this PR ------------ Models are loaded correctly without any errors.	2025-02-01 05:07:46 +00:00
fade_away	cb3e73e4c8	[BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161 ) FIX issue https://github.com/vllm-project/vllm/issues/9688 https://github.com/vllm-project/vllm/issues/11086 #12487 --------- Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: weilong.yu <weilong.yu@shopee.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2025-02-01 12:52:07 +08:00
Robert Shaw	b1340f9d55	[V1] Bugfix: Validate Model Input Length (#12600 ) SUMMARY: * avoid crashing the engine when we get an input longer than max_model_len FIX #12567	2025-01-31 18:32:04 -08:00
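A minimal sketch of the kind of check #12600 describes: reject an over-long prompt up front with an ordinary error instead of letting it reach the engine. The function name `validate_prompt_length` and the error text are hypothetical and are not taken from vLLM.

```python
from typing import Sequence


def validate_prompt_length(prompt_token_ids: Sequence[int],
                           max_model_len: int) -> None:
    """Raise ValueError if the prompt cannot fit in the model's context."""
    num_tokens = len(prompt_token_ids)
    if num_tokens > max_model_len:
        raise ValueError(
            f"Prompt has {num_tokens} tokens, which exceeds "
            f"max_model_len={max_model_len}.")


# A request that fits passes silently; one that does not is rejected
# before any engine work is scheduled.
validate_prompt_length([1, 2, 3], max_model_len=4)
try:
    validate_prompt_length(list(range(10)), max_model_len=4)
except ValueError as err:
    print(f"rejected: {err}")
```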
Brian Dellabetta	44bbca78d7	[Doc] int4 w4a16 example (#12585 ) Based on a request by @mgoin , with @kylesayrs we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py) FIX #n/a (no issue created) @kylesayrs and I have discussed a couple additional improvements for the quantization docs. We will revisit at a later date, possibly including: - A section for "choosing the correct quantization scheme/ compression technique" - Additional vision or audio calibration datasets --------- Signed-off-by: Brian Dellabetta <bdellabe@redhat.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-01-31 15:38:48 -08:00
Harry Mellor	60808bd4c7	[Doc] Improve installation signposting (#12575 ) - Make device tab names more explicit - Add comprehensive list of devices to https://docs.vllm.ai/en/latest/getting_started/installation/index.html - Add `attention` blocks to the intro of all devices that don't have pre-built wheels/images --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-31 15:38:35 -08:00
Ryan Nguyen	fc542144c4	[Feature] Fix guided decoding blocking bitmask memcpy (#12563 ) [Guided decoding performance optimization] Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR makes the operation asynchronous by setting `non_blocking=True`. (Current) The CPU is blocked on a `cudaStreamSynchronize` and only pre-empts the sampling kernels after bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B. ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824) With the optimization, this is no longer the case: ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7) --------- Signed-off-by: Ryan N <ryan.nguyen@centml.ai>	2025-01-31 15:37:30 -08:00
Tyler Michael Smith	eb5741ad42	[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587 ) Integrates the block-quantized kernels introduced in https://github.com/vllm-project/vllm/pull/11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>	2025-01-31 15:29:11 -08:00
Robert Shaw	145c2ff648	[Bugfix] Revert MoE Triton Config Default (#12629 ) SUMMARY: * previous PR for pulling in block configs also changed defaults (https://github.com/vllm-project/vllm/pull/11589/files) for FP8 * this broke L4 MoE since there was not enough SHM for the default configuration * this reverts the non-block example to the default Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>	2025-01-31 15:28:47 -08:00
Kevin H. Luu	415f19474d	[release] Add input step to ask for Release version (#12631 ) Instead of having to create a new build with release version put in as env var.	2025-01-31 13:39:36 -08:00
Chen Zhang	89003c4082	[v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603 ) This PR adds extra keys to the block hash, so that two blocks with the same token string but different extra_keys in their parent blocks get different hash values. For example, it generates different hash values for the second block of the following two requests: ```python request1 = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash1", "hash2"], ) request2 = make_request( request_id=1, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash3", "hash2"], ) ``` --------- Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2025-01-31 13:13:04 -08:00
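The effect described in #12603 can be reproduced with a small chained-hash sketch. It is an illustration under stated assumptions, not vLLM's implementation: `hash_block`, `hash_request`, the block size of 3, the use of SHA-256, and the rule that a block's extra keys are the hashes of the multimodal items overlapping it are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Sequence

BLOCK_SIZE = 3  # assumed block size for this illustration


def hash_block(parent_hash: Optional[str], token_ids: Sequence[int],
               extra_keys: Sequence[str]) -> str:
    """Hash a block from its parent's hash, its tokens, and its extra keys."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode()).hexdigest()


def hash_request(token_ids: Sequence[int],
                 mm_positions: Sequence[Dict[str, int]],
                 mm_hashes: Sequence[str]) -> List[str]:
    """Return the chained hashes of every full block in a request."""
    hashes: List[str] = []
    parent: Optional[str] = None
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        end = start + BLOCK_SIZE
        # Extra keys: hashes of multimodal items overlapping [start, end).
        extra = [
            h for pos, h in zip(mm_positions, mm_hashes)
            if pos["offset"] < end and pos["offset"] + pos["length"] > start
        ]
        parent = hash_block(parent, token_ids[start:end], extra)
        hashes.append(parent)
    return hashes


positions = [{"offset": 0, "length": 3}, {"offset": 3, "length": 3}]
h1 = hash_request(list(range(6)), positions, ["hash1", "hash2"])
h2 = hash_request(list(range(6)), positions, ["hash3", "hash2"])
print(h1[1] != h2[1])
```

The second blocks of the two requests hold the same tokens and the same extra key (`hash2`), but their parents were hashed with different extra keys (`hash1` versus `hash3`), so the chained hashes of the second blocks differ as well.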
Cody Yu	60bcef000e	[Docs][V1] Prefix caching design (#12598 ) - Create v1 design document section in docs. - Add prefix caching design doc. @WoosukKwon @ywang96 --------- Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-01-31 12:30:46 -08:00
Cody Yu	847f883232	[Git] Automatically sign-off commits (#12595 ) It's very annoying when I forget to add `-s` in `git commit` to sign-off, because I then need to `git rebase HEAD~1 --signoff` and `git push -f` to fix the DCO. This PR adds a hook to sign off commits automatically when `-s` is missing to solve this problem. The only change from the user side is now users have to install 2 hooks, so instead of just ``` pre-commit install ``` Now we need to ``` pre-commit install --hook-type pre-commit --hook-type commit-msg ``` Note that even if users still only install the pre-commit hook, they won't get any error in `git commit`. Just the sign-off hook won't run. cc @hmellor @youkaichao --------- Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>	2025-01-31 12:30:33 -08:00
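What a `commit-msg` hook of this kind has to do can be sketched as a pure function: append a `Signed-off-by:` trailer only when the message does not already carry it. This is not the hook added by #12595; `ensure_signoff` and its parameters are hypothetical, and a real hook would read the name and email from the git configuration and rewrite the message file that git passes to it.

```python
def ensure_signoff(message: str, name: str, email: str) -> str:
    """Return the commit message with a Signed-off-by trailer for name/email.

    The message is returned unchanged if the exact trailer is already present,
    so running the hook on a commit made with `git commit -s` is a no-op.
    """
    trailer = f"Signed-off-by: {name} <{email}>"
    body = message.rstrip("\n")
    if any(line.strip() == trailer for line in body.splitlines()):
        return message
    # Separate the trailer from the body with a blank line.
    return f"{body}\n\n{trailer}\n"


signed = ensure_signoff("Fix typo in docs\n", "Jane Doe", "jane@example.com")
print(signed, end="")
```

Keeping the check idempotent matters because the hook runs on every commit, including amended ones that already contain the trailer.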
Robert Shaw	325f679f32	[BugFix] Fix Torch.Compile For DeepSeek (#12594 ) Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-31 12:06:39 -08:00
Harry Mellor	e3f7ff65e7	Add favicon to docs (#12611 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-31 09:20:34 -08:00
Roger Wang	7a8987dac5	[Bugfix] Gracefully handle huggingface hub http error (#12571 )	2025-01-31 08:19:35 +00:00
Lucas Wilkinson	cabaf4eff3	[Attention] MLA decode optimizations (#12528 ) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-30 23:49:37 -08:00
Aleksandr Malyshev	a1fc18c030	[ROCm][AMD][Model] llama 3.2 support upstreaming (#12421 ) Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>	2025-01-31 12:24:28 +08:00
Lucas Wilkinson	9798b2fb00	[Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling (#11868 )	2025-01-30 18:33:00 -08:00
Michael Goin	4078052f09	[V1][Log] Add max request concurrency log to V1 (#12569 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2025-01-30 23:07:19 +00:00
Nishidha	bd2107e30a	[CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555 ) Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>	2025-01-30 16:29:39 -05:00
Robert Shaw	9b0c4bab36	[Kernel] Triton Configs for Fp8 Block Quantization (#11589 ) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu>	2025-01-30 11:53:22 -08:00
Beim	41bf5612f5	[Misc] fix typo: add missing space in lora adapter error message (#12564 ) Signed-off-by: Beim <beim2015@outlook.com>	2025-01-30 15:39:22 +00:00
Harry Mellor	a2769032ca	Set `?device={device}` when changing tab in installation guides (#12560 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-30 00:05:42 -08:00
Mark McLoughlin	f17f1d4608	[V1][Metrics] Add GPU cache usage % gauge (#12561 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-29 18:31:01 -08:00
Divakar Verma	1c1bb0bbf2	[Misc][MoE] add Deepseek-V3 moe tuning support (#12558 ) Signed-off-by: Divakar Verma <divakar.verma@amd.com>	2025-01-30 00:47:30 +00:00
Woosuk Kwon	e0cc5f259a	[V1][BugFix] Free encoder cache for aborted requests (#12545 ) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2025-01-29 13:47:33 -08:00
Tyler Michael Smith	73aa6cfdf7	Revert "[Build/CI] Fix libcuda.so linkage" (#12552 )	2025-01-29 21:12:24 +00:00
Jinzhen Lin	27b78c73ca	[Kernel] add triton fused moe kernel for gptq/awq (#12185 )	2025-01-29 09:07:09 -05:00
Pavani Majety	b02fd288b2	[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2025-01-29 01:46:12 -08:00
Yanyi Liu	ff7424f491	[Frontend] Support override generation config in args (#12409 ) Signed-off-by: liuyanyi <wolfsonliu@163.com>	2025-01-29 01:41:01 -08:00
Alphi	d93bf4da85	[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069 ) Signed-off-by: hzh <hezhihui_thu@163.com> Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> Signed-off-by: Akshat Tripathi <akshat@krai.ai> Signed-off-by: Oleg Mosalov <oleg@krai.ai> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu> Signed-off-by: Chenguang Li <757486878@qq.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Shanshan Shen <467638484@qq.com> Signed-off-by: elijah <f1renze.142857@gmail.com> Signed-off-by: Yikun <yikunkero@gmail.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Konrad Zawora <kzawora@habana.ai> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com> Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com> Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: sixgod <evethwillbeok@outlook.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Rafael Vasquez 
<rafvasq21@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com> Co-authored-by: Oleg Mosalov <oleg@krai.ai> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com> Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com> Co-authored-by: Concurrensee <yida.wu@amd.com> Co-authored-by: Chenguang Li <757486878@qq.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Alex Brooks <alex.brooks@ibm.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Konrad Zawora <kzawora@habana.ai> Co-authored-by: TJian <tunjian1996@gmail.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com> Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com> Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2025-01-29 09:24:59 +00:00
Travis Johnson	036ca94c25	[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Wallas Santos <wallashss@ibm.com> Co-authored-by: Wallas Santos <wallashss@ibm.com>	2025-01-29 08:54:35 +00:00
Maximilien de Bayser	ef001d98ef	Fix the pydantic logging validator (#12420 ) Signed-off-by: Max de Bayser <mbayser@br.ibm.com>	2025-01-29 07:53:13 +00:00
Robert Shaw	5f671cb4c3	[V1] Improve Error Message for Unsupported Config (#12535 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2025-01-29 04:56:56 +00:00
Michael Goin	bd02164cf9	Bugfix for whisper quantization due to fake k_proj bias (#12524 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2025-01-29 04:49:03 +00:00
Mark McLoughlin	46fb056749	[V1][Metrics] Add TTFT and TPOT histograms (#12530 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-29 04:11:16 +00:00
Harry Mellor	dd6a3a02cb	[Doc] Convert docs to use colon fences (#12471 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-29 11:38:29 +08:00
Ce Gao	a7e3eba66f	[Frontend] Support reasoning content for deepseek r1 (#12473 ) Signed-off-by: Ce Gao <cegao@tensorchord.ai> Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Michael Goin <mgoin@redhat.com>	2025-01-29 11:38:08 +08:00
Michael Goin	fbb5bd4cef	[TPU] Add example for profiling TPU inference (#12531 ) Signed-off-by: mgoin <mgoin@redhat.com>	2025-01-29 03:16:47 +00:00
fenghuizhang	80fcc3ed1c	[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482 ) Signed-off-by: Fenghui Zhang <fhzhang@google.com>	2025-01-28 22:36:44 +00:00
Mark McLoughlin	c386c43ca3	[V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2025-01-28 22:07:22 +00:00
Harry Mellor	f26d790718	Do not run `suggestion` `pre-commit` hook multiple times (#12521 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-01-28 20:05:27 +00:00
Michael Goin	0f657bdc52	Replace missed warning_once for rerank API (#12472 ) Signed-off-by: mgoin <michael@neuralmagic.com>	2025-01-28 19:06:32 +00:00

4391 Commits