229 Commits

Author SHA1 Message Date
Kyle Mistele
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649)
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-04 13:18:13 -07:00
alexeykondrat
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-04 11:57:54 -07:00
Cody Yu
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests (#8131) 2024-09-04 18:53:25 +00:00
TimWang
ccd7207191
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env (#8103) 2024-09-03 23:17:05 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Michael Goin
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961) 2024-08-28 19:19:17 -04:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
alexeykondrat
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237) 2024-08-27 10:09:13 -07:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
youkaichao
8c6f694a79
[ci] refine dependency for distributed tests (#7776) 2024-08-22 00:54:15 -07:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
William Lin
5844017285
[ci] [multi-step] narrow multi-step test dependency paths (#7760) 2024-08-21 15:52:40 -07:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
Ronen Schaffer
2aa00d59ad
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer (#7266)
2024-08-20 10:02:21 -07:00
Kuntai Du
3d8a5f063d
[CI] Organizing performance benchmark files (#7616) 2024-08-19 22:43:54 -07:00
William Lin
47b65a5508
[core] Multi Step Scheduling (#7000)
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available (#7137)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models (#7475)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests (#7600) 2024-08-16 20:49:30 -07:00
Alexei-V-Ivanov-AMD
6bd19551b0
[Build/CI] Enabling passing AMD tests. (#7610) 2024-08-16 20:25:32 -07:00
Michael Goin
44f26a9466
[Model] Align nemotron config with final HF state and fix lm-eval-small (#7611) 2024-08-16 15:56:34 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
2024-08-16 13:46:01 -07:00
Kuntai Du
6fc5b0f249
[CI] Fix crashes of performance benchmark (#7500) 2024-08-16 08:08:45 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail (#7572) 2024-08-15 19:39:04 -07:00
PHILO-HE
f4da5f7b6d
[Misc] Update dockerfile for CPU to cover protobuf installation (#7182) 2024-08-15 10:03:01 -07:00
youkaichao
d3d9cb6e4b
[ci] fix model tests (#7507) 2024-08-14 01:01:43 -07:00
Cyrus Leung
dd164d72f3
[Bugfix][Docs] Update list of mock imports (#7493) 2024-08-13 20:37:30 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins (#7489) 2024-08-13 19:27:46 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00
Dipika Sikka
181abbc27d
[Misc] Update LM Eval Tolerance (#7473) 2024-08-13 14:28:14 -04:00
Kevin H. Luu
65950e8f58
[ci] Entrypoints run upon changes in vllm/ (#7423)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:18:03 -07:00
Lily Liu
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 (#7319) 2024-08-12 07:59:17 +00:00
Kevin H. Luu
469b3bc538
[ci] Make building wheels per commit optional (#7278)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-07 11:34:25 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Dipika Sikka
a3bbbfa1d8
[BugFix] Fix DeepSeek remote code (#7178) 2024-08-06 08:16:53 -07:00
Simon Mo
e3c664bfcb
[Build] Add initial conditional testing spec (#6841) 2024-08-05 17:39:22 -07:00
Kuntai Du
67d745cc68
[CI] Temporarily turn off H100 performance benchmark (#7104) 2024-08-02 23:52:44 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands (#7097)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
omkar kakarparthi
562e580abc
Update run-amd-test.sh (#7044) 2024-08-01 13:12:37 -07:00
Sage Moore
7e0861bd0b
[CI/Build] Update PyTorch to 2.4.0 (#6951)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-01 11:11:24 -07:00
Alexei-V-Ivanov-AMD
a72a424b3e
[Build/CI] Fixing Docker Hub quota issue. (#7043) 2024-08-01 11:07:37 -07:00
HandH1998
6512937de1
Support W4A8 quantization for vllm (#5218) 2024-07-31 07:55:21 -06:00
Cyrus Leung
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) 2024-07-31 10:38:45 +08:00
Cade Daniel
c32ab8be1a
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding (#6964) 2024-07-31 00:53:21 +00:00