2318 Commits

Author SHA1 Message Date
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) 2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
jack
67d115db08
[Bugfix][Frontend] Disable embedding API for chat models (#7504)
Co-authored-by: jack <jack@alex>
2024-08-14 09:15:19 -07:00
youkaichao
d3d9cb6e4b
[ci] fix model tests (#7507) 2024-08-14 01:01:43 -07:00
Chang Su
c134a46402
Fix empty output when temp is too low (#2937)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-08-14 05:31:44 +00:00
youkaichao
199adbb7cf
[doc] update test script to include cudagraph (#7501) 2024-08-13 21:52:58 -07:00
Cyrus Leung
dd164d72f3
[Bugfix][Docs] Update list of mock imports (#7493) 2024-08-13 20:37:30 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins (#7489) 2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests (#7396) 2024-08-13 17:27:29 -07:00
Woosuk Kwon
59edd0f134
[Bugfix][CI] Import ray under guard (#7486) 2024-08-13 17:12:58 -07:00
Woosuk Kwon
a08df8322e
[TPU] Support multi-host inference (#7457) 2024-08-13 16:31:20 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse (#7277) 2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process (#7484) 2024-08-13 15:40:17 -07:00
Simon Mo
c5c7768264
Announce NVIDIA Meetup (#7483)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-13 14:28:36 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters (#7422) 2024-08-13 17:08:20 -04:00
Dipika Sikka
d3bdfd3ab9
[Misc] Update Fused MoE weight loading (#7334) 2024-08-13 14:57:45 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00
Dipika Sikka
181abbc27d
[Misc] Update LM Eval Tolerance (#7473) 2024-08-13 14:28:14 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Woosuk Kwon
e20233d361
Revert "[Doc] Update supported_hardware.rst (#7276)" (#7467) 2024-08-13 01:37:08 -07:00
Woosuk Kwon
d6e634f3d7
[TPU] Suppress import custom_ops warning (#7458) 2024-08-13 00:30:30 -07:00
youkaichao
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) 2024-08-13 00:16:42 -07:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Kevin H. Luu
5469146bcc
[ci] Remove fast check cancel workflow (#7455) 2024-08-12 21:19:51 -07:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements (#7417) 2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
Daniele
774cd1d3bf
[CI/Build] bump minimum cmake version (#6999) 2024-08-12 16:29:20 -07:00
sasha0552
91294d56e1
[Bugfix] Handle PackageNotFoundError when checking for xpu version (#7398) 2024-08-12 16:07:20 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Cyrus Leung
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments (#7416) 2024-08-12 14:14:14 -07:00
Lucas Wilkinson
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels (#7323) 2024-08-12 14:40:13 -04:00
Kevin H. Luu
1137f343aa
[ci] Cancel fastcheck when PR is ready (#7433)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:59:14 -07:00
Kevin H. Luu
9b3e2edd30
[ci] Cancel fastcheck run when PR is marked ready (#7427)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:56:52 -07:00
Kevin H. Luu
65950e8f58
[ci] Entrypoints run upon changes in vllm/ (#7423)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:18:03 -07:00
Woosuk Kwon
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend (#7425) 2024-08-12 09:58:28 -07:00
Daniele
d2bc4510a4
[CI/Build] bump Dockerfile.neuron image base, use public ECR (#6832) 2024-08-12 09:53:35 -07:00
Cyrus Leung
24154f8618
[Frontend] Disallow passing model as both argument and option (#7347) 2024-08-12 12:58:34 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input (#6613) 2024-08-12 16:16:06 +08:00
Lily Liu
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 (#7319) 2024-08-12 07:59:17 +00:00
Roger Wang
86ab567bae
[CI/Build] Minor refactoring for vLLM assets (#7407) 2024-08-12 02:41:52 +00:00
Simon Mo
f020a6297e
[Docs] Update readme (#7316) 2024-08-11 17:13:37 -07:00
youkaichao
6c8e595710
[misc] add commit id in collect env (#7405) 2024-08-11 15:40:48 -07:00
tomeras91
02b1988b9f
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty (#7403) 2024-08-11 14:38:17 -07:00
tomeras91
386087970a
[CI/Build] build on empty device for better dev experience (#4773) 2024-08-11 13:09:44 -07:00
William Lin
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step (#7387) 2024-08-11 08:50:08 -07:00
Noam Gat
4fb7b52a2c
Updating LM Format Enforcer version to v0.10.6 (#7189) 2024-08-11 08:11:50 -04:00
Woosuk Kwon
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time (#7340) 2024-08-10 18:12:22 -07:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) 2024-08-10 16:19:33 +00:00