b3f4e17935 | 2024-08-16 13:59:16 -07:00 | Michael Goin | [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444)

93478b63d2 | 2024-08-16 13:46:01 -07:00 | Mahesh Keralapura | [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)

f366f6339b | 2024-08-16 11:41:56 -07:00 | William Lin | [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571)
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

855866caa9 | 2024-08-16 11:37:01 -07:00 | Michael Goin | [Kernel] Add tuned triton configs for ExpertsInt8 (#7601)

7fc23be81c | 2024-08-16 10:06:51 -07:00 | Mor Zusman | [Kernel] W8A16 Int8 inside FusedMoE (#7415)

e837b624f2 | 2024-08-16 10:06:30 -07:00 | Charlie Fu | [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)

ec724a725e | 2024-08-16 09:17:50 -07:00 | fzyzcjy | support tqdm in notebooks (#7510)

0e39a33c6d | 2024-08-16 10:05:18 -06:00 | Gordon Wong | [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method (#7513)

6fc5b0f249 | 2024-08-16 08:08:45 -07:00 | Kuntai Du | [CI] Fix crashes of performance benchmark (#7500)

9587b050fb | 2024-08-15 22:48:07 -07:00 | Nick Hill | [Core] Use uvloop with zmq-decoupled front-end (#7570)

54bd9a03c4 | 2024-08-15 22:38:56 -07:00 | youkaichao | register custom op for flash attn and use from torch.ops (#7536)

50b8d08dbd | 2024-08-16 04:24:04 +00:00 | jon-chuang | [Misc/Testing] Use torch.testing.assert_close (#7324)

e165528778 | 2024-08-15 21:16:20 -07:00 | Michael Goin | [CI] Move quantization cpu offload tests out of fastcheck (#7574)

3b19e39dc5 | 2024-08-15 19:41:34 -07:00 | nunjunj | Chat method for offline llm (#5049)
    Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
    Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
    Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

4cd7d47fed | 2024-08-15 19:39:04 -07:00 | youkaichao | [ci/test] rearrange tests and make adag test soft fail (#7572)

f878c8feb0 | 2024-08-16 02:38:08 +00:00 | Grant Pinkert | [Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453)

b67ae00cdb | 2024-08-15 19:34:28 -07:00 | shangmingc | [Misc] Add quantization config support for speculative model. (#7343)

9c8e2d1161 | 2024-08-15 18:26:19 -07:00 | Michael Goin | [Bugfix][Harmless] Fix float16 dtype for model_is_embedding (#7566)

21313e09e3 | 2024-08-15 13:10:22 -07:00 | Michael Goin | [Bugfix] Fix default weight loading for scalars (#7534)

f4da5f7b6d | 2024-08-15 10:03:01 -07:00 | PHILO-HE | [Misc] Update dockerfile for CPU to cover protobuf installation (#7182)

9c1f78d5d6 | 2024-08-15 09:44:14 -07:00 | omrishiv | [Bugfix] update neuron for version > 0.5.0 (#7175)
    Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

fc93e56143 | 2024-08-15 00:02:29 -07:00 | Woosuk Kwon | [Bugfix][TPU] Correct env variable for XLA cache path (#7544)

22b39e11f2 | 2024-08-14 15:38:37 -07:00 | Kameshwara Pavan Kumar Mantha | llama_index serving integration documentation (#6973)
    Co-authored-by: pavanmantha <pavan.mantha@thevaslabs.io>

f55a9aea45 | 2024-08-14 15:07:37 -07:00 | Kyle Sayers | [Misc] Revert compressed-tensors code reuse (#7521)

951fdd66d3 | 2024-08-14 14:47:51 -07:00 | Woosuk Kwon | [TPU] Set per-rank XLA cache (#7533)

2ecf7b1757 | 2024-08-14 12:32:45 -07:00 | William Lin | [core] [3/N] multi-step args and sequence.py (#7452)

3f674a49b5 | 2024-08-14 17:55:42 +00:00 | Cyrus Leung | [VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126)

70b746efcf | 2024-08-14 09:44:27 -07:00 | Wallas Henrique | [Misc] Deprecation Warning when setting --engine-use-ray (#7424)
    Signed-off-by: Wallas Santos <wallashss@ibm.com>
    Co-authored-by: youkaichao <youkaichao@gmail.com>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
    Co-authored-by: youkaichao <youkaichao@126.com>

67d115db08 | 2024-08-14 09:15:19 -07:00 | jack | [Bugfix][Frontend] Disable embedding API for chat models (#7504)
    Co-authored-by: jack <jack@alex>

d3d9cb6e4b | 2024-08-14 01:01:43 -07:00 | youkaichao | [ci] fix model tests (#7507)

c134a46402 | 2024-08-14 05:31:44 +00:00 | Chang Su | Fix empty output when temp is too low (#2937)
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

199adbb7cf | 2024-08-13 21:52:58 -07:00 | youkaichao | [doc] update test script to include cudagraph (#7501)

dd164d72f3 | 2024-08-13 20:37:30 -07:00 | Cyrus Leung | [Bugfix][Docs] Update list of mock imports (#7493)

ea49e6a3c8 | 2024-08-13 19:27:46 -07:00 | youkaichao | [misc][ci] fix cpu test with plugins (#7489)

97992802f3 | 2024-08-13 17:27:29 -07:00 | Jee Jee Li | [CI/Build] Reduce the time consumption for LoRA tests (#7396)

59edd0f134 | 2024-08-13 17:12:58 -07:00 | Woosuk Kwon | [Bugfix][CI] Import ray under guard (#7486)

a08df8322e | 2024-08-13 16:31:20 -07:00 | Woosuk Kwon | [TPU] Support multi-host inference (#7457)

16422ea76f | 2024-08-13 16:24:17 -07:00 | youkaichao | [misc][plugin] add plugin system implementation (#7426)

373538f973 | 2024-08-13 19:05:15 -04:00 | Kyle Sayers | [Misc] compressed-tensors code reuse (#7277)

33e5d7e6b6 | 2024-08-13 15:40:17 -07:00 | youkaichao | [frontend] spawn engine process from api server process (#7484)

c5c7768264 | 2024-08-13 14:28:36 -07:00 | Simon Mo | Announce NVIDIA Meetup (#7483)
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

b1e5afc3e7 | 2024-08-13 17:08:20 -04:00 | Dipika Sikka | [Misc] Update awq and awq_marlin to use vLLMParameters (#7422)

d3bdfd3ab9 | 2024-08-13 14:57:45 -04:00 | Dipika Sikka | [Misc] Update Fused MoE weight loading (#7334)

fb377d7e74 | 2024-08-13 14:30:11 -04:00 | Dipika Sikka | [Misc] Update gptq_marlin to use new vLLMParameters (#7281)

181abbc27d | 2024-08-13 14:28:14 -04:00 | Dipika Sikka | [Misc] Update LM Eval Tolerance (#7473)

00c3d68e45 | 2024-08-13 17:39:33 +00:00 | Peter Salas | [Frontend][Core] Add plumbing to support audio language models (#7446)

e20233d361 | 2024-08-13 01:37:08 -07:00 | Woosuk Kwon | Revert "[Doc] Update supported_hardware.rst (#7276)" (#7467)

d6e634f3d7 | 2024-08-13 00:30:30 -07:00 | Woosuk Kwon | [TPU] Suppress import custom_ops warning (#7458)

4d2dc5072b | 2024-08-13 00:16:42 -07:00 | youkaichao | [hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102)

7025b11d94 | 2024-08-13 05:33:41 +00:00 | Cyrus Leung | [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410)