| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Simon Mo | 7eb0cb4a14 | Revert "[Frontend] Factor out code for running uvicorn" (#7012); Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> | 2024-07-31 16:34:26 -07:00 |
| Michael Goin | a0dce9383a | [Misc] Add compressed-tensors to optimized quant list (#7006) | 2024-07-31 14:40:44 -07:00 |
| Varun Sundar Rabindranath | 35e9c12bfa | [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-31 14:40:32 -07:00 |
| Varun Sundar Rabindranath | 93548eb37e | [Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-31 14:40:22 -07:00 |
| Michael Goin | 460c1884e3 | [Bugfix] Support cpu offloading with fp8 quantization (#6960) | 2024-07-31 12:47:46 -07:00 |
| Cody Yu | bd70013407 | [MISC] Introduce pipeline parallelism partition strategies (#6920); Co-authored-by: youkaichao <youkaichao@126.com> | 2024-07-31 12:02:17 -07:00 |
| Avshalom Manevich | 2ee8d3ba55 | [Model] use FusedMoE layer in Jamba (#6935) | 2024-07-31 12:00:24 -07:00 |
| Cyrus Leung | daed30c4a9 | [Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) | 2024-07-31 23:46:17 +08:00 |
| Alphi | 2f4e108f75 | [Bugfix] Clean up MiniCPM-V (#6939); Co-authored-by: hezhihui <hzh7269@modelbest.cn>; Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> | 2024-07-31 14:39:19 +00:00 |
| HandH1998 | 6512937de1 | Support W4A8 quantization for vllm (#5218) | 2024-07-31 07:55:21 -06:00 |
| Fei | c0644cf9ce | [Bugfix] Fix logit processor exceeding vocab size issue (#6927) | 2024-07-31 16:16:01 +08:00 |
| Woosuk Kwon | 533d1932d2 | [Bugfix][TPU] Set readonly=True for non-root devices (#6980) | 2024-07-31 00:19:28 -07:00 |
| Cyrus Leung | 9f0e69b653 | [CI/Build] Fix mypy errors (#6968) | 2024-07-30 19:49:48 -07:00 |
| Cyrus Leung | f230cc2ca6 | [Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) | 2024-07-31 10:38:45 +08:00 |
| Cyrus Leung | da1f7cc12a | [mypy] Enable following imports for some directories (#6681) | 2024-07-31 10:38:03 +08:00 |
| Cade Daniel | c32ab8be1a | [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding (#6964) | 2024-07-31 00:53:21 +00:00 |
| Cade Daniel | fb4f530bf5 | [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists (#6706) | 2024-07-30 16:28:49 -07:00 |
| Cade Daniel | 79319cedfa | [Nightly benchmarking suite] Remove pkill python from run benchmark suite (#6965) | 2024-07-30 16:28:05 -07:00 |
| Simon Mo | 40c27a7cbb | [Build] Temporarily Disable Kernels and LoRA tests (#6961) | 2024-07-30 14:59:48 -07:00 |
| youkaichao | 6ca8031e71 | [core][misc] improve free_finished_seq_groups (#6865); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-07-30 14:32:12 -07:00 |
| Tyler Michael Smith | d7a299edaa | [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) | 2024-07-30 16:37:01 -04:00 |
| Sanger Steel | 052b6f8ca4 | [Bugfix] Fix tensorizer memory profiling bug during testing (#6881) | 2024-07-30 11:48:50 -07:00 |
| Ilya Lavrenov | 5895b24677 | [OpenVINO] Updated OpenVINO requirements and build docs (#6948) | 2024-07-30 11:33:01 -07:00 |
| Tyler Michael Smith | cbbc904470 | [Kernel] Squash a few more warnings (#6914) | 2024-07-30 13:50:42 -04:00 |
| Nick Hill | 5cf9254a9c | [BugFix] Fix use of per-request seed with pipeline parallel (#6698) | 2024-07-30 10:40:08 -07:00 |
| fzyzcjy | f058403683 | [Doc] Super tiny fix doc typo (#6949) | 2024-07-30 09:14:03 -07:00 |
| Roger Wang | c66c7f86ac | [Bugfix] Fix PaliGemma MMP (#6930) | 2024-07-30 02:20:57 -07:00 |
| Woosuk Kwon | 6e063ea35b | [TPU] Fix greedy decoding (#6933) | 2024-07-30 02:06:29 -07:00 |
| Varun Sundar Rabindranath | af647fb8b3 | [Kernel] Tuned int8 kernels for Ada Lovelace (#6848); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-29 20:24:58 -06:00 |
| Tyler Michael Smith | 61a97c32f6 | [Kernel] Fix marlin divide-by-zero warnings (#6904) | 2024-07-30 01:26:07 +00:00 |
| Kevin H. Luu | 4fbf4aa128 | [ci] GHA workflow to remove ready label upon "/notready" comment (#6921); Signed-off-by: kevin <kevin@anyscale.com> | 2024-07-29 17:03:45 -07:00 |
| Tyler Michael Smith | aae6d36f7e | [Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908) | 2024-07-29 18:01:17 -06:00 |
| Nick Hill | 9f69d8245a | [Frontend] New allowed_token_ids decoding request parameter (#6753) | 2024-07-29 23:37:27 +00:00 |
| Thomas Parnell | 9a7e2d0534 | [Bugfix] Allow vllm to still work if triton is not installed. (#6786); Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2024-07-29 14:51:27 -07:00 |
| Earthwalker | 7f8d612d24 | [TPU] Support tensor parallelism in async llm engine (#6891) | 2024-07-29 12:42:21 -07:00 |
| Tyler Michael Smith | 60d1c6e584 | [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (#6901) | 2024-07-29 09:59:02 -07:00 |
| Peng Guanwen | db9e5708a9 | [Core] Reduce unnecessary compute when logprobs=None (#6532) | 2024-07-29 16:47:31 +00:00 |
| Varun Sundar Rabindranath | 766435e660 | [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-29 09:42:35 -06:00 |
| Isotr0py | 7cbd9ec7a9 | [Model] Initialize support for InternVL2 series models (#6514); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-07-29 10:16:30 +00:00 |
| Elsa Granger | 3eeb148f46 | [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (#6871) | 2024-07-28 11:13:49 -04:00 |
| Michael Goin | b1366a9534 | Add Nemotron to PP_SUPPORTED_MODELS (#6863) | 2024-07-27 15:05:17 -07:00 |
| Alexander Matveev | 75acdaa4b6 | [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) | 2024-07-27 17:52:33 -04:00 |
| Woosuk Kwon | fad5576c58 | [TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856) | 2024-07-27 10:28:33 -07:00 |
| Chenggang Wu | f954d0715c | [Docs] Add RunLLM chat widget (#6857) | 2024-07-27 09:24:46 -07:00 |
| Cyrus Leung | 1ad86acf17 | [Model] Initial support for BLIP-2 (#5920); Co-authored-by: ywang96 <ywang@roblox.com> | 2024-07-27 11:53:07 +00:00 |
| Roger Wang | ecb33a28cb | [CI/Build][Doc] Update CI and Doc for VLM example changes (#6860) | 2024-07-27 09:54:14 +00:00 |
| Wang Ran (汪然) | a57d75821c | [bugfix] make args.stream work (#6831) | 2024-07-27 09:07:02 +00:00 |
| Roger Wang | 925de97e05 | [Bugfix] Fix VLM example typo (#6859) | 2024-07-27 14:24:08 +08:00 |
| Roger Wang | aa46953a20 | [Misc][VLM][Doc] Consolidate offline examples for vision language models (#6858); Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> | 2024-07-26 22:44:13 -07:00 |
| Travis Johnson | 593e79e733 | [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor (#6802); Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> | 2024-07-26 22:15:20 -07:00 |