161 Commits

Author SHA1 Message Date
vlsav
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997) 2024-10-01 11:07:06 -07:00
Chen Zhang
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891) 2024-09-28 18:51:22 +00:00
Peter Pan
0e088750af
[MISC] Fix invalid escape sequence '\' (#8830)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2024-09-27 01:13:25 -07:00
Cyrus Leung
3b00b9c26c
[Core] renamePromptInputs and inputs (#8876) 2024-09-26 20:35:15 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility (#8760) (#8810) 2024-09-25 10:36:26 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility (#8760) 2024-09-25 09:36:47 -07:00
Archit Patke
6da1ab6b41
[Core] Adding Priority Scheduling (#5958) 2024-09-24 19:50:50 -07:00
Simon Mo
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" (#8750) 2024-09-24 05:45:20 +00:00
youkaichao
0250dd68c5
re-implement beam search on top of vllm core (#8726)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
2024-09-23 22:08:12 -07:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Cyrus Leung
0057894ef7
[Core] Rename PromptInputs and inputs(#8673) 2024-09-20 19:00:54 -07:00
Kunshang Ji
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py (#8615) 2024-09-18 22:33:20 -07:00
Kuntai Du
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) 2024-09-19 05:24:24 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Isotr0py
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) 2024-09-17 07:34:27 +00:00
Aarni Koskela
8baa454937
[Misc] Move device options to a single place (#8322) 2024-09-11 13:25:58 -07:00
Wei-Sheng Chin
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241) 2024-09-06 20:18:16 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py (#8191) 2024-09-06 09:01:14 -07:00
Cody Yu
77d9e514a2
[MISC] Replace input token throughput with total token throughput (#8164)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-04 20:23:22 +00:00
Nick Hill
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py (#7964) 2024-09-03 20:57:41 -04:00
Wei-Sheng Chin
0c785d344d
Add more percentiles and latencies (#7759) 2024-08-29 16:48:11 -07:00
Philipp Schmid
345be0e244
[benchmark] Update TGI version (#7917) 2024-08-27 15:07:53 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
Jiaxin Shan
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool (#6568) 2024-08-22 09:32:02 -07:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
William Lin
dd53c4b023
[misc] Add Torch profiler support (#7451)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-21 15:39:26 -07:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) 2024-08-20 07:09:33 -06:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Roger Wang
70d268a399
[Bugfix] Fix ITL recording in serving benchmark (#7372) 2024-08-09 10:00:00 -07:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
Varun Sundar Rabindranath
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Joe
14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Woosuk Kwon
a9a2e74d21
[Misc] Use torch.Tensor for type annotation (#6505) 2024-07-17 13:01:10 +00:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00
Fish
ccb20db8bd
[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428) 2024-07-14 19:27:01 -07:00
Ethan Xu
dbfe254eda
[Feature] vLLM CLI (#5090)
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-07-14 15:36:43 -07:00
Kuntai Du
a4feba929b
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362) 2024-07-11 13:28:38 -07:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities (#6082)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-07-11 15:40:11 +00:00
Roger Wang
c4774eb841
[Bugfix] Fix snapshot download in serving benchmark (#6318) 2024-07-11 07:04:05 +00:00
Haichuan
717f4bcea0
Feature/add benchmark testing (#5947)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-08 07:52:06 +00:00
Haichuan
333306a252
add benchmark for fix length input and output (#5857)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-07 07:42:13 +00:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) 2024-07-01 20:10:37 -07:00
James Whedbee
e373853e12
[Frontend] Relax api url assertion for openai benchmarking (#6046) 2024-07-01 23:39:10 +00:00
zhyncs
bb60326836
[Misc] update benchmark backend for scalellm (#6018) 2024-07-01 10:20:33 -07:00