Kuntai Du
|
fbb74420e7
|
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412)
|
2024-10-04 14:01:44 -07:00 |
|
vlsav
|
22f5851b80
|
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997)
|
2024-10-01 11:07:06 -07:00 |
|
Chen Zhang
|
e585b583a9
|
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891)
|
2024-09-28 18:51:22 +00:00 |
|
Peter Pan
|
0e088750af
|
[MISC] Fix invalid escape sequence '\' (#8830)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
|
2024-09-27 01:13:25 -07:00 |
|
Cyrus Leung
|
3b00b9c26c
|
[Core] rename PromptInputs and inputs (#8876)
|
2024-09-26 20:35:15 -07:00 |
|
Simon Mo
|
4f1ba0844b
|
Revert "rename PromptInputs and inputs with backward compatibility (#8760)" (#8810)
|
2024-09-25 10:36:26 -07:00 |
|
Cyrus Leung
|
28e1299e60
|
rename PromptInputs and inputs with backward compatibility (#8760)
|
2024-09-25 09:36:47 -07:00 |
|
Archit Patke
|
6da1ab6b41
|
[Core] Adding Priority Scheduling (#5958)
|
2024-09-24 19:50:50 -07:00 |
|
Simon Mo
|
3185fb0cca
|
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" (#8750)
|
2024-09-24 05:45:20 +00:00 |
|
youkaichao
|
0250dd68c5
|
re-implement beam search on top of vllm core (#8726)
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>
|
2024-09-23 22:08:12 -07:00 |
|
Lucas Wilkinson
|
86e9c8df29
|
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-09-23 13:46:26 -04:00 |
|
Cyrus Leung
|
0057894ef7
|
[Core] Rename PromptInputs and inputs (#8673)
|
2024-09-20 19:00:54 -07:00 |
|
Kunshang Ji
|
855c8ae2c9
|
[MISC] remove engine_use_ray in benchmark_throughput.py (#8615)
|
2024-09-18 22:33:20 -07:00 |
|
Kuntai Du
|
c52ec5f034
|
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616)
|
2024-09-19 05:24:24 +00:00 |
|
Aaron Pham
|
9d104b5beb
|
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-09-18 11:00:56 +00:00 |
|
Cyrus Leung
|
6ffa3f314c
|
[CI/Build] Avoid CUDA initialization (#8534)
|
2024-09-18 10:38:11 +00:00 |
|
Isotr0py
|
1b6de8352b
|
[Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495)
|
2024-09-17 07:34:27 +00:00 |
|
Aarni Koskela
|
8baa454937
|
[Misc] Move device options to a single place (#8322)
|
2024-09-11 13:25:58 -07:00 |
|
Wei-Sheng Chin
|
795b662cff
|
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241)
|
2024-09-06 20:18:16 -07:00 |
|
afeldman-nm
|
e5cab71531
|
[Frontend] Add --logprobs argument to benchmark_serving.py (#8191)
|
2024-09-06 09:01:14 -07:00 |
|
Cody Yu
|
77d9e514a2
|
[MISC] Replace input token throughput with total token throughput (#8164)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-09-04 20:23:22 +00:00 |
|
Nick Hill
|
d4db9f53c8
|
[Benchmark] Add --async-engine option to benchmark_throughput.py (#7964)
|
2024-09-03 20:57:41 -04:00 |
|
Wei-Sheng Chin
|
0c785d344d
|
Add more percentiles and latencies (#7759)
|
2024-08-29 16:48:11 -07:00 |
|
Philipp Schmid
|
345be0e244
|
[benchmark] Update TGI version (#7917)
|
2024-08-27 15:07:53 -07:00 |
|
Megha Agarwal
|
2eedede875
|
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
|
2024-08-26 20:53:20 -07:00 |
|
Alexander Matveev
|
9db93de20c
|
[Core] Add multi-step support to LLMEngine (#7789)
|
2024-08-23 12:45:53 -07:00 |
|
Jiaxin Shan
|
d3b5b98021
|
[Misc] Enhance prefix-caching benchmark tool (#6568)
|
2024-08-22 09:32:02 -07:00 |
|
Luka Govedič
|
7937009a7e
|
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-08-21 20:18:00 -04:00 |
|
William Lin
|
dd53c4b023
|
[misc] Add Torch profiler support (#7451)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-08-21 15:39:26 -07:00 |
|
Lucas Wilkinson
|
5288c06aa0
|
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
|
2024-08-20 07:09:33 -06:00 |
|
Mor Zusman
|
7fc23be81c
|
[Kernel] W8A16 Int8 inside FusedMoE (#7415)
|
2024-08-16 10:06:51 -07:00 |
|
Roger Wang
|
70d268a399
|
[Bugfix] Fix ITL recording in serving benchmark (#7372)
|
2024-08-09 10:00:00 -07:00 |
|
Luka Govedič
|
8d59dbb000
|
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-08-06 18:17:08 +00:00 |
|
Lucas Wilkinson
|
a8d604ca2a
|
[Misc] Disambiguate quantized types via a new ScalarType (#6396)
|
2024-08-02 13:51:58 -07:00 |
|
Varun Sundar Rabindranath
|
35e9c12bfa
|
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-31 14:40:32 -07:00 |
|
Varun Sundar Rabindranath
|
766435e660
|
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-29 09:42:35 -06:00 |
|
Alexander Matveev
|
75acdaa4b6
|
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795)
|
2024-07-27 17:52:33 -04:00 |
|
Joe
|
14dbd5a767
|
[Model] H2O Danube3-4b (#6451)
|
2024-07-26 20:47:50 -07:00 |
|
Cyrus Leung
|
739b61a348
|
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-07-22 10:13:53 -07:00 |
|
Woosuk Kwon
|
a9a2e74d21
|
[Misc] Use torch.Tensor for type annotation (#6505)
|
2024-07-17 13:01:10 +00:00 |
|
Michael Goin
|
978aed5300
|
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081)
|
2024-07-16 15:31:32 -07:00 |
|
Fish
|
ccb20db8bd
|
[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428)
|
2024-07-14 19:27:01 -07:00 |
|
Ethan Xu
|
dbfe254eda
|
[Feature] vLLM CLI (#5090)
Co-authored-by: simon-mo <simon.mo@hey.com>
|
2024-07-14 15:36:43 -07:00 |
|
Kuntai Du
|
a4feba929b
|
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
|
2024-07-11 13:28:38 -07:00 |
|
Robert Shaw
|
b675069d74
|
[Misc] Refactor Marlin Python Utilities (#6082)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
|
2024-07-11 15:40:11 +00:00 |
|
Roger Wang
|
c4774eb841
|
[Bugfix] Fix snapshot download in serving benchmark (#6318)
|
2024-07-11 07:04:05 +00:00 |
|
Haichuan
|
717f4bcea0
|
Feature/add benchmark testing (#5947)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-07-08 07:52:06 +00:00 |
|
Haichuan
|
333306a252
|
add benchmark for fix length input and output (#5857)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-07-07 07:42:13 +00:00 |
|
Alexander Matveev
|
3476ed0809
|
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602)
|
2024-07-01 20:10:37 -07:00 |
|
James Whedbee
|
e373853e12
|
[Frontend] Relax api url assertion for openai benchmarking (#6046)
|
2024-07-01 23:39:10 +00:00 |
|