youkaichao | 1b44aaf4e3 | [bugfix][distributed] fix 16 gpus local rank arrangement (#5604) | 2024-06-17 21:35:04 +00:00

Kunshang Ji | 728c4c8a06 | [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) | 2024-06-17 11:01:25 -07:00
    Co-authored-by: Jiang Li <jiang1.li@intel.com>
    Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
    Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>

Dipika Sikka | 890d8d960b | [Kernel] compressed-tensors marlin 24 support (#5435) | 2024-06-17 12:32:48 -04:00

Charles Riggins | 9e74d9d003 | Correct alignment in the seq_len diagram. (#5592) | 2024-06-17 12:05:33 -04:00
    Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>

Amit Garg | 9333fb8eb9 | [Model] Rename Phi3 rope scaling type (#5595) | 2024-06-17 12:04:14 -04:00

zifeitong | 3ce2c050dd | [Fix] Correct OpenAI batch response format (#5554) | 2024-06-15 16:57:54 -07:00

Nick Hill | 1c0afa13c5 | [BugFix] Don't start a Ray cluster when not using Ray (#5570) | 2024-06-15 16:30:51 -07:00
SangBin Cho | e691918e3b | [misc] Do not allow to use lora with chunked prefill. (#5538) | 2024-06-15 14:59:36 +00:00
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

Cyrus Leung | 0e9164b40a | [mypy] Enable type checking for test directory (#5017) | 2024-06-15 04:45:31 +00:00

leiwen83 | 1b8a0d71cf | [Core][Bugfix]: fix prefix caching for blockv2 (#5364) | 2024-06-14 17:23:56 -07:00
    Signed-off-by: Lei Wen <wenlei03@qiyi.com>
    Co-authored-by: Lei Wen <wenlei03@qiyi.com>

youkaichao | f5bb85b435 | [Core][Distributed] improve p2p cache generation (#5528) | 2024-06-14 14:47:45 -07:00

Woosuk Kwon | 28c145eb57 | [Bugfix] Fix typo in Pallas backend (#5558) | 2024-06-14 14:40:09 -07:00

Thomas Parnell | e2afb03c92 | [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) | 2024-06-14 20:28:11 +00:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Sanger Steel | 6e2527a7cb | [Doc] Update documentation on Tensorizer (#5471) | 2024-06-14 11:27:57 -07:00
youkaichao | d1c3d7d139 | [misc][distributed] fix benign error in is_in_the_same_node (#5512) | 2024-06-14 10:59:28 -07:00

Cyrus Leung | 77490c6f2f | [Core] Remove duplicate processing in async engine (#5525) | 2024-06-14 10:04:42 -07:00

Robert Shaw | 15985680e2 | [ Misc ] Rs/compressed tensors cleanup (#5432) | 2024-06-14 10:01:46 -07:00
    Co-authored-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>

Tyler Michael Smith | 703475f6c2 | [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) | 2024-06-14 09:30:15 -07:00

Simon Mo | 0f0d8bc065 | bump version to v0.5.0.post1 (#5522) | 2024-06-13 19:42:06 -07:00

Antoni Baum | 50eed24d25 | Add cuda_device_count_stateless (#5473) | 2024-06-13 16:06:49 -07:00

Tyler Michael Smith | e38042d4af | [Kernel] Disable CUTLASS kernels for fp8 (#5505) | 2024-06-13 13:38:05 -07:00

Antoni Baum | 6b0511a57b | Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) | 2024-06-13 11:22:50 -07:00
Cody Yu | 30299a41fa | [MISC] Remove FP8 warning (#5472) | 2024-06-13 11:22:30 -07:00
    Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>

Tyler Michael Smith | 85657b5607 | [Kernel] Factor out epilogues from cutlass kernels (#5391) | 2024-06-13 11:22:19 -07:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
    Co-authored-by: youkaichao <youkaichao@gmail.com>
    Co-authored-by: zifeitong <zifei.tong@parasail.io>
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

Cyrus Leung | 0ce7b952f8 | [Doc] Update LLaVA docs (#5437) | 2024-06-13 11:22:07 -07:00
    Co-authored-by: Roger Wang <ywang@roblox.com>

Cyrus Leung | 03dccc886e | [Misc] Add vLLM version getter to utils (#5098) | 2024-06-13 11:21:39 -07:00

Li, Jiang | 80aa7e91fc | [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971) | 2024-06-13 09:33:14 -07:00
    Co-authored-by: Jianan Gu <jianan.gu@intel.com>
wenyujin333 | bd43973522 | [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) | 2024-06-13 09:01:10 -07:00
    Tune Qwen2-57B-A14B configs based on #4921.
    Throughput on A100, measured with:
    python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2

    benchmark | no config                           | w/ PR
    tp=2      | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s
    tp=4      | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s
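As a quick sanity check on the tp2/tp4 entry above, the reported requests/s figures imply roughly an 18% gain at tp=2 and a 14% gain at tp=4. A minimal script (the numbers are copied from the benchmark table, not re-measured):

```python
# Speedup implied by the Qwen2-57B-A14B throughput numbers reported in #5497
# (requests/s without tuned configs vs. with the PR's tuned configs).
baseline = {"tp=2": 10.53, "tp=4": 17.77}  # requests/s, no config
tuned = {"tp=2": 12.47, "tp=4": 20.20}     # requests/s, w/ PR

for tp in baseline:
    gain = (tuned[tp] / baseline[tp] - 1) * 100
    print(f"{tp}: +{gain:.1f}% throughput")
# tp=2: +18.4% throughput
# tp=4: +13.7% throughput
```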
Dipika Sikka | c2637a613b | [Kernel] w4a16 support for compressed-tensors (#5385) | 2024-06-13 10:19:56 -04:00
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

youkaichao | ea3890a5f0 | [Core][Distributed] code deduplication in tp&pp with coordinator (#5293) | 2024-06-12 17:27:08 -07:00
Isotr0py | 2135cacb45 | [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451) | 2024-06-12 16:20:18 -07:00

Michael Goin | 7d19de2e9c | [Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425) | 2024-06-12 18:42:12 -04:00

Michael Goin | 94a07bbdd8 | [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) | 2024-06-12 21:59:44 +00:00

youkaichao | 622d45128c | [misc] add hint for AttributeError (#5462) | 2024-06-12 21:46:35 +00:00

Travis Johnson | 51602eefd3 | [Frontend] [Core] Support for sharded tensorized models (#4990) | 2024-06-12 14:13:52 -07:00
    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
    Co-authored-by: Sanger Steel <sangersteel@gmail.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>

Arthur Kim | 5cc50a531f | [Bugfix] TYPE_CHECKING for MultiModalData (#5444) | 2024-06-12 14:08:52 -07:00

Li, Jiang | c3c2903e72 | [Bugfix] Add device assertion to TorchSDPA (#5402) | 2024-06-12 12:58:53 -07:00

Woosuk Kwon | 1a8bfd92d5 | [Hardware] Initial TPU integration (#5292) | 2024-06-12 11:53:03 -07:00
Nick Hill | 99dac099ab | [Core][Doc] Default to multiprocessing for single-node distributed case (#5230) | 2024-06-11 11:10:41 -07:00
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

youkaichao | c4bd03c7c5 | [Core][Distributed] add same-node detection (#5369) | 2024-06-11 10:53:59 -07:00

sasha0552 | dcbf4286af | [Frontend] Customizable RoPE theta (#5197) | 2024-06-11 10:42:26 -07:00

Ali Panahi | 00e6a2dc53 | [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) | 2024-06-11 10:40:23 -07:00

Junichi Sato | 2e02311a1b | [Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 (#5254) | 2024-06-11 10:38:07 -07:00

Woosuk Kwon | 8bab4959be | [Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389) | 2024-06-11 00:37:56 -07:00

Cyrus Leung | 640052b069 | [Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026) | 2024-06-10 22:36:46 -07:00

maor-ps | 351d5e7b82 | [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) | 2024-06-11 10:30:31 +08:00
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Nick Hill | a008629807 | [Misc] Various simplifications and typing fixes (#5368) | 2024-06-11 10:29:02 +08:00

Simon Mo | 114332b88e | Bump version to v0.5.0 (#5384) | 2024-06-10 15:56:06 -07:00

Cyrus Leung | 2c0d933594 | [Bugfix] Fix LLaVA-NeXT (#5380) | 2024-06-10 15:38:47 +00:00

Itay Etelis | 774d1035e4 | [Feature][Frontend]: Continued stream_options implementation also in CompletionRequest (#5319) | 2024-06-10 14:22:09 +00:00