Woosuk Kwon
cbc53b6b8d
[Hardware][TPU] Support parallel sampling & Swapping ( #5855 )
2024-06-26 11:07:49 -07:00
sasha0552
c54269d967
[Frontend] Add tokenize/detokenize endpoints ( #5054 )
2024-06-26 16:54:22 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm ( #5560 )
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Woosuk Kwon
3439c5a8e3
[Bugfix][TPU] Fix KV cache size calculation ( #5860 )
2024-06-26 00:58:23 -07:00
Woosuk Kwon
6806998bf9
[Bugfix] Fix embedding to support 2D inputs ( #5829 )
2024-06-26 00:15:22 -07:00
youkaichao
515080ad2f
[bugfix][distributed] fix shm broadcast when the queue size is full ( #5801 )
2024-06-25 21:56:02 -07:00
Roger Wang
3aa7b6cf66
[Misc][Doc] Add Example of using OpenAI Server with VLM ( #5832 )
2024-06-25 20:34:25 -07:00
Stephanie Wang
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
2024-06-25 20:30:03 -07:00
aws-patlange
82079729cc
[Bugfix] Fix assertion in NeuronExecutor ( #5841 )
2024-06-25 19:52:10 -07:00
Woosuk Kwon
f178e56c68
[Hardware][TPU] Raise errors for unsupported sampling params ( #5850 )
2024-06-25 16:58:23 -07:00
Matt Wong
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes ( #5422 )
2024-06-25 15:56:15 -07:00
Woosuk Kwon
bc34937d68
[Hardware][TPU] Refactor TPU backend ( #5831 )
2024-06-25 15:25:52 -07:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Antoni Baum
67882dbb44
[Core] Add fault tolerance for RayTokenizerGroupPool ( #5748 )
2024-06-25 10:15:10 -07:00
Jie Fu (傅杰)
7b99314301
[Misc] Remove useless code in cpu_worker ( #5824 )
2024-06-25 09:41:36 -07:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00
Chang Su
ba991d5c84
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args ( #5795 )
2024-06-24 17:01:19 -06:00
Isotr0py
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00
Murali Andoorveedu
5d4d90536f
[Distributed] Add send and recv helpers ( #5719 )
2024-06-23 14:42:28 -07:00
youkaichao
832ea88fcb
[core][distributed] improve shared memory broadcast ( #5754 )
2024-06-22 10:00:43 -07:00
Woosuk Kwon
0cbc1d2b4f
[Bugfix] Fix pin_lora error in TPU executor ( #5760 )
2024-06-21 22:25:14 -07:00
zifeitong
ff9ddbceee
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py ( #5756 )
2024-06-22 03:33:12 +00:00
Jie Fu (傅杰)
9c62db07ed
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs ( #5710 )
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-22 02:07:08 +00:00
rohithkrn
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
youkaichao
d9a252bc8e
[Core][Distributed] add shm broadcast ( #5399 )
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-21 05:12:35 +00:00
Jee Li
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora ( #5665 )
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-21 04:46:28 +00:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
youkaichao
6c5b7af152
[distributed][misc] use fork by default for mp ( #5669 )
2024-06-20 17:06:34 -07:00
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
zifeitong
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run ( #5654 )
2024-06-19 13:57:12 -07:00
Michael Goin
afed90a034
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py ( #5688 )
2024-06-19 14:41:42 -04:00
Michael Goin
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 ( #5656 )
2024-06-19 09:38:26 +00:00
youkaichao
3eea74889f
[misc][distributed] use 127.0.0.1 for single-node ( #5619 )
2024-06-19 08:05:00 +00:00
Shukant Pal
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation ( #5628 )
2024-06-19 01:46:38 +00:00
Thomas Parnell
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties ( #5639 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-18 14:18:37 -07:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
A markdown guide with screenshots has also been added to show users how to use this feature.
2024-06-19 01:17:03 +09:00
Chang Su
f0cc0e68e3
[Misc] Remove import from transformers logging ( #5625 )
2024-06-18 12:12:19 +00:00
youkaichao
db5ec52ad7
[bugfix][distributed] improve p2p capability test ( #5612 )
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612 )
2024-06-18 07:21:05 +00:00
youkaichao
8eadcf0b90
[misc][typo] fix typo ( #5620 )
2024-06-17 20:54:57 -07:00
Isotr0py
daef218b55
[Model] Initialize Phi-3-vision support ( #4986 )
2024-06-17 19:34:33 -07:00
sroy745
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
zifeitong
26e1188e51
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py ( #5606 )
2024-06-17 23:16:10 +00:00
Bruce Fontaine
a3e8a05d4c
[Bugfix] Fix KV head calculation for MPT models when using GQA ( #5142 )
2024-06-17 15:26:41 -07:00
youkaichao
e441bad674
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids ( #5584 )
2024-06-17 22:08:05 +00:00