Jie Fu (傅杰)
9c62db07ed
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs ( #5710 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-22 02:07:08 +00:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
Michael Goin
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 ( #5656 )
2024-06-19 09:38:26 +00:00
Shukant Pal
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation ( #5628 )
2024-06-19 01:46:38 +00:00
Thomas Parnell
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties ( #5639 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-18 14:18:37 -07:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Chang Su
f0cc0e68e3
[Misc] Remove import from transformers logging ( #5625 )
2024-06-18 12:12:19 +00:00
Isotr0py
daef218b55
[Model] Initialize Phi-3-vision support ( #4986 )
2024-06-17 19:34:33 -07:00
sroy745
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
Amit Garg
9333fb8eb9
[Model] Rename Phi3 rope scaling type ( #5595 )
2024-06-17 12:04:14 -04:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
Thomas Parnell
e2afb03c92
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models ( #5460 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-14 20:28:11 +00:00
Robert Shaw
15985680e2
[ Misc ] Rs/compressed tensors cleanup ( #5432 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
2024-06-14 10:01:46 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
Tyler Michael Smith
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 ( #5505 )
2024-06-13 13:38:05 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cyrus Leung
0ce7b952f8
[Doc] Update LLaVA docs ( #5437 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-13 11:22:07 -07:00
wenyujin333
bd43973522
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 ( #5497 )
...
Tune Qwen2-57B-A14B configs based on #4921
Throughput Performance
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2
A100 GPU
benchmark no config w/ PR
tp=2 10.53 requests/s, 11058.17 tokens/s 12.47 requests/s, 13088.57 tokens/s
tp=4 17.77 requests/s, 18662.95 tokens/s 20.20 requests/s, 21212.32 tokens/s
2024-06-13 09:01:10 -07:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
Nick Hill
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
Cyrus Leung
2c0d933594
[Bugfix] Fix LLaVA-NeXT ( #5380 )
2024-06-10 15:38:47 +00:00
Cyrus Leung
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-10 12:47:15 +00:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
Michael Goin
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
Cheng Li
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS ( #5352 )
...
Bug description:
With torch 2.4.0.dev20240603+cu121,
cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112)
This PR fixes the support check for FP8 CUTLASS ( cutlass_fp8_supported) which was introduced in https://github.com/vllm-project/vllm/pull/5183 .
2024-06-08 00:42:05 +00:00
Calvinn Ng
767c727a81
fix DbrxFusedNormAttention missing cache_config ( #5340 )
...
Co-authored-by: team <calvinn.ng@ahrefs.com>
2024-06-07 14:10:21 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels ( #5183 )
...
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8
see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
2024-06-07 08:42:35 +00:00
Antoni Baum
ccdc490dda
[Core] Change LoRA embedding sharding to support loading methods ( #5038 )
2024-06-06 19:07:57 -07:00
Antoni Baum
a31cab7556
[Core] Avoid copying prompt/output tokens if no penalties are used ( #5289 )
2024-06-06 18:12:00 -07:00
Philipp Moritz
abe855d637
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 ( #5294 )
2024-06-06 09:29:29 -07:00
Breno Faria
7b0a0dfb22
[Frontend][Core] Update Outlines Integration from FSM to Guide ( #4109 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-06-05 16:49:12 -07:00
Woosuk Kwon
6a7c7711a2
[Misc] Skip for logits_scale == 1.0 ( #5291 )
2024-06-05 15:19:02 -07:00
Philipp Moritz
51a08e7d8f
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 ( #5238 )
2024-06-05 10:59:14 -07:00
Cody Yu
5563a4dea8
[Model] Correct Mixtral FP8 checkpoint loading ( #5231 )
2024-06-05 10:58:50 -07:00
Woosuk Kwon
41ca62cf03
[Misc] Add CustomOp interface for device portability ( #5255 )
2024-06-05 09:18:19 -07:00
Woosuk Kwon
3a434b07ed
[Kernel] Enhance MoE benchmarking & tuning script ( #4921 )
2024-06-03 20:06:59 -07:00
Toshiki Kataoka
06b2550cbb
[Bugfix] Support prompt_logprobs==0 ( #5217 )
2024-06-03 17:59:30 -07:00
Breno Faria
f775a07e30
[FRONTEND] OpenAI tools support named functions ( #5032 )
2024-06-03 18:25:29 -05:00