606 Commits

Author SHA1 Message Date
Tyler Michael Smith
5932634409
Unmark fused_moe config json file as executable (#5960) 2024-06-28 06:36:12 -07:00
Cyrus Leung
5cbe8d155c
[Core] Registry for processing model inputs (#5214)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-28 12:09:56 +00:00
Divakar Verma
c3dde367f1
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932) 2024-06-27 13:41:08 -07:00
Woosuk Kwon
79c92c7c8a
[Model] Add Gemma 2 (#5908) 2024-06-27 13:33:56 -07:00
Nick Hill
691e29ecf3
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens (#5876) 2024-06-27 10:59:33 -07:00
Cyrus Leung
98cf2ed678
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) 2024-06-27 09:08:10 -07:00
Roger Wang
2061f0b8a7
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888) 2024-06-27 08:29:24 +00:00
Cyrus Leung
96354d6a29
[Model] Add base class for LoRA-supported models (#5018) 2024-06-27 16:03:04 +08:00
Woosuk Kwon
6806998bf9
[Bugfix] Fix embedding to support 2D inputs (#5829) 2024-06-26 00:15:22 -07:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 (#5794) 2024-06-25 19:23:35 +00:00
Isotr0py
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772) 2024-06-24 12:11:53 +08:00
Jie Fu (傅杰)
9c62db07ed
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (#5710)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-22 02:07:08 +00:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU (#5717) 2024-06-21 06:39:40 +00:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support (#4947)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
2024-06-20 20:23:12 -04:00
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) 2024-06-20 17:00:13 -06:00
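A hedged illustration of the mechanism this commit title names: a parser of this kind can subclass argparse.ArgumentParser and normalize underscores to dashes in option names before parsing. This is a minimal sketch of the idea, not necessarily vLLM's exact implementation.

```python
import argparse
import sys

class FlexibleArgumentParser(argparse.ArgumentParser):
    """Accept both --foo-bar and --foo_bar for the same option (sketch)."""

    def parse_args(self, args=None, namespace=None):
        if args is None:
            args = sys.argv[1:]
        processed = []
        for arg in args:
            if arg.startswith("--"):
                # Normalize the option name only; leave any "=value" intact.
                key, eq, value = arg.partition("=")
                arg = key.replace("_", "-") + eq + value
            processed.append(arg)
        return super().parse_args(processed, namespace)

# With parser.add_argument("--max-model-len", type=int), both
# "--max-model-len 4096" and "--max_model_len 4096" now parse identically.
```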
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) 2024-06-20 18:36:10 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs (#5591) 2024-06-20 11:52:09 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) 2024-06-19 18:06:44 -04:00
Michael Goin
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 (#5656) 2024-06-19 09:38:26 +00:00
Shukant Pal
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628) 2024-06-19 01:46:38 +00:00
Thomas Parnell
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-18 14:18:37 -07:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r (#5178) 2024-06-18 11:01:21 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) 2024-06-18 12:45:05 -04:00
Chang Su
f0cc0e68e3
[Misc] Remove import from transformers logging (#5625) 2024-06-18 12:12:19 +00:00
Isotr0py
daef218b55
[Model] Initialize Phi-3-vision support (#4986) 2024-06-17 19:34:33 -07:00
sroy745
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) 2024-06-17 21:29:09 -05:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support (#5435) 2024-06-17 12:32:48 -04:00
Amit Garg
9333fb8eb9
[Model] Rename Phi3 rope scaling type (#5595) 2024-06-17 12:04:14 -04:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Thomas Parnell
e2afb03c92
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-14 20:28:11 +00:00
Robert Shaw
15985680e2
[ Misc ] Rs/compressed tensors cleanup (#5432)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
2024-06-14 10:01:46 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) 2024-06-14 09:30:15 -07:00
Tyler Michael Smith
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 (#5505) 2024-06-13 13:38:05 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels (#5391)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cyrus Leung
0ce7b952f8
[Doc] Update LLaVA docs (#5437)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-13 11:22:07 -07:00
wenyujin333
bd43973522
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497)
Tune Qwen2-57B-A14B configs based on #4921

Throughput Performance
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2

A100 GPU

benchmark | no config                           | w/ PR
tp=2      | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s
tp=4      | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s
2024-06-13 09:01:10 -07:00
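For readers unfamiliar with these tuned configs: vLLM's fused-MoE Triton kernels read their launch parameters from per-device JSON files keyed by batch size. The sketch below shows the general lookup idea; the file name and config keys are illustrative examples, not a guaranteed layout.

```python
import json

def load_moe_configs(path: str) -> dict:
    """Load a tuned-config file; JSON keys are batch sizes stored as strings."""
    with open(path) as f:
        return {int(m): cfg for m, cfg in json.load(f).items()}

def pick_config(configs: dict, m: int) -> dict:
    # Use the config tuned for the batch size closest to the actual M.
    return configs[min(configs, key=lambda key: abs(key - m))]

# Example (file name and keys are illustrative):
# configs = load_moe_configs("E=64,N=1280,device_name=NVIDIA_A100-SXM4-80GB.json")
# cfg = pick_config(configs, m=32)
# -> e.g. {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 32,
#          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 4}
```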
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors (#5385)
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models (#4990)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration (#5292) 2024-06-12 11:53:03 -07:00
Nick Hill
a008629807
[Misc] Various simplifications and typing fixes (#5368) 2024-06-11 10:29:02 +08:00
Cyrus Leung
2c0d933594
[Bugfix] Fix LLaVA-NeXT (#5380) 2024-06-10 15:38:47 +00:00
Cyrus Leung
6b29d6fe70
[Model] Initial support for LLaVA-NeXT (#4199)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-10 12:47:15 +00:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config (#5350)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
Michael Goin
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) 2024-06-08 13:54:05 -04:00
Cheng Li
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS (#5352)
Bug description:
With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported returns False, and the (capability, version) pair before the comparison is (90, 11111111112).

This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
2024-06-08 00:42:05 +00:00
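A minimal sketch of the kind of check being fixed here, assuming support is gated on the device's compute capability and the CUDA version torch was built with. The thresholds and encoding below are illustrative, not vLLM's exact logic.

```python
import torch

def cutlass_fp8_supported() -> bool:
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return False
    # Compute capability as a two-digit int, e.g. (9, 0) -> 90.
    major, minor = torch.cuda.get_device_capability()
    capability = major * 10 + minor
    # Parse only the major.minor part of torch.version.cuda ("12.1" -> 121).
    # A naive digit-concatenation parse of a dev build string is the sort of
    # thing that yields nonsense like the (90, 11111111112) reported above.
    cuda_major, cuda_minor = map(int, torch.version.cuda.split(".")[:2])
    version = cuda_major * 10 + cuda_minor
    # Illustrative thresholds: Hopper (>= 90) with CUDA 12.0+,
    # Ada (89) with CUDA 12.4+.
    if capability >= 90:
        return version >= 120
    if capability == 89:
        return version >= 124
    return False
```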
Calvinn Ng
767c727a81
fix DbrxFusedNormAttention missing cache_config (#5340)
Co-authored-by: team <calvinn.ng@ahrefs.com>
2024-06-07 14:10:21 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183)
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in end-to-end performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.

See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
2024-06-07 08:42:35 +00:00
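A rough sketch of the dispatch this commit describes, with cutlass_scaled_mm standing in for the custom CUTLASS op (the name and signature are placeholders; this is not vLLM's actual code):

```python
import torch

def cutlass_scaled_mm(a, b, scale_a, scale_b, out_dtype):
    # Placeholder for the custom CUTLASS FP8 GEMM, which fuses the dequant
    # scales into the kernel epilogue. In vLLM this is a registered custom op.
    raise NotImplementedError("stand-in for the custom CUTLASS op")

def fp8_gemm(a_fp8, b_fp8, scale_a, scale_b, out_dtype, use_cutlass: bool):
    if use_cutlass:
        return cutlass_scaled_mm(a_fp8, b_fp8, scale_a, scale_b, out_dtype)
    # Fallback: torch's private scaled matmul. Its exact signature and return
    # value (bare tensor vs. an (out, amax) tuple) vary across torch versions.
    return torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b,
                            out_dtype=out_dtype)
```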