Antoni Baum
c33afd89f5
Fix lint ( #3388 )
2024-03-13 13:56:49 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Hui Liu
ba8dc958a3
[Minor] Fix bias in if to remove ambiguity ( #3259 )
2024-03-13 09:16:55 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
DAIZHENWEI
654865e21d
Support Mistral Model Inference with transformers-neuronx ( #3153 )
2024-03-11 13:19:51 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
Zhuohan Li
f48c6791b7
[FIX] Fix prefix test error on main ( #3286 )
2024-03-08 17:16:14 -08:00
Michael Goin
c2c5e0909a
Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir ( #3241 )
2024-03-08 13:33:10 -08:00
Woosuk Kwon
1cb0cc2975
[FIX] Make flash_attn optional ( #3269 )
2024-03-08 10:52:20 -08:00
whyiug
c59e120c55
Feature add lora support for Qwen2 ( #3177 )
2024-03-07 21:58:24 -08:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
TechxGenus
d3c04b6a39
Add GPTQ support for Gemma ( #3200 )
2024-03-07 08:19:14 +08:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models ( #2330 )
2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Woosuk Kwon
4bd18ec0c7
[Minor] Fix type annotation in fused moe ( #3045 )
2024-02-26 19:44:29 -08:00
张大成
48a8f4a7fd
Support Orion model ( #2539 )
...
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-02-26 19:17:06 -08:00
Roy
4dd6416faf
Fix stablelm ( #3038 )
2024-02-26 18:31:10 -08:00
Roy
d9f726c4d0
[Minor] Remove unused config files ( #3039 )
2024-02-26 17:25:22 -08:00
Philipp Moritz
cfc15a1031
Optimize Triton MoE Kernel ( #2979 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-02-26 13:48:56 -08:00
Woosuk Kwon
f7c1234990
[Fix] Fissertion on YaRN model len ( #2984 )
2024-02-23 12:57:48 -08:00
44670
c530e2cfe3
[FIX] Fix a bug in initializing Yarn RoPE ( #2983 )
2024-02-22 01:40:05 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking ( #2820 )
2024-02-21 18:56:01 -08:00
Woosuk Kwon
95529e3253
Use Llama RMSNorm custom op for Gemma ( #2974 )
2024-02-21 18:28:23 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM ( #2868 )
2024-02-21 18:25:05 -08:00
Nick Hill
7d2dcce175
Support per-request seed ( #2514 )
2024-02-21 11:47:00 -08:00
Xiang Xu
5253edaacb
Add Gemma model ( #2964 )
2024-02-21 09:34:30 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
shiyi.c_98
64da65b322
Prefix Caching- fix t4 triton error ( #2517 )
2024-02-16 14:17:55 -08:00
Philipp Moritz
4f2ad11135
Fix DeciLM ( #2883 )
2024-02-14 22:29:57 -08:00
Philipp Moritz
31348dff03
Align LoRA code between Mistral and Mixtral ( fixes #2875 ) ( #2880 )
...
* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.
* Make LoRA logic for Mistral and Mixtral the same
---------
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
2024-02-15 01:00:43 +01:00
Woosuk Kwon
25e86b6a61
Don't use cupy NCCL for AMD backends ( #2855 )
2024-02-14 12:30:44 -08:00
Roy
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM ( #2867 )
2024-02-14 12:30:24 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 ( #2861 )
2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM ( #2860 )
...
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
317b29de0f
Remove Yi model definition, please use LlamaForCausalLM instead ( #2854 )
...
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 14:22:22 -08:00
Woosuk Kwon
a463c333dd
Use CuPy for CUDA graphs ( #2811 )
2024-02-13 11:32:06 -08:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
...
This reverts commit 5c976a7e1a1bec875bf6474824b7dff39e38de18.
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 ( #2723 )
...
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Hongxia Yang
0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention ( #2768 )
2024-02-10 23:14:37 -08:00