521 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Roy | f1c0fc3919 | Migrate logits computation and gather to model_runner (#3233) | 2024-03-20 23:25:01 +00:00 |
| SangBin Cho | 6e435de766 | [1/n][Chunked Prefill] Refactor input query shapes (#3236) | 2024-03-20 14:46:05 -07:00 |
| Antoni Baum | 426ec4ec67 | [1/n] Triton sampling kernel (#3186); Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> | 2024-03-20 14:45:08 -07:00 |
| Woosuk Kwon | 5ee14494e4 | [Misc] Remove cache stream and cache events (#3461) | 2024-03-20 00:38:53 -07:00 |
| Nick Hill | 4ad521d8b5 | [Core] Add generic typing to LRUCache (#3511) | 2024-03-20 00:36:09 -07:00 |
| ElizaWszola | 9474e89ba4 | [PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357); Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> | 2024-03-20 00:11:11 -07:00 |
| Simon Mo | 20478c4d3a | Use lru_cache for some environment detection utils (#3508) | 2024-03-19 21:34:15 +00:00 |
| Simon Mo | cc63d03fbb | Revert "[Core] Cache some utils" (#3507) | 2024-03-19 13:22:58 -07:00 |
| Nick Hill | 7341c77d69 | [BugFix] Avoid initializing CUDA too early (#3487) | 2024-03-18 23:05:20 -07:00 |
| Simon Mo | ef65dcfa6f | [Doc] Add docs about OpenAI compatible server (#3288) | 2024-03-18 22:05:34 -07:00 |
| youkaichao | 6a9c583e73 | [Core] print error before deadlock (#3459) | 2024-03-19 04:06:23 +00:00 |
| Antoni Baum | b37cdce2b1 | [Core] Cache some utils (#3474) | 2024-03-18 17:14:26 -07:00 |
| Antoni Baum | 49eedea373 | [Core] Zero-copy asdict for InputMetadata (#3475) | 2024-03-18 22:56:40 +00:00 |
| Woosuk Kwon | abfc4f3387 | [Misc] Use dataclass for InputMetadata (#3452); Co-authored-by: youkaichao <youkaichao@126.com> | 2024-03-17 10:02:46 +00:00 |
| Simon Mo | 120157fd2a | Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) | 2024-03-16 13:35:27 -07:00 |
| Robert Shaw | 10585e035e | Removed Extraneous Print Message From OAI Server (#3440) | 2024-03-16 00:35:36 +00:00 |
| Antoni Baum | fb96c1e98c | Asynchronous tokenization (#2879) | 2024-03-15 23:37:01 +00:00 |
| Tao He | 14b8ae02e7 | Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220); Signed-off-by: Tao He <sighingnow@gmail.com>; Co-authored-by: simon-mo <simon.mo@hey.com> | 2024-03-15 18:25:43 +00:00 |
| Dan Clark | 03d37f2441 | [Fix] Add args for mTLS support (#3430); Co-authored-by: declark1 <daniel.clark@ibm.com> | 2024-03-15 09:56:13 -07:00 |
| Yang Fan | a7c871680e | Fix tie_word_embeddings for Qwen2. (#3344) | 2024-03-15 09:36:53 -07:00 |
| Junda Chen | 429284dc37 | Fix dist.broadcast stall without group argument (#3408) | 2024-03-14 23:25:05 -07:00 |
| youkaichao | b522c4476f | [Misc] add HOST_IP env var (#3419); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-03-14 21:32:52 -07:00 |
| Enrique Shockwave | b983ba35bd | fix marlin config repr (#3414) | 2024-03-14 16:26:19 -07:00 |
| 陈序 | 54be8a0be2 | Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373); Co-authored-by: Cade Daniel <edacih@gmail.com> | 2024-03-14 13:56:57 -07:00 |
| Dan Clark | c17ca8ef18 | Add args for mTLS support (#3410); Co-authored-by: Daniel Clark <daniel.clark@ibm.com> | 2024-03-14 13:11:45 -07:00 |
| youkaichao | 8fe8386591 | [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) | 2024-03-14 08:11:48 +00:00 |
| Zhuohan Li | eeab52a4ff | [FIX] Simpler fix for async engine running on ray (#3371) | 2024-03-13 14:18:40 -07:00 |
| Antoni Baum | c33afd89f5 | Fix lint (#3388) | 2024-03-13 13:56:49 -07:00 |
| Terry | 7e9bd08f60 | Add batched RoPE kernel (#3095) | 2024-03-13 13:45:26 -07:00 |
| Hui Liu | ba8dc958a3 | [Minor] Fix bias in if to remove ambiguity (#3259) | 2024-03-13 09:16:55 -07:00 |
| Bo-Wen Wang | b167109ba1 | [Fix] Fix quantization="gptq" when using Marlin (#3319); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-03-12 22:51:42 -07:00 |
| Woosuk Kwon | 602358f8a8 | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 22:06:17 -07:00 |
| Breno Faria | 49a3c8662b | Fixes #1556 double free (#3347) | 2024-03-13 00:30:08 +00:00 |
| DAIZHENWEI | 654865e21d | Support Mistral Model Inference with transformers-neuronx (#3153) | 2024-03-11 13:19:51 -07:00 |
| Zhuohan Li | 4c922709b6 | Add distributed model executor abstraction (#3191) | 2024-03-11 11:03:45 -07:00 |
| Zhuohan Li | 2f8844ba08 | Re-enable the 80 char line width limit (#3305) | 2024-03-10 19:49:14 -07:00 |
| Nick Hill | 4b59f00e91 | [Fix] Fix best_of behavior when n=1 (#3298) | 2024-03-10 19:17:46 -07:00 |
| Roy | 9e8744a545 | [BugFix] Fix get tokenizer when using ray (#3301) | 2024-03-10 19:17:16 -07:00 |
| Cade Daniel | 8437bae6ef | [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) | 2024-03-08 23:32:46 -08:00 |
| Zhuohan Li | f48c6791b7 | [FIX] Fix prefix test error on main (#3286) | 2024-03-08 17:16:14 -08:00 |
| Michael Goin | c2c5e0909a | Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir (#3241) | 2024-03-08 13:33:10 -08:00 |
| Woosuk Kwon | 1cb0cc2975 | [FIX] Make flash_attn optional (#3269) | 2024-03-08 10:52:20 -08:00 |
| whyiug | c59e120c55 | Feature add lora support for Qwen2 (#3177) | 2024-03-07 21:58:24 -08:00 |
| Nick Hill | d2339d6840 | Connect engine healthcheck to openai server (#3260) | 2024-03-07 16:38:12 -08:00 |
| ElizaWszola | b35cc93420 | Fix auto prefix bug (#3239) | 2024-03-07 16:37:28 -08:00 |
| jacobthebanana | 8cbba4622c | Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) | 2024-03-07 23:03:22 +00:00 |
| Michael Goin | 385da2dae2 | Measure model memory usage (#3120) | 2024-03-07 11:42:42 -08:00 |
| Woosuk Kwon | 2daf23ab0c | Separate attention backends (#3005) | 2024-03-07 01:45:50 -08:00 |
| TechxGenus | d3c04b6a39 | Add GPTQ support for Gemma (#3200) | 2024-03-07 08:19:14 +08:00 |
| Chujie Zheng | 4cb3b924cd | Add tqdm dynamic_ncols=True (#3242) | 2024-03-06 22:41:42 +00:00 |