Robert Shaw | 10585e035e | Removed Extraneous Print Message From OAI Server (#3440) | 2024-03-16 00:35:36 +00:00
Antoni Baum | fb96c1e98c | Asynchronous tokenization (#2879) | 2024-03-15 23:37:01 +00:00
Tao He | 14b8ae02e7 | Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) (Signed-off-by: Tao He <sighingnow@gmail.com>, Co-authored-by: simon-mo <simon.mo@hey.com>) | 2024-03-15 18:25:43 +00:00
Dan Clark | 03d37f2441 | [Fix] Add args for mTLS support (#3430) (Co-authored-by: declark1 <daniel.clark@ibm.com>) | 2024-03-15 09:56:13 -07:00
Yang Fan | a7c871680e | Fix tie_word_embeddings for Qwen2. (#3344) | 2024-03-15 09:36:53 -07:00
Junda Chen | 429284dc37 | Fix dist.broadcast stall without group argument (#3408) | 2024-03-14 23:25:05 -07:00
youkaichao | b522c4476f | [Misc] add HOST_IP env var (#3419) (Co-authored-by: Simon Mo <simon.mo@hey.com>) | 2024-03-14 21:32:52 -07:00
Enrique Shockwave | b983ba35bd | fix marlin config repr (#3414) | 2024-03-14 16:26:19 -07:00
陈序 | 54be8a0be2 | Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373) (Co-authored-by: Cade Daniel <edacih@gmail.com>) | 2024-03-14 13:56:57 -07:00
Dan Clark | c17ca8ef18 | Add args for mTLS support (#3410) (Co-authored-by: Daniel Clark <daniel.clark@ibm.com>) | 2024-03-14 13:11:45 -07:00
youkaichao | 8fe8386591 | [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) | 2024-03-14 08:11:48 +00:00
Zhuohan Li | eeab52a4ff | [FIX] Simpler fix for async engine running on ray (#3371) | 2024-03-13 14:18:40 -07:00
Antoni Baum | c33afd89f5 | Fix lint (#3388) | 2024-03-13 13:56:49 -07:00
Terry | 7e9bd08f60 | Add batched RoPE kernel (#3095) | 2024-03-13 13:45:26 -07:00
Hui Liu | ba8dc958a3 | [Minor] Fix bias in if to remove ambiguity (#3259) | 2024-03-13 09:16:55 -07:00
Bo-Wen Wang | b167109ba1 | [Fix] Fix quantization="gptq" when using Marlin (#3319) (Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2024-03-12 22:51:42 -07:00
Woosuk Kwon | 602358f8a8 | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 22:06:17 -07:00
Breno Faria | 49a3c8662b | Fixes #1556 double free (#3347) | 2024-03-13 00:30:08 +00:00
DAIZHENWEI | 654865e21d | Support Mistral Model Inference with transformers-neuronx (#3153) | 2024-03-11 13:19:51 -07:00
Zhuohan Li | 4c922709b6 | Add distributed model executor abstraction (#3191) | 2024-03-11 11:03:45 -07:00
Zhuohan Li | 2f8844ba08 | Re-enable the 80 char line width limit (#3305) | 2024-03-10 19:49:14 -07:00
Nick Hill | 4b59f00e91 | [Fix] Fix best_of behavior when n=1 (#3298) | 2024-03-10 19:17:46 -07:00
Roy | 9e8744a545 | [BugFix] Fix get tokenizer when using ray (#3301) | 2024-03-10 19:17:16 -07:00
Cade Daniel | 8437bae6ef | [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) | 2024-03-08 23:32:46 -08:00
Zhuohan Li | f48c6791b7 | [FIX] Fix prefix test error on main (#3286) | 2024-03-08 17:16:14 -08:00
Michael Goin | c2c5e0909a | Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir (#3241) | 2024-03-08 13:33:10 -08:00
Woosuk Kwon | 1cb0cc2975 | [FIX] Make flash_attn optional (#3269) | 2024-03-08 10:52:20 -08:00
whyiug | c59e120c55 | Feature add lora support for Qwen2 (#3177) | 2024-03-07 21:58:24 -08:00
Nick Hill | d2339d6840 | Connect engine healthcheck to openai server (#3260) | 2024-03-07 16:38:12 -08:00
ElizaWszola | b35cc93420 | Fix auto prefix bug (#3239) | 2024-03-07 16:37:28 -08:00
jacobthebanana | 8cbba4622c | Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) | 2024-03-07 23:03:22 +00:00
Michael Goin | 385da2dae2 | Measure model memory usage (#3120) | 2024-03-07 11:42:42 -08:00
Woosuk Kwon | 2daf23ab0c | Separate attention backends (#3005) | 2024-03-07 01:45:50 -08:00
TechxGenus | d3c04b6a39 | Add GPTQ support for Gemma (#3200) | 2024-03-07 08:19:14 +08:00
Chujie Zheng | 4cb3b924cd | Add tqdm dynamic_ncols=True (#3242) | 2024-03-06 22:41:42 +00:00
Cade Daniel | a33ce60c66 | [Testing] Fix core tests (#3224) | 2024-03-06 01:04:23 -08:00
Nick Hill | 2efce05dc3 | [Fix] Avoid pickling entire LLMEngine for Ray workers (#3207) (Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>) | 2024-03-06 00:17:20 +00:00
Nick Hill | 8999ec3c16 | Store eos_token_id in Sequence for easy access (#3166) | 2024-03-05 15:35:43 -08:00
Hongxia Yang | 05af6da8d9 | [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123) (Co-authored-by: lcskrishna <lollachaitanya@gmail.com>) | 2024-03-04 18:14:53 -08:00
Antoni Baum | ff578cae54 | Add health check, make async Engine more robust (#3015) (Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>) | 2024-03-04 22:01:40 +00:00
Antoni Baum | 22de45235c | Push logprob generation to LLMEngine (#3065) (Co-authored-by: Avnish Narayan <avnish@anyscale.com>) | 2024-03-04 19:54:06 +00:00
ttbachyinsda | 76e8a70476 | [Minor fix] The domain dns.google may cause a socket.gaierror exception (#3176) (Co-authored-by: guofangze <guofangze@kuaishou.com>) | 2024-03-04 19:17:12 +00:00
Philipp Moritz | 17c3103c56 | Make it easy to profile workers with nsight (#3162) (Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>) | 2024-03-03 16:19:13 -08:00
Zhuohan Li | 996d095c54 | [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) | 2024-03-03 14:37:18 -08:00
Jason Cox | d65fac2738 | Add vLLM version info to logs and openai API server (#3161) | 2024-03-02 21:00:29 -08:00
Sage Moore | ce4f5a29fb | Add Automatic Prefix Caching (#2762) (Co-authored-by: ElizaWszola <eliza@neuralmagic.com>, Co-authored-by: Michael Goin <michael@neuralmagic.com>) | 2024-03-02 00:50:01 -08:00
cloudhan | baee28c46c | Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) | 2024-03-02 14:34:48 +08:00
Allen.Dou | 29e70e3e88 | allow user chose log level by --log-level instead of fixed 'info'. (#3109) (Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>, Co-authored-by: Simon Mo <simon.mo@hey.com>) | 2024-03-01 23:28:41 +00:00
Woosuk Kwon | 82091b864a | Bump up to v0.3.3 (#3129) | 2024-03-01 12:58:06 -08:00
Robert Shaw | c0c2335ce0 | Integrate Marlin Kernels for Int4 GPTQ inference (#2497) (Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>, Co-authored-by: alexm <alexm@neuralmagic.com>) | 2024-03-01 12:47:51 -08:00