Robert Shaw | 889da130e7 | [ Misc ] fp8-marlin channelwise via compressed-tensors (#6524); Co-authored-by: mgoin <michael@neuralmagic.com> | 2024-07-25 09:46:04 -07:00
Alphi | b75e314fff | [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V (#6787); Co-authored-by: hezhihui <hzh7269@modelbest.cn>; Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> | 2024-07-25 09:42:49 -07:00
Alexander Matveev | 0310029a2f | [Bugfix] Fix awq_marlin and gptq_marlin flags (#6745) | 2024-07-24 22:34:11 -07:00
Cody Yu | 309aaef825 | [Bugfix] Fix decode tokens w. CUDA graph (#6757) | 2024-07-24 22:33:56 -07:00
Alphi | 9e169a4c61 | [Model] Adding support for MiniCPM-V (#4087) | 2024-07-24 20:59:30 -07:00
Evan Z. Liu | 5689e256ba | [Frontend] Represent tokens with identifiable strings (#6626) | 2024-07-25 09:51:00 +08:00
youkaichao | 740374d456 | [core][distributed] fix zmq hang (#6759) | 2024-07-24 17:37:12 -07:00
Antoni Baum | 5448f67635 | [Core] Tweaks to model runner/input builder developer APIs (#6712) | 2024-07-24 12:17:12 -07:00
Antoni Baum | 0e63494cf3 | Add fp8 support to reshape_and_cache_flash (#6667) | 2024-07-24 18:36:52 +00:00
Daniele | ee812580f7 | [Frontend] split run_server into build_server and run_server (#6740) | 2024-07-24 10:36:04 -07:00
Allen.Dou | 40468b13fa | [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. (#6686) | 2024-07-24 08:58:42 -07:00
LF Marques | 545146349c | Adding f-string to validation error which is missing (#6748) | 2024-07-24 08:55:53 -07:00
liuyhwangyh | f4f8a9d892 | [Bugfix]fix modelscope compatible issue (#6730) | 2024-07-24 05:04:46 -07:00
Roger Wang | 0a740a11ba | [Bugfix] Fix token padding for chameleon (#6724) | 2024-07-24 01:05:09 -07:00
William Lin | 5e8ca973eb | [Bugfix] fix flashinfer cudagraph capture for PP (#6708) | 2024-07-24 01:49:44 +00:00
dongmao zhang | 87525fab92 | [bitsandbytes]: support read bnb pre-quantized model (#5753); Co-authored-by: Michael Goin <michael@neuralmagic.com> | 2024-07-23 23:45:09 +00:00
Thomas Parnell | 2f808e69ab | [Bugfix] StatLoggers: cache spec decode metrics when they get collected. (#6645); Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2024-07-23 23:05:05 +00:00
Roger Wang | 1bedf210e3 | Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) | 2024-07-23 13:47:48 -07:00
Travis Johnson | 507ef787d8 | [Model] Pipeline Parallel Support for DeepSeek v2 (#6519); Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> | 2024-07-23 12:22:09 -07:00
Yehoshua Cohen | 58f53034ad | [Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) | 2024-07-23 11:41:55 -07:00
Michael Goin | 0eb0757bef | [Misc] Add ignored layers for fp8 quantization (#6657) | 2024-07-23 14:04:04 -04:00
Simon Mo | 38c4b7e863 | Bump version to 0.5.3.post1 (#6696) | 2024-07-23 10:08:59 -07:00
Woosuk Kwon | a112a84aad | [BugFix] Fix RoPE error in Llama 3.1 (#6693) | 2024-07-23 09:46:05 -07:00
Woosuk Kwon | 461089a21a | [Bugfix] Fix a log error in chunked prefill (#6694) | 2024-07-23 09:27:58 -07:00
Simon Mo | bb2fc08072 | Bump version to v0.5.3 (#6674) | 2024-07-23 00:00:08 -07:00
Simon Mo | 3eda4ec780 | support ignore patterns in model loader (#6673) | 2024-07-22 23:59:42 -07:00
Roger Wang | 22fa2e35cb | [VLM][Model] Support image input for Chameleon (#6633) | 2024-07-22 23:50:48 -07:00
youkaichao | c5201240a4 | [misc] only tqdm for first rank (#6672) | 2024-07-22 21:57:27 -07:00
Cyrus Leung | 97234be0ec | [Misc] Manage HTTP connections in one place (#6600) | 2024-07-22 21:32:02 -07:00
Michael Goin | 9e0b558a09 | [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) | 2024-07-23 04:11:50 +00:00
zhaotyer | e519ae097a | add tqdm when loading checkpoint shards (#6569); Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>; Co-authored-by: youkaichao <youkaichao@126.com> | 2024-07-22 20:48:01 -07:00
youkaichao | 7c2749a4fd | [misc] add start loading models for users information (#6670) | 2024-07-22 20:08:02 -07:00
Woosuk Kwon | 729171ae58 | [Misc] Enable chunked prefill by default for long context models (#6666) | 2024-07-22 20:03:13 -07:00
Cheng Li | c5e8330997 | [Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization (#6665) | 2024-07-22 19:25:05 -07:00
Cody Yu | e0c15758b8 | [Core] Modulize prepare input and attention metadata builder (#6596) | 2024-07-23 00:45:24 +00:00
Woosuk Kwon | bdf5fd1386 | [Misc] Remove deprecation warning for beam search (#6659) | 2024-07-23 00:21:58 +00:00
Jiaxin Shan | 42c7f66a38 | [Core] Support dynamically loading Lora adapter from HuggingFace (#6234); Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> | 2024-07-22 15:42:40 -07:00
Cyrus Leung | 739b61a348 | [Frontend] Refactor prompt processing (#4028); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-07-22 10:13:53 -07:00
Jae-Won Chung | 89c1c6a196 | [Bugfix] Fix vocab_size field access in llava_next.py (#6624) | 2024-07-22 05:02:51 +00:00
Woosuk Kwon | 42de2cefcb | [Misc] Add a wrapper for torch.inference_mode (#6618) | 2024-07-21 18:43:11 -07:00
Roger Wang | c9eef37f32 | [Model] Initial Support for Chameleon (#5770) | 2024-07-21 17:37:51 -07:00
Alexander Matveev | 396d92d5e0 | [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) | 2024-07-21 19:41:42 -04:00
Isotr0py | 25e778aa16 | [Model] Refactor and decouple phi3v image embedding (#6621) | 2024-07-21 16:07:58 -07:00
Woosuk Kwon | b6df37f943 | [Misc] Remove abused noqa (#6619) | 2024-07-21 23:47:04 +08:00
sroy745 | 14f91fe67c | [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) | 2024-07-20 23:58:58 -07:00
Cyrus Leung | d7f4178dd9 | [Frontend] Move chat utils (#6602); Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-07-21 08:38:17 +08:00
Robert Shaw | 082ecd80d5 | [ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) | 2024-07-20 17:25:56 -06:00
Michael Goin | f952bbc8ff | [Misc] Fix input_scale typing in w8a8_utils.py (#6579) | 2024-07-20 23:11:13 +00:00
Robert Shaw | 9364f74eee | [ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) | 2024-07-20 18:50:10 +00:00
Matt Wong | 06d6c5fe9f | [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) | 2024-07-20 09:39:07 -07:00