3077 Commits

Author SHA1 Message Date
Kuntai Du
ca30c3c84b
[Core] Remove evictor_v1 (#9572) 2024-10-22 04:55:49 +00:00
Wallas Henrique
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones (#9570)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-22 04:52:14 +00:00
Falko1
74692421f7
[Bugfix]: phi.py gets rope_theta from config file (#9503)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-22 02:53:36 +00:00
ngrozae
29acd2c34c
[Bugfix][OpenVINO] fix_dockerfile_openvino (#9552) 2024-10-21 19:47:52 -07:00
Cyrus Leung
f085995a7b
[CI/Build] Remove unnecessary fork_new_process (#9484) 2024-10-21 19:47:29 -07:00
Travis Johnson
b729901139
[Bugfix]: serialize config by value for --trust-remote-code (#6751)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-21 19:46:24 -07:00
youkaichao
76a5e13270
[core] move parallel sampling out from vllm core (#9302) 2024-10-22 00:31:44 +00:00
Joe Runde
ef7faad1b8
🐛 Fixup more test failures from memory profiling (#9563)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-21 17:10:56 -07:00
Kuntai Du
575dcebe9a
[CI] Make format checker error message more user-friendly by using emoji (#9564)
This PR makes the format checker's error messages more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
Wallas Henrique
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch (#9023)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-21 14:49:41 -07:00
Nick Hill
15713e3b75
[BugFix] Update draft model TP size check to allow matching target TP size (#9394)
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com>
2024-10-21 14:14:29 -07:00
youkaichao
d621c43df7
[doc] fix format (#9562) 2024-10-21 13:54:57 -07:00
Nick Hill
9d9186be97
[Frontend] Reduce frequency of client cancellation checking (#7959) 2024-10-21 13:28:10 -07:00
Michael Goin
5241aa1494
[Model][Bugfix] Fix batching with multi-image in PixtralHF (#9518) 2024-10-21 14:20:07 -04:00
Varad Ahirwadkar
ec6bd6c4c6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint (#9492)
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com>
2024-10-21 17:43:02 +00:00
yudian0504
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder (#9549) 2024-10-21 17:33:30 +00:00
Dhia Eddine Rhaiem
f6b97293aa
[Model] FalconMamba Support (#9325) 2024-10-21 12:50:16 -04:00
Thomas Parnell
496e991da8
[Doc] Consistent naming of attention backends (#9498)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-10-21 22:29:57 +08:00
Cyrus Leung
696b01af8f
[CI/Build] Split up decoder-only LM tests (#9488)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-20 21:27:50 -07:00
Andy Dai
855e0e6f97
[Frontend][Misc] Goodput metric support (#9338) 2024-10-20 18:39:32 +00:00
Chen Zhang
4fa3e33349
[Kernel] Support sliding window in flash attention backend (#9403) 2024-10-20 10:57:52 -07:00
Michael Goin
962d2c6349
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision (#9520) 2024-10-20 05:29:14 +00:00
Chen Zhang
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger (#9530) 2024-10-20 00:05:02 +00:00
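Usage note for the entry above: in vLLM's offline API, a JSON schema reaches guided decoding through GuidedDecodingParams attached to SamplingParams. A minimal sketch follows; the schema and model name are illustrative, not taken from PR #9530.

```python
# Minimal sketch of constraining generation with a JSON schema via
# GuidedDecodingParams; schema and model are examples, not from the PR.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

params = SamplingParams(
    max_tokens=64,
    guided_decoding=GuidedDecodingParams(json=schema),  # constrain output to the schema
)
llm = LLM(model="facebook/opt-125m")  # example model
print(llm.generate(["Describe a person as JSON:"], params)[0].outputs[0].text)
```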
Michael Goin
8e3e7f2713
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf (#9514) 2024-10-19 10:44:29 -04:00
Cyrus Leung
263d8ee150
[Bugfix] Fix missing task for speculative decoding (#9524) 2024-10-19 06:49:40 +00:00
Yue Zhang
c5eea3c8ba
[Frontend] Support simpler image input format (#9478) 2024-10-18 23:17:07 -07:00
Russell Bryant
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow (#9511)
Signed-off-by: Russell Bryant <russell.bryant@gmail.com>
2024-10-19 06:04:18 +00:00
Russell Bryant
dfd951ed9b
[CI/Build] Add error matching for ruff output (#9513)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-19 05:42:20 +00:00
Joe Runde
82c25151ec
[Doc] update gpu-memory-utilization flag docs (#9507)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-19 11:26:36 +08:00
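For the gpu-memory-utilization docs entry above: the flag (exposed as the gpu_memory_utilization argument in the Python API) caps the fraction of each GPU's memory vLLM may reserve for model weights and KV cache. A minimal sketch with an arbitrary example model:

```python
# Minimal sketch (not from the PR): cap vLLM's GPU memory reservation at 80%.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",       # any supported model
    gpu_memory_utilization=0.80,     # reserve at most 80% of each GPU's memory
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```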
Nick Hill
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily (#9521) 2024-10-18 20:21:01 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling (#9516)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
sasha0552
337ed76671
[Bugfix] Fix offline mode when using mistral_common (#9457) 2024-10-18 18:12:32 -07:00
Thomas Parnell
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores (#9497)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-10-18 17:55:48 -07:00
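For the FlashInfer entry above: attention-backend selection in vLLM is driven by environment variables, as sketched below. VLLM_ATTENTION_BACKEND is an existing knob; the exact name of the tensor-core override added by this PR is an assumption here (written as VLLM_FLASHINFER_FORCE_TENSOR_CORES) and should be verified against the source.

```python
# Sketch only: select the FlashInfer backend and force tensor-core use.
# VLLM_FLASHINFER_FORCE_TENSOR_CORES is an assumed variable name.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_FLASHINFER_FORCE_TENSOR_CORES"] = "1"  # assumed name, check the PR

from vllm import LLM  # import after setting env vars so they take effect

llm = LLM(model="facebook/opt-125m")  # example model
```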
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Kunjan
9bb10a7d27
[MISC] Add lora requests to metrics (#9477)
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
Michael Goin
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format (#9036) 2024-10-18 13:29:56 -06:00
Russell Bryant
67a7e5ef38
[CI/Build] Add error matching config for mypy (#9512) 2024-10-18 12:17:53 -07:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Russell Bryant
7dbe738d65
[Misc] benchmark: Add option to set max concurrency (#9390)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-18 11:15:28 -07:00
Tyler Michael Smith
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py (#9505) 2024-10-18 16:59:19 +00:00
Cyrus Leung
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer (#9504) 2024-10-19 00:09:35 +08:00
Nick Hill
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming (#9475) 2024-10-18 14:10:26 +00:00
tomeras91
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser (#9154) 2024-10-18 10:27:48 +00:00
Nick Hill
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473) 2024-10-18 07:19:53 +00:00
Russell Bryant
944dd8edaf
[CI/Build] Use commit hash references for github actions (#9430) 2024-10-17 21:54:58 -07:00
Haoyu Wang
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 (#9467) 2024-10-18 04:40:14 +00:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
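The entry above moves peak-memory profiling onto torch.cuda.memory_stats(). A minimal, standalone sketch of reading the peak-allocated counter that API exposes (not vLLM's actual profiling code):

```python
# Minimal sketch: torch.cuda.memory_stats() exposes allocator counters,
# including the peak number of bytes ever allocated on the device.
import torch

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                                   # do some work that allocates memory
    stats = torch.cuda.memory_stats()
    peak_bytes = stats["allocated_bytes.all.peak"]
    print(f"peak allocated: {peak_bytes / 1024**2:.1f} MiB")
```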
Dipika Sikka
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing (#9381) 2024-10-17 18:54:00 -07:00
Robert Shaw
343f8e0905
Support BERTModel (first encoder-only embedding model) (#9056)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
Shashwat Srijan
bb76538bbd
[Hardware][Neuron] Simplify model load for transformers-neuronx library (#9380) 2024-10-17 15:39:39 -07:00