3052 Commits

Author SHA1 Message Date
Yue Zhang
c5eea3c8ba
[Frontend] Support simpler image input format (#9478) 2024-10-18 23:17:07 -07:00
Russell Bryant
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow (#9511)
Signed-off-by: Russell Bryant <russell.bryant@gmail.com>
2024-10-19 06:04:18 +00:00
Russell Bryant
dfd951ed9b
[CI/Build] Add error matching for ruff output (#9513)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-19 05:42:20 +00:00
Joe Runde
82c25151ec
[Doc] update gpu-memory-utilization flag docs (#9507)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-19 11:26:36 +08:00
Nick Hill
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily (#9521) 2024-10-18 20:21:01 -07:00
Joe Runde
380e18639f
🐛 fix torch memory profiling (#9516)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
sasha0552
337ed76671
[Bugfix] Fix offline mode when using mistral_common (#9457) 2024-10-18 18:12:32 -07:00
Thomas Parnell
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores (#9497)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-10-18 17:55:48 -07:00
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Kunjan
9bb10a7d27
[MISC] Add lora requests to metrics (#9477)
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
Michael Goin
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format (#9036) 2024-10-18 13:29:56 -06:00
Russell Bryant
67a7e5ef38
[CI/Build] Add error matching config for mypy (#9512) 2024-10-18 12:17:53 -07:00
Cyrus Leung
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Russell Bryant
7dbe738d65
[Misc] benchmark: Add option to set max concurrency (#9390)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-18 11:15:28 -07:00
Tyler Michael Smith
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py (#9505) 2024-10-18 16:59:19 +00:00
Cyrus Leung
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer (#9504) 2024-10-19 00:09:35 +08:00
Nick Hill
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming (#9475) 2024-10-18 14:10:26 +00:00
tomeras91
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser (#9154) 2024-10-18 10:27:48 +00:00
Nick Hill
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473) 2024-10-18 07:19:53 +00:00
Russell Bryant
944dd8edaf
[CI/Build] Use commit hash references for github actions (#9430) 2024-10-17 21:54:58 -07:00
Haoyu Wang
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 (#9467) 2024-10-18 04:40:14 +00:00
Joe Runde
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Dipika Sikka
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing (#9381) 2024-10-17 18:54:00 -07:00
Robert Shaw
343f8e0905
Support BERTModel (first encoder-only embedding model) (#9056)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
Shashwat Srijan
bb76538bbd
[Hardwware][Neuron] Simplify model load for transformers-neuronx library (#9380) 2024-10-17 15:39:39 -07:00
sasha0552
d615b5c9f8
[Bugfix] Print warnings related to mistral_common tokenizer only once (#9468) 2024-10-17 21:44:20 +00:00
Kai Wu
d65049daab
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script (#9013)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-17 21:11:11 +00:00
bnellnm
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType (#9299) 2024-10-17 19:08:34 +00:00
Luka Govedič
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism (#9300) 2024-10-17 18:36:37 +00:00
Cyrus Leung
7871659abb
[Misc] Remove commit id file (#9470) 2024-10-17 10:34:37 -07:00
Daniele
a2c71c5405
[CI/Build] remove .github from .dockerignore, add dirty repo check (#9375) v0.6.3.post1 2024-10-17 10:25:06 -07:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
Li, Jiang
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344) 2024-10-17 12:21:04 -04:00
Woosuk Kwon
8e1cddcd44
[TPU] Call torch._sync(param) during weight loading (#9437) 2024-10-17 09:00:11 -07:00
sasha0552
5e443b594f
[Bugfix] Allow prefill of assistant response when using mistral_common (#9446) 2024-10-17 15:06:37 +00:00
Lucas Wilkinson
9d30a056e7
[misc] CUDA Time Layerwise Profiler (#8337)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-10-17 10:36:09 -04:00
Cyrus Leung
390be74649
[Misc] Print stack trace using logger.exception (#9461) 2024-10-17 13:55:48 +00:00
Lucas Wilkinson
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors (#9395) 2024-10-17 09:48:26 -04:00
Yuan Tang
dbfa8d31d5
Add notes on the use of Slack (#9442) 2024-10-17 04:46:46 +00:00
rasmith
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391) 2024-10-17 01:34:06 +00:00
Tyler Michael Smith
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel (#9425) 2024-10-16 23:46:06 +00:00
Russell Bryant
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-16 22:55:59 +00:00
Lily Liu
8345045833
[Performance][Spec Decode] Optimize ngram lookup performance (#9333) 2024-10-16 13:37:45 -06:00
Junhao Li
5b8a1fde84
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft (#9396) 2024-10-16 16:40:24 +00:00
Mor Zusman
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189) 2024-10-16 12:12:43 -04:00
Patrick von Platen
415f76a9cb
Support mistral interleaved attn (#9414) 2024-10-16 13:28:30 +00:00
Isotr0py
cf1d62a644
[Model] Support SDPA attention for Molmo vision backbone (#9410) 2024-10-16 11:52:01 +00:00
Roger Wang
59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models (#9412)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-16 11:20:51 +00:00
Cyrus Leung
cee711fdbb
[Core] Rename input data types (#8688) 2024-10-16 10:49:37 +00:00
Cyrus Leung
1de76a0e55
[CI/Build] Test VLM embeddings (#9406) 2024-10-16 09:44:30 +00:00