1760 Commits

Author SHA1 Message Date
saumya-saran
b28298f2f4
[Bugfix] Validate SamplingParam n is an int (#8548) 2024-09-20 12:46:02 -07:00
Niklas Muennighoff
3b63de9353
[Model] Add OLMoE (#7922) 2024-09-20 09:31:41 -07:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management (#6315) 2024-09-20 06:20:56 +00:00
William Lin
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474) 2024-09-19 20:49:54 -07:00
Amit Garg
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference (#8571) 2024-09-20 08:54:02 +08:00
盏一
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p (#8619) 2024-09-19 18:28:25 +00:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
Nick Hill
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too (#8584) 2024-09-19 12:51:06 -04:00
Roger Wang
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593) 2024-09-19 04:14:28 +00:00
sroy745
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) 2024-09-19 02:24:15 +00:00
Joe Runde
0d47bf3bf4
[Bugfix] add dead_error property to engine client (#8574)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-18 22:10:01 +00:00
Nick Hill
d9cd78eb71
[BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572) 2024-09-18 20:17:55 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
Gregory Shtrasberg
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 10:41:08 -07:00
Geun, Lim
e18749ff09
[Model] Support Solar Model (#8386)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-18 11:04:00 -06:00
Russell Bryant
d65798f78c
[Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-09-18 16:10:27 +00:00
afeldman-nm
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step (#8199) 2024-09-18 08:38:43 -07:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Jiaxin Shan
e351572900
[Misc] Add argument to disable FastAPI docs (#8554) 2024-09-18 09:51:59 +00:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
Joe Runde
98f9713399
[Bugfix] Fix TP > 1 for new granite (#8544)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-17 23:17:08 +00:00
Nick Hill
56c3de018c
[Misc] Don't dump contents of kvcache tensors on errors (#8527) 2024-09-17 12:24:29 -07:00
Patrick von Platen
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-17 17:50:37 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) 2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) 2024-09-17 07:35:01 -07:00
Rui Qiao
cbdb252259
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (#8509)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-09-17 00:06:26 -07:00
youkaichao
99aa4eddaf
[torch.compile] register allreduce operations as custom ops (#8526) 2024-09-16 22:57:57 -07:00
Roger Wang
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521) 2024-09-16 22:22:45 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing (#8525)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Simon Mo
546034b466
[refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Kevin Lin
47f5e03b5b
[Bugfix] Bind api server port before starting engine (#8491) 2024-09-16 13:56:28 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues (#8492) 2024-09-16 09:33:46 -07:00
lewtun
837c1968f9
[Frontend] Expose revision arg in OpenAI server (#8501) 2024-09-16 15:55:26 +00:00
ElizaWszola
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00
Chris
3724d5f6b5
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (#8490) 2024-09-15 04:20:05 +00:00
Woosuk Kwon
50e9ec41fc
[TPU] Implement multi-step scheduling (#8489) 2024-09-14 16:58:31 -07:00
youkaichao
47790f3e32
[torch.compile] add a flag to disable custom op (#8488) 2024-09-14 13:07:16 -07:00
youkaichao
a36e070dad
[torch.compile] fix functionalization (#8480) 2024-09-14 09:46:04 -07:00
ywfang
8a0cf1ddc3
[Model] support minicpm3 (#8297)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-14 14:50:26 +00:00
Charlie Fu
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) 2024-09-13 17:01:11 -07:00
Kunshang Ji
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 (#8365)
Co-authored-by: Yan Ma <yan.ma@intel.com>
2024-09-13 16:54:34 -07:00
Simon Mo
9ba0817ff1
bump version to v0.6.1.post2 (#8473) 2024-09-13 11:35:00 -07:00
Nick Hill
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming (#8468) 2024-09-13 11:31:12 -07:00
youkaichao
0a4806f0a9
[plugin][torch.compile] allow to add custom compile backend (#8445) 2024-09-13 09:32:42 -07:00
Jee Jee Li
06311e2956
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442) 2024-09-13 07:58:28 +00:00
Simon Mo
acda0b35d0
bump version to v0.6.1.post1 (#8440) 2024-09-12 21:39:49 -07:00