S-LoRA: Serving Thousands of Concurrent LoRA Adapters


[Submitted on 6 Nov 2023 (v1), last revised 7 Nov 2023 (this version, v2)]

Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica


Abstract: The “pretrain-then-finetune” paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at this https URL
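
The Unified Paging idea in the abstract, one pool holding both adapter weights of different ranks and KV cache entries of different lengths, can be illustrated with a toy allocator. This is only a sketch and not the S-LoRA implementation: the class name UnifiedPagePool, its methods, and the sizing rules (a rank-r adapter takes r pages, each cached token takes one page) are assumptions made for illustration, and real pages would be GPU memory blocks rather than list indices.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by adapter weights and KV cache.

    Hypothetical illustration only; pages are plain integers, not GPU memory.
    """

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owners = {}  # owner key -> list of page indices it holds

    def allocate(self, owner, num_pages):
        """Reserve num_pages pages (not necessarily contiguous) for owner."""
        if num_pages > len(self.free_pages):
            raise MemoryError(
                f"need {num_pages} pages, only {len(self.free_pages)} free"
            )
        pages = [self.free_pages.pop() for _ in range(num_pages)]
        self.owners.setdefault(owner, []).extend(pages)
        return pages

    def release(self, owner):
        """Return every page held by owner to the free list."""
        self.free_pages.extend(self.owners.pop(owner, []))

    def load_adapter(self, adapter_id, rank):
        # Assumption: a rank-r adapter occupies r pages.
        return self.allocate(("adapter", adapter_id), rank)

    def extend_kv_cache(self, request_id, new_tokens):
        # Assumption: each cached token occupies one page.
        return self.allocate(("kv", request_id), new_tokens)


pool = UnifiedPagePool(num_pages=64)
pool.load_adapter("adapter-a", rank=8)
pool.load_adapter("adapter-b", rank=32)
pool.extend_kv_cache("req-1", new_tokens=20)
pool.release(("kv", "req-1"))            # request finished, pages become reusable
pool.load_adapter("adapter-c", rank=16)  # reuses pages freed by the KV cache
print(len(pool.free_pages))              # 8
```

Because every allocation is a set of same-sized pages, memory freed by a finished request can be handed to an adapter of any rank (or the reverse) without leaving unusable gaps, which is the fragmentation concern the abstract mentions.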

Submission history

From: Ying Sheng
[v1] Mon, 6 Nov 2023 17:26:17 UTC (259 KB)
[v2] Tue, 7 Nov 2023 06:59:33 UTC (274 KB)
