Skip to content

Releases: huggingface/peft

LoRA+, VB-LoRA, and more

25 Sep 12:11
f0b066e
Compare
Choose a tag to compare

peft-v0 13 0

Highlights

New methods

LoRA+

@kallewoof added LoRA+ to PEFT (#1915). This is a function that allows to initialize an optimizer with settings that are better suited for training a LoRA adapter.

VB-LoRA

@leo-yangli added a new method to PEFT called VB-LoRA (#2039). The idea is to have LoRA layers be composed from a single vector bank (hence "VB") that is shared among all layers. This makes VB-LoRA extremely parameter efficient and the checkpoints especially small (comparable to the VeRA method), while still promising good fine-tuning performance. Check the VB-LoRA docs and example.

Enhancements

New Hugging Face team member @ariG23498 added the helper function rescale_adapter_scale to PEFT (#1951). Use this context manager to temporarily increase or decrease the scaling of the LoRA adapter of a model. It also works for PEFT adapters loaded directly into a transformers or diffusers model.

@ariG23498 also added DoRA support for embedding layers (#2006). So if you're using the use_dora=True option in the LoraConfig, you can now also target embedding layers.

For some time now, we support inference with batches that are using different adapters for different samples, so e.g. sample 1-5 use "adapter1" and samples 6-10 use "adapter2". However, this only worked for LoRA layers so far. @saeid93 extended this to also work with layers targeted by modules_to_save (#1990).

When loading a PEFT adapter, you now have the option to pass low_cpu_mem_usage=True (#1961). This will initialize the adapter with empty weights ("meta" device) before loading the weights instead of initializing on CPU or GPU. This can speed up loading PEFT adapters. So use this option especially if you have a lot of adapters to load at the same time or if these adapters are very big. Please let us know if you encounter issues with this option, as we may make this the default in the future.

Changes

Safe loading of PyTorch weights

Unless indicated otherwise, PEFT adapters are saved and loaded using the secure safetensors format. However, we also support the PyTorch format for checkpoints, which relies on the inherently insecure pickle protocol from Python. In the future, PyTorch will be more strict when loading these files to improve security by making the option weights_only=True the default. This is generally recommended and should not cause any trouble with PEFT checkpoints, which is why with this release, PEFT will enable this by default. Please open an issue if this causes trouble.

What's Changed

Read more

v0.12.0: New methods OLoRA, X-LoRA, FourierFT, HRA, and much more

24 Jul 11:55
e6cd24c
Compare
Choose a tag to compare

Highlights

peft-v0 12 0

New methods

OLoRA

@tokenizer-decode added support for a new LoRA initialization strategy called OLoRA (#1828). With this initialization option, the LoRA weights are initialized to be orthonormal, which promises to improve training convergence. Similar to PiSSA, this can also be applied to models quantized with bitsandbytes. Check out the accompanying OLoRA examples.

X-LoRA

@EricLBuehler added the X-LoRA method to PEFT (#1491). This is a mixture of experts approach that combines the strength of multiple pre-trained LoRA adapters. Documentation has yet to be added but check out the X-LoRA tests for how to use it.

FourierFT

@Phoveran, @zqgao22, @Chaos96, and @DSAILatHKUST added discrete Fourier transform fine-tuning to PEFT (#1838). This method promises to match LoRA in terms of performance while reducing the number of parameters even further. Check out the included FourierFT notebook.

HRA

@DaShenZi721 added support for Householder Reflection Adaptation (#1864). This method bridges the gap between low rank adapters like LoRA on the one hand and orthogonal fine-tuning techniques such as OFT and BOFT on the other. As such, it is interesting for both LLMs and image generation models. Check out the HRA example on how to perform DreamBooth fine-tuning.

Enhancements

  • IA³ now supports merging of multiple adapters via the add_weighted_adapter method thanks to @alexrs (#1701).
  • Call peft_model.get_layer_status() and peft_model.get_model_status() to get an overview of the layer/model status of the PEFT model. This can be especially helpful when dealing with multiple adapters or for debugging purposes. More information can be found in the docs (#1743).
  • DoRA now supports FSDP training, including with bitsandbytes quantization, aka QDoRA ()#1806).
  • VeRA has been extended by @dkopi to support targeting layers with different weight shapes (#1817).
  • @kallewoof added the possibility for ephemeral GPU offloading. For now, this is only implemented for loading DoRA models, which can be sped up considerably for big models at the cost of a bit of extra VRAM (#1857).
  • Experimental: It is now possible to tell PEFT to use your custom LoRA layers through dynamic dispatching. Use this, for instance, to add LoRA layers for thus far unsupported layer types without the need to first create a PR on PEFT (but contributions are still welcome!) (#1875).

Examples

Changes

Casting of the adapter dtype

Important: If the base model is loaded in float16 (fp16) or bfloat16 (bf16), PEFT now autocasts adapter weights to float32 (fp32) instead of using the dtype of the base model (#1706). This requires more memory than previously but stabilizes training, so it's the more sensible default. To prevent this, pass autocast_adapter_dtype=False when calling get_peft_model, PeftModel.from_pretrained, or PeftModel.load_adapter.

Adapter device placement

The logic of device placement when loading multiple adapters on the same model has been changed (#1742). Previously, PEFT would move all adapters to the device of the base model. Now, only the newly loaded/created adapter is moved to the base model's device. This allows users to have more fine-grained control over the adapter devices, e.g. allowing them to offload unused adapters to CPU more easily.

PiSSA

  • Calling save_pretrained with the convert_pissa_to_lora argument is deprecated, the argument was renamed to path_initial_model_for_weight_conversion (#1828). Also, calling this no longer deletes the original adapter (#1933).
  • Using weight conversion (path_initial_model_for_weight_conversion) while also using use_rslora=True and rank_pattern or alpha_pattern now raises an error (#1930). This used not to raise but inference would return incorrect outputs. We also warn about this setting during initialization.

Call for contributions

We are now making sure to tag appropriate issues with the contributions welcome label. If you are looking for a way to contribute to PEFT, check out these issues.

What's Changed

Read more

v0.11.1

17 May 12:55
Compare
Choose a tag to compare

Patch release v0.11.1

Fix a bug that could lead to C++ compilation errors after importing PEFT (#1738 #1739).

Full Changelog: v0.11.0...v0.11.1

v0.11.0: New PEFT methods BOFT, VeRA, PiSSA, quantization with HQQ and EETQ, and more

16 May 09:53
0649947
Compare
Choose a tag to compare

Highlights

peft-v0 11 0

New methods

BOFT

Thanks to @yfeng95, @Zeju1997, and @YuliangXiu, PEFT was extended with BOFT: Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization (#1326, BOFT paper link). In PEFT v0.7.0, we already added OFT, but BOFT is even more parameter efficient. Check out the included BOFT controlnet and BOFT dreambooth examples.

VeRA

If the parameter reduction of LoRA is not enough for your use case, you should take a close look at VeRA: Vector-based Random Matrix Adaptation (#1564, VeRA paper link). This method resembles LoRA but adds two learnable scaling vectors to the two LoRA weight matrices. However, the LoRA weights themselves are shared across all layers, considerably reducing the number of trainable parameters.

The bulk of this PR was implemented by contributor @vvvm23 with the help of @dkopi.

PiSSA

PiSSA, Principal Singular values and Singular vectors Adaptation, is a new initialization method for LoRA, which was added by @fxmeng (#1626, PiSSA paper link). The improved initialization promises to speed up convergence and improve the final performance of LoRA models. When using models quantized with bitsandbytes, PiSSA initialization should reduce the quantization error, similar to LoftQ.

Quantization

HQQ

Thanks to @fahadh4ilyas, PEFT LoRA linear layers now support Half-Quadratic Quantization, HQQ (#1618, HQQ repo). HQQ is fast and efficient (down to 2 bits), while not requiring calibration data.

EETQ

Another new quantization method supported in PEFT is Easy & Efficient Quantization for Transformers, EETQ (#1675, EETQ repo). This 8 bit quantization method works for LoRA linear layers and should be faster than bitsandbytes.

Show adapter layer and model status

We added a feature to show adapter layer and model status of PEFT models in #1663. With the newly added methods, you can easily check what adapters exist on your model, whether gradients are active, whether they are enabled, which ones are active or merged. You will also be informed if irregularities have been detected.

To use this new feature, call model.get_layer_status() for layer-level information, and model.get_model_status() for model-level information. For more details, check out our docs on layer and model status.

Changes

Edge case of how we deal with modules_to_save

We had the issue that when we were using classes such as PeftModelForSequenceClassification, we implicitly added the classifier layers to model.modules_to_save. However, this would only add a new ModulesToSaveWrapper instance for the first adapter being initialized. When initializing a 2nd adapter via model.add_adapter, this information was ignored. Now, peft_config.modules_to_save is updated explicitly to add the classifier layers (#1615). This is a departure from how this worked previously, but it reflects the intended behavior better.

Furthermore, when merging together multiple LoRA adapters using model.add_weighted_adapter, if these adapters had modules_to_save, the original parameters of these modules would be used. This is unexpected and will most likely result in bad outputs. As there is no clear way to merge these modules, we decided to raise an error in this case (#1615).

What's Changed

Read more

v0.10.0: Fine-tune larger QLoRA models with DeepSpeed and FSDP, layer replication, enhance DoRA

21 Mar 10:20
8221246
Compare
Choose a tag to compare

Highlights

image

Support for QLoRA with DeepSpeed ZeRO3 and FSDP

We added a couple of changes to allow QLoRA to work with DeepSpeed ZeRO3 and Fully Sharded Data Parallel (FSDP). For instance, this allows you to fine-tune a 70B Llama model on two GPUs with 24GB memory each. Besides the latest version of PEFT, this requires bitsandbytes>=0.43.0, accelerate>=0.28.0, transformers>4.38.2, trl>0.7.11. Check out our docs on DeepSpeed and FSDP with PEFT, as well as this blogpost from answer.ai, for more details.

Layer replication

First time contributor @siddartha-RE added support for layer replication with LoRA. This allows you to duplicate layers of a model and apply LoRA adapters to them. Since the base weights are shared, this costs only very little extra memory, but can lead to a nice improvement of model performance. Find out more in our docs.

Improving DoRA

Last release, we added the option to enable DoRA in PEFT by simply adding use_dora=True to your LoraConfig. However, this only worked for non-quantized linear layers. With this PEFT release, we now also support Conv2d layers, as well as linear layers quantized with bitsandbytes.

Mixed LoRA adapter batches

If you have a PEFT model with multiple LoRA adapters attached to it, it's now possible to apply different adapters (or, in fact, no adapter) on different samples in the same batch. To do this, pass a list of adapter names as an additional argument. For example, if you have a batch of three samples:

output = model(**inputs, adapter_names=["adapter1", "adapter2", "__base__"])`

Here, "adapter1" and "adapter2" should be the same name as your corresponding LoRA adapters and "__base__" is a special name that refers to the base model without any adapter. Find more details in our docs.

Without this feature, if you wanted to run inference with different LoRA adapters, you'd have to use single samples or try to group batches with the same adapter, then switch between adapters using set_adapter -- this is inefficient and inconvenient. Therefore, it is recommended to use this new, faster method from now on when encountering this scenario.

New LoftQ initialization function

We added an alternative way to initialize LoRA weights for a quantized model using the LoftQ method, which can be more convenient than the existing method. Right now, using LoftQ requires you to go through multiple steps as shown here. Furthermore, it's necessary to keep a separate copy of the quantized weights, as those are not identical to the quantized weights from the default model.

Using the new replace_lora_weights_loftq function, it's now possible to apply LoftQ initialization in a single step and without the need for extra copies of the weights. Check out the docs and this example notebook to see how it works. Right now, this method only supports 4bit quantization with bitsandbytes, and the model has to be stored in the safetensors format.

Deprecations

The function prepare_model_for_int8_training was deprecated for quite some time and is now removed completely. Use prepare_model_for_kbit_training instead.

What's Changed

Besides these highlights, we added many small improvements and fixed a couple of bugs. All these changes are listed below. As always, we thank all the awesome contributors who helped us improve PEFT.

New Contributors

Full Changelog: v0.9.0...v0.10.0

v0.9.0: Merging LoRA weights, new quantization options, DoRA support, and more

28 Feb 10:37
7e5335d
Compare
Choose a tag to compare

Highlights

New methods for merging LoRA weights together

cat_teapot

With PR #1364, we added new methods for merging LoRA weights together. This is not about merging LoRA weights into the base model. Instead, this is about merging the weights from different LoRA adapters into a single adapter by calling add_weighted_adapter. This allows you to combine the strength from multiple LoRA adapters into a single adapter, while being faster than activating each of these adapters individually.

Although this feature has already existed in PEFT for some time, we have added new merging methods that promise much better results. The first is based on TIES, the second on DARE and a new one inspired by both called Magnitude Prune. If you haven't tried these new methods, or haven't touched the LoRA weight merging feature at all, you can find more information here:

AWQ and AQLM support for LoRA

Via #1394, we now support AutoAWQ in PEFT. This is a new method for 4bit quantization of model weights.

Screenshot 2024-02-28 at 09 41 40

Similarly, we now support AQLM via #1476. This method allows to quantize weights to as low as 2 bits. Both methods support quantizing nn.Linear layers. To find out more about all the quantization options that work with PEFT, check out our docs here.

Screenshot 2024-02-28 at 09 42 22

Note these integrations do not support merge_and_unload() yet, meaning for inference you need to always attach the adapter weights into the base model

DoRA support

We now support Weight-Decomposed Low-Rank Adaptation aka DoRA via #1474. This new method is builds on top of LoRA and has shown very promising results. Especially at lower ranks (e.g. r=8), it should perform much better than LoRA. Right now, only non-quantized nn.Linear layers are supported. If you'd like to give it a try, just pass use_dora=True to your LoraConfig and you're good to go.

Documentation

Thanks to @stevhliu and many other contributors, there have been big improvements to the documentation. You should find it more organized and more up-to-date. Our DeepSpeed and FSDP guides have also been much improved.

Check out our improved docs if you haven't already!

Development

If you're implementing custom adapter layers, for instance a custom LoraLayer, note that all subclasses should now implement update_layer -- unless they want to use the default method by the parent class. In particular, this means you should no longer use different method names for the subclass, like update_layer_embedding. Also, we generally don't permit ranks (r) of 0 anymore. For more, see this PR.

Developers should have an easier time now since we fully embrace ruff. If you're the type of person who forgets to call make style before pushing to a PR, consider adding a pre-commit hook. Tests are now a bit less verbose by using plain asserts and generally embracing pytest features more fully. All of this comes thanks to @akx.

What's Changed

On top of these changes, we have added a lot of small changes since the last release, check out the full changes below. As always, we had a lot of support by many contributors, you're awesome!

Read more

Release v0.8.2

01 Feb 14:16
Compare
Choose a tag to compare

What's Changed

New Contributors

Full Changelog: v0.8.1...v0.8.2

Patch Release v0.8.1

30 Jan 10:48
5e4aa7e
Compare
Choose a tag to compare

This is a small patch release of PEFT that should:

  • Fix breaking change related to support for saving resized embedding layers and Diffusers models. Contributed by @younesbelkada in #1414

What's Changed

Full Changelog: v0.8.0...v0.8.1

v0.8.0: Poly PEFT method, LoRA improvements, Documentation improvements and more

30 Jan 06:59
30889ef
Compare
Choose a tag to compare

Highlights

Poly PEFT method

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists of pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (𝙿𝚘𝚕𝚢) jointly learns an inventory of adapters and a routing function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation. To put simply, you can think of it as Mixture of Expert Adapters.
𝙼𝙷𝚁 (Multi-Head Routing) combines subsets of adapter parameters and outperforms 𝙿𝚘𝚕𝚢 under a comparable parameter budget; by only fine-tuning the routing function and not the adapters (𝙼𝙷𝚁-z) they achieve competitive performance with extreme parameter efficiency.

LoRA improvements

Now, you can specify all-linear to target_modules param of LoraConfig to target all the linear layers which has shown to perform better in QLoRA paper than only targeting query and valuer attention layers

  • Add an option 'ALL' to include all linear layers as target modules by @SumanthRH in #1295

Embedding layers of base models are now automatically saved when the embedding layers are resized when fine-tuning with PEFT approaches like LoRA. This enables extending the vocabulary of tokenizer to include special tokens. This is a common use-case when doing the following:

  1. Instruction finetuning with new tokens being added such as <|user|>, <|assistant|>, <|system|>, <|im_end|>, <|im_start|>, </s>, <s> to properly format the conversations
  2. Finetuning on a specific language wherein language specific tokens are added, e.g., Korean tokens being added to vocabulary for finetuning LLM on Korean datasets.
  3. Instruction finetuning to return outputs in a certain format to enable agent behaviour of new tokens such as <|FUNCTIONS|>, <|BROWSE|>, <|TEXT2IMAGE|>, <|ASR|>, <|TTS|>, <|GENERATECODE|>, <|RAG|>.
    A good blogpost to learn more about this https://www.philschmid.de/fine-tune-llms-in-2024-with-trl.
  • save the embeddings even when they aren't targetted but resized by @pacman100 in #1383

New option use_rslora in LoraConfig. Use it for ranks greater than 32 and see the increase in fine-tuning performance (same or better performance for ranks lower than 32 as well).

Documentation improvements

  • Refactoring and updating of the concept guides. [docs] Concept guides by @stevhliu in #1269
  • Improving task guides to focus more on how to use different PEFT methods and related nuances instead of focusing more on different type of tasks. It condenses the individual guides into a single one to highlight the commonalities and differences, and to refer to existing docs to avoid duplication. [docs] Task guides by @stevhliu in #1332
  • DOC: Update docstring for the config classes by @BenjaminBossan in #1343
  • LoftQ: edit README.md and example files by @yxli2123 in #1276
  • [Docs] make add_weighted_adapter example clear in the docs. by @sayakpaul in #1353
  • DOC Add PeftMixedModel to API docs by @BenjaminBossan in #1354
  • [docs] Docstring link by @stevhliu in #1356
  • QOL improvements and doc updates by @pacman100 in #1318
  • Doc about AdaLoraModel.update_and_allocate by @kuronekosaiko in #1341
  • DOC: Improve target modules description by @BenjaminBossan in #1290
  • DOC Troubleshooting for unscaling error with fp16 by @BenjaminBossan in #1336
  • DOC Extending the vocab and storing embeddings by @BenjaminBossan in #1335
  • Improve documentation for the all-linear flag by @SumanthRH in #1357
  • Fix various typos in LoftQ docs. by @arnavgarg1 in #1408

What's Changed

Read more

v0.7.1 patch release

12 Dec 17:22
67a0800
Compare
Choose a tag to compare

This is a small patch release of PEFT that should handle:

  • Issues with loading multiple adapters when using quantized models (#1243)
  • Issues with transformers v4.36 and some prompt learning methods (#1252)

What's Changed

New Contributors

Full Changelog: v0.7.0...v0.7.1