PEFT - Adapter Tuning

NLP
PEFT
Fine Tuning
Parameter-efficient fine-tuning using adapters
Author

Uday

Published

September 18, 2024

Large pre-trained language models (e.g., BERT, GPT) have revolutionized NLP by leveraging massive amounts of unlabeled data. In transfer learning, these models are first pre-trained on large corpora and then fine-tuned on smaller, task-specific datasets. However, fine-tuning all the parameters of a model like BERT is computationally expensive and inefficient, particularly when there are multiple downstream tasks.

Adapter Layers

Basic Adapter Design

Figure: Basic adapter design
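
The basic adapter is a small bottleneck module inserted into each transformer layer: a down-projection, a nonlinearity, and an up-projection wrapped in a residual connection, so that only the adapter weights are trained while the pre-trained model stays frozen. Below is a minimal PyTorch sketch of such a module; the bottleneck size, GELU nonlinearity, and tensor shapes are illustrative assumptions rather than the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # d_model -> small bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # bottleneck -> d_model
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)              # near-identity initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter adds a small learned correction to x.
        return x + self.up(self.act(self.down(x)))

# During fine-tuning only the adapter (and typically layer-norm) parameters are
# updated; the pre-trained transformer weights stay frozen.
x = torch.randn(2, 16, 768)           # (batch, seq_len, hidden)
print(Adapter(d_model=768)(x).shape)  # torch.Size([2, 16, 768])
```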

Adapter Fusion

Figure: Adapter fusion
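
AdapterFusion combines several task adapters that were trained independently: at each layer, the transformer output acts as a query and the adapter outputs act as keys and values in a learned attention, which mixes the adapters per token. Here is a rough PyTorch sketch of that combination step; the projection layers and shapes are assumptions for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """Attention that mixes the outputs of several frozen task adapters (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, layer_out: torch.Tensor, adapter_outs: torch.Tensor) -> torch.Tensor:
        # layer_out:    (batch, seq, d)     output of the transformer sub-layer (query)
        # adapter_outs: (batch, seq, N, d)  outputs of the N task adapters (keys/values)
        q = self.query(layer_out).unsqueeze(2)           # (batch, seq, 1, d)
        k = self.key(adapter_outs)                       # (batch, seq, N, d)
        v = self.value(adapter_outs)                     # (batch, seq, N, d)
        weights = (q * k).sum(-1).softmax(dim=-1)        # (batch, seq, N)
        return (weights.unsqueeze(-1) * v).sum(dim=2)    # (batch, seq, d)

fusion = AdapterFusion(d_model=768)
h = torch.randn(2, 16, 768)                # transformer layer output
adapters = torch.randn(2, 16, 3, 768)      # outputs of 3 pre-trained task adapters
print(fusion(h, adapters).shape)           # torch.Size([2, 16, 768])
```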

COMPACTER (Compact Adapter)

COMPACTER combines hypercomplex adapter layers built from Kronecker products with low-rank approximation and weights shared across adapters.

Kronecker Product:

Figure: Kronecker product
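
For reference, given \(A\) of size \(p \times q\) and \(B\) of size \(k \times l\), the Kronecker product \(A \otimes B\) is the \(pk \times ql\) block matrix formed by multiplying each entry of \(A\) with the whole matrix \(B\):

\[\begin{align} A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1q}B \\ \vdots & \ddots & \vdots \\ a_{p1}B & \cdots & a_{pq}B \end{bmatrix} \end{align}\]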

Hypercomplex Adapter Layers:

Previously, the adapter layers used fully connected (FC) layers of the form below:

\[\begin{align} y = Wx + b \quad \text{where } W \text{ is } (m \times d)\\ \end{align}\]

W is replaced by a sum of Kronecker products of two sets of matrices:

\[\begin{align} W = \sum_{i=1}^n A_i \otimes B_i \\ A_i \text{ is } (n \times n) \quad , \quad B_i \text{ is } (\frac{m}{n} \times \frac{d}{n}) \\ \end{align}\]

Here n is a user-defined hyper-parameter, and both d and m must be divisible by n.

Below is an illustration of a hypercomplex adapter layer: the weight is a sum of Kronecker products of the matrices \(A_i\) and \(B_i\), here with n = 2, d = 8, m = 6.

Figure: Hypercomplex layer weight as a sum of Kronecker products (n = 2, d = 8, m = 6)

The number of parameters to tune is reduced compared to the FC layer: the \(A_i\) and \(B_i\) together contribute \(n^3 + \frac{md}{n}\) parameters versus \(md\) for the full weight matrix (32 vs. 48 in the illustration above).

This layer is a generalization of the FC layer via the hyperparameter n; with n = 1 it is equivalent to a standard fully connected layer.
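
A small NumPy check of this construction with the same sizes as the illustration (n = 2, d = 8, m = 6); the variable names are illustrative.

```python
import numpy as np

n, d, m = 2, 8, 6
rng = np.random.default_rng(0)

A = rng.standard_normal((n, n, n))            # n matrices A_i of shape (n, n)
B = rng.standard_normal((n, m // n, d // n))  # n matrices B_i of shape (m/n, d/n)

# W = sum_i A_i kron B_i has shape (m, d), the same as the FC weight it replaces.
W = sum(np.kron(A[i], B[i]) for i in range(n))
print(W.shape)                                # (6, 8)

# Trainable parameters: n * n^2 + n * (m/n) * (d/n) = n^3 + m*d/n, versus m*d for FC.
print(n**3 + m * d // n, "vs", m * d)         # 32 vs 48
```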

Low-Rank Parameterization and Sharing Information Across Adapters

\[\begin{align} W = \sum_{i=1}^n A_i \otimes B_i = \sum_{i=1}^n A_i \otimes (s_i t_i)\\ s_i \text{ is } (\frac{m}{n} \times r) \quad , \quad t_i \text{ is } (r \times \frac{d}{n}) \end{align}\]

Each \(B_i\) is factorized into two low-rank matrices \(s_i\) and \(t_i\) of rank \(r\), and in COMPACTER the \(A_i\) matrices are shared across all adapter layers, so only the low-rank factors are layer-specific.
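
A sketch of the low-rank variant, extending the NumPy snippet above; the rank r and the variable names are illustrative assumptions.

```python
import numpy as np

n, d, m, r = 2, 8, 6, 1
rng = np.random.default_rng(0)

# In COMPACTER the A_i are shared across all adapter layers, while the low-rank
# factors s_i, t_i are specific to each adapter layer.
A = rng.standard_normal((n, n, n))           # n matrices A_i of shape (n, n)
s = rng.standard_normal((n, m // n, r))      # n matrices s_i of shape (m/n, r)
t = rng.standard_normal((n, r, d // n))      # n matrices t_i of shape (r, d/n)

# W = sum_i A_i kron (s_i @ t_i), still of shape (m, d).
W = sum(np.kron(A[i], s[i] @ t[i]) for i in range(n))
print(W.shape)                               # (6, 8)

# Per-layer parameters for the B_i drop from m*d/n to r*(m + d).
print(r * (m + d), "vs", m * d // n)         # 14 vs 24
```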

References:

  1. Parameter-Efficient Transfer Learning for NLP (Houlsby et al.): https://arxiv.org/pdf/1902.00751
  2. AdapterFusion: Non-Destructive Task Composition for Transfer Learning (Pfeiffer et al.): https://arxiv.org/pdf/2005.00247
  3. COMPACTER: Efficient Low-Rank Hypercomplex Adapter Layers (Karimi Mahabadi et al.): https://arxiv.org/pdf/2102.08597