## Transformers – Improving Natural Language Processing with Attention Mechanisms

This lecture does **NOT** build Transformers from scratch. Instead, it traces how Transformers evolved from RNNs.

In [None]:
!pip install transformers datasets lightning

In [None]:
from IPython.display import Image

### Review of RNN



### Adding an attention mechanism to RNNs

#### Attention helps RNNs with accessing information

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig1.png)

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig2.png)

### The original attention mechanism for RNNs


![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig3.png)

### Processing the inputs using a bidirectional RNN

1. **Bidirectional RNN #1**:  
   - The first RNN (RNN #1) in the attention-based model is bidirectional, generating **context vectors** $ c_i $.  
   - A context vector is an enhanced version of the input vector $ x(i) $, incorporating information from the entire input sequence using an **attention mechanism**.

2. **Processing the Sequence**:  
   - RNN #1 processes the input sequence both **forward** (1 → $ T $) and **backward** ($ T $ → 1).  
   - This bidirectional approach captures dependencies between current inputs and sequence elements that occur **before and after**.

3. **Hidden State Construction**:  
   - Each input element has two hidden states:  
     - $ h_F(i) $: from the forward pass.  
     - $ h_B(i) $: from the backward pass.  
   - These are concatenated to form a **composite hidden state** $ h(i) $.  
     - Example: If $ h_F(i) $ and $ h_B(i) $ are 128-dimensional, the concatenated $ h(i) $ has 256 dimensions.

4. **Purpose of Hidden States**:  
   - The concatenated $ h(i) $ represents an **annotation** of the source word, containing context from both directions.

5. **Role in Attention Mechanism**:  
   - RNN #2 uses these context vectors prepared by RNN #1 to generate outputs, which will be explored further in the next section.  
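The concatenation of forward and backward hidden states described in step 3 can be sketched with PyTorch's built-in bidirectional RNN. The sizes below (8 time steps, 16-dimensional inputs, 128 hidden units per direction) are illustrative assumptions, and `torch.nn.RNN` merely stands in for whichever recurrent cell a real model would use.

```python
import torch

torch.manual_seed(123)

T, input_dim, hidden_dim = 8, 16, 128  # assumed sizes, for illustration only

# bidirectional=True runs one forward and one backward pass over the sequence
rnn_1 = torch.nn.RNN(input_dim, hidden_dim, bidirectional=True, batch_first=True)

x = torch.randn(1, T, input_dim)   # (batch, time, features)
annotations, _ = rnn_1(x)          # (batch, time, 2 * hidden_dim)

# The last dimension holds [h_F(i), h_B(i)] concatenated
h_forward = annotations[..., :hidden_dim]
h_backward = annotations[..., hidden_dim:]

print(annotations.shape)  # torch.Size([1, 8, 256])
```

Each of the 8 positions ends up with a 256-dimensional annotation, matching the 128 + 128 example above.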


### Generating outputs from context vectors

1. **RNN #2 Overview**:  
   - RNN #2 is the **main RNN** responsible for generating the outputs.  
   - It receives **context vectors** $ c_i $ and hidden states as inputs.

2. **Context Vector Computation**:  
   - A context vector $ c_i $ is a **weighted sum** of the concatenated hidden states $ h(1), h(2), ..., h(T) $ from RNN #1.  
   - Formula:  
     $$
     c_i = \sum_{j=1}^T \alpha_{ij} h(j)
     $$  
   - $ \alpha_{ij} $: **Attention weights**, indicating the importance of input sequence element $ j $ for the current output $ i $.  
   - Each output $ i $ has a unique set of attention weights.

3. **Inputs to RNN #2**:  
   - At each time step $ i $, RNN #2 takes:  
     - The **context vector** $ c_i $.  
     - The **previous hidden state** $ s(i-1) $.  
     - The **previous target word** $ y(i-1) $ (during training).  

4. **Training vs. Inference**:  
   - **During training**: The correct word $ y(i-1) $ is fed into the next state.  
   - **During inference**: The predicted output $ o(i-1) $ is used instead.

5. **Output Generation**:  
   - Using the inputs above, RNN #2 generates the predicted output $ o(i) $ for the target word $ y(i) $.

6. **Key Takeaways**:  
   - **RNN #1**: Prepares context vectors using attention over the input sequence.  
   - **RNN #2**: Combines context vectors, hidden states, and previous outputs to generate predictions.  
   - The computation of attention weights $ \alpha_{ij} $ will be discussed in the next section.  
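A single step of RNN #2 can be sketched as follows. This is only one plausible wiring under stated assumptions: the tensor sizes are made up, the attention weights are random stand-ins, and feeding the concatenation of $ c_i $ and the previous word embedding into a `torch.nn.RNNCell` is an illustrative choice rather than the exact formulation of the original model.

```python
import torch

torch.manual_seed(123)

T, annot_dim, embed_dim, state_dim = 8, 256, 16, 64  # assumed sizes

h = torch.randn(T, annot_dim)                    # annotations h(1), ..., h(T) from RNN #1
alpha_i = torch.softmax(torch.randn(T), dim=0)   # stand-in attention weights for output i

# Context vector: c_i = sum_j alpha_ij * h(j)
c_i = alpha_i @ h

# Embedding of y(i-1) during training, or of the prediction o(i-1) during inference
y_prev = torch.randn(embed_dim)
s_prev = torch.zeros(state_dim)                  # previous hidden state s(i-1)

# One recurrent step; the cell input is the concatenation [c_i, y(i-1)]
rnn_2_cell = torch.nn.RNNCell(annot_dim + embed_dim, state_dim)
s_i = rnn_2_cell(
    torch.cat([c_i, y_prev]).unsqueeze(0),  # add a batch dimension
    s_prev.unsqueeze(0),
).squeeze(0)

print(c_i.shape, s_i.shape)  # torch.Size([256]) torch.Size([64])
```

The only difference between training and inference in this sketch is where `y_prev` comes from: the ground-truth word or the model's own previous prediction.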


### Computing the attention weights

1. **Attention Weights**:  
   - Attention weights $ \alpha_{ij} $ connect **inputs (annotations)** to **outputs (contexts)**.  
   - **Subscripts**:  
     - $ j $: Input index.  
     - $ i $: Output index.  
   - **Alignment Scores**:  
     - Attention weights are normalized versions of the alignment scores $ e_{ij} $.  
     - Formula:  
       $$
       \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^T \exp(e_{ik})}
       $$  
       - This is the softmax function applied over the input index, which ensures $ \sum_{j=1}^T \alpha_{ij} = 1 $.

2. **Purpose of Attention Weights**:  
   - Evaluate how well input at position $ j $ aligns with output at position $ i $.  
   - Guide the model in emphasizing relevant input elements when generating outputs.

3. **Three Key Components of the Attention-Based RNN**:  
   1. **Input Annotation (RNN #1)**:  
      - Computes **bidirectional annotations** of the input sequence.  
   2. **Recurrent Block (RNN #2)**:  
      - Similar to a regular RNN but uses **context vectors** instead of raw inputs.  
   3. **Attention Mechanism**:  
      - Computes attention weights and context vectors, linking each input-output pair.  

4. **Comparison with Transformer Models**:  
   - Transformers **do not use recurrence**.  
   - Instead, they rely solely on a **self-attention mechanism**.  
   - Process the entire input sequence at once, unlike the step-by-step processing in RNNs.  

5. **Next Steps**:  
   - Explore the **self-attention mechanism**, a foundation for the transformer architecture.  
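The normalization in step 1 can be checked numerically. The sketch below uses random numbers as stand-ins for the alignment scores $ e_{ij} $ (how the scores themselves are produced is not covered here) and assumes 5 output positions and 8 input positions.

```python
import torch

torch.manual_seed(123)

num_outputs, T = 5, 8             # assumed sizes
e = torch.randn(num_outputs, T)   # stand-in alignment scores e_ij

# alpha_ij = exp(e_ij) / sum_k exp(e_ik): normalize over the input index
alpha = torch.exp(e) / torch.exp(e).sum(dim=1, keepdim=True)

# The manual formula agrees with softmax along the input dimension
print(torch.allclose(alpha, torch.softmax(e, dim=1)))

# Every output position i has weights that sum to 1
print(alpha.sum(dim=1))
```

In practice `torch.softmax` is preferable to the manual formula because it is numerically stable for large scores.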


### Introducing the self-attention mechanism

### Starting with a basic form of self-attention

- Assume we have an input sentence that we encoded via a dictionary, which maps the words to integers as discussed in the RNN chapter:

In [None]:
import torch


# input sequence / sentence:
#  "Can you help me to translate this sentence"

sentence = torch.tensor(
    [0, # can
     7, # you
     1, # help
     2, # me
     5, # to
     6, # translate
     4, # this
     3] # sentence
)

sentence

tensor([0, 7, 1, 2, 5, 6, 4, 3])

- Next, assume we have an embedding of the words, i.e., the words are represented as real vectors.
- Since we have 8 words, there will be 8 vectors. Each vector is 16-dimensional:

In [None]:
torch.manual_seed(123)
embed = torch.nn.Embedding(10, 16)
embedded_sentence = embed(sentence).detach()
embedded_sentence.shape

torch.Size([8, 16])

- The goal is to compute the context vectors $\boldsymbol{z}^{(i)}=\sum_{j=1}^{T} \alpha_{i j} \boldsymbol{x}^{(j)}$, which involve attention weights $\alpha_{i j}$.
- In turn, the attention weights $\alpha_{i j}$ are computed from unnormalized scores $\omega_{i j}$.
- Let's start with the $\omega_{i j}$'s first, which are computed as dot-products:

$$\omega_{i j}=\boldsymbol{x}^{(i)^{\top}} \boldsymbol{x}^{(j)}$$



In [None]:
omega = torch.empty(8, 8)

for i, x_i in enumerate(embedded_sentence):
    for j, x_j in enumerate(embedded_sentence):
        omega[i, j] = torch.dot(x_i, x_j)

- Actually, let's compute this more efficiently by replacing the nested for-loops with a matrix multiplication:

In [None]:
omega_mat = embedded_sentence.matmul(embedded_sentence.T)

In [None]:
torch.allclose(omega_mat, omega)

True

- Next, let's compute the attention weights by normalizing the "omega" values so they sum to 1

$$\alpha_{i j}=\frac{\exp \left(\omega_{i j}\right)}{\sum_{k=1}^{T} \exp \left(\omega_{i k}\right)}=\operatorname{softmax}\left(\left[\omega_{i j}\right]_{j=1 \ldots T}\right)$$

$$\sum_{j=1}^{T} \alpha_{i j}=1$$

In [None]:
import torch.nn.functional as F

attention_weights = F.softmax(omega, dim=1)
attention_weights.shape

torch.Size([8, 8])

- We can confirm that each row sums to one (summing over `dim=1` adds up the entries across the columns of each row):

In [None]:
attention_weights.sum(dim=1)

tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig4.png)

- Now that we have the attention weights, we can compute the context vectors $\boldsymbol{z}^{(i)}=\sum_{j=1}^{T} \alpha_{i j} \boldsymbol{x}^{(j)}$ as attention-weighted sums of the input embeddings.
- For instance, to compute the context vector of the 2nd input element (the element at index 1), we can perform the following computation:

In [None]:
x_2 = embedded_sentence[1, :]
context_vec_2 = torch.zeros(x_2.shape)
for j in range(8):
    x_j = embedded_sentence[j, :]
    context_vec_2 += attention_weights[1, j] * x_j

context_vec_2

tensor([-9.3975e-01, -4.6856e-01,  1.0311e+00, -2.8192e-01,  4.9373e-01,
        -1.2896e-02, -2.7327e-01, -7.6358e-01,  1.3958e+00, -9.9543e-01,
        -7.1287e-04,  1.2449e+00, -7.8077e-02,  1.2765e+00, -1.4589e+00,
        -2.1601e+00])

- Or, more efficiently, using linear algebra and matrix multiplication:

In [None]:
context_vectors = torch.matmul(
        attention_weights, embedded_sentence)


torch.allclose(context_vec_2, context_vectors[1])

True
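The steps of this basic self-attention (dot products, row-wise softmax, weighted sum) can be collected into one small function. The function name `basic_self_attention` and the random input are assumptions made for this sketch; the logic mirrors the cells above.

```python
import torch
import torch.nn.functional as F


def basic_self_attention(x):
    """Unparameterized self-attention for a (T, d) matrix of embeddings."""
    omega = x @ x.T                   # (T, T) pairwise dot products
    alpha = F.softmax(omega, dim=1)   # normalize each row to sum to 1
    z = alpha @ x                     # (T, d) context vectors
    return z, alpha


torch.manual_seed(123)
x = torch.randn(8, 16)   # stand-in for the embedded sentence
z, alpha = basic_self_attention(x)

print(z.shape)           # torch.Size([8, 16])
print(alpha.sum(dim=1))
```

Note that this version has no learnable parameters: the attention pattern is fully determined by the embeddings themselves.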

### Parameterizing the self-attention mechanism: scaled dot-product attention

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig5.png)

In [None]:
torch.manual_seed(123)

d = embedded_sentence.shape[1]
U_query = torch.rand(d, d)
U_key = torch.rand(d, d)
U_value = torch.rand(d, d)

In [None]:
x_2 = embedded_sentence[1]
query_2 = U_query.matmul(x_2)

In [None]:
key_2 = U_key.matmul(x_2)
value_2 = U_value.matmul(x_2)

In [None]:
keys = U_key.matmul(embedded_sentence.T).T
torch.allclose(key_2, keys[1])

True

In [None]:
values = U_value.matmul(embedded_sentence.T).T
torch.allclose(value_2, values[1])

True

In [None]:
omega_23 = query_2.dot(keys[2])
omega_23

tensor(14.3667)

In [None]:
omega_2 = query_2.matmul(keys.T)
omega_2

tensor([-25.1623,   9.3602,  14.3667,  32.1482,  53.8976,  46.6626,  -1.2131,
        -32.9392])

In [None]:
attention_weights_2 = F.softmax(omega_2 / d**0.5, dim=0)
attention_weights_2

tensor([2.2317e-09, 1.2499e-05, 4.3696e-05, 3.7242e-03, 8.5596e-01, 1.4026e-01,
        8.8897e-07, 3.1935e-10])

In [None]:
#context_vector_2nd = torch.zeros(values[1, :].shape)
#for j in range(8):
#    context_vector_2nd += attention_weights_2[j] * values[j, :]

#context_vector_2nd

In [None]:
context_vector_2 = attention_weights_2.matmul(values)
context_vector_2

tensor([-1.2226, -3.4387, -4.3928, -5.2125, -1.1249, -3.3041, -1.4316, -3.2765,
        -2.5114, -2.6105, -1.5793, -2.8433, -2.4142, -0.3998, -1.9917, -3.3499])
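The cells above compute the context vector for the second input element only. As a sketch, the same scaled dot-product attention can be applied to all positions at once. The random input `x` and the freshly sampled projection matrices are assumptions for this self-contained example, so the numbers differ from the outputs shown above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

T, d = 8, 16
x = torch.randn(T, d)       # stand-in for the embedded sentence
U_query = torch.rand(d, d)
U_key = torch.rand(d, d)
U_value = torch.rand(d, d)

# Row i of each matrix equals U @ x_i
queries = x @ U_query.T
keys = x @ U_key.T
values = x @ U_value.T

omega = queries @ keys.T                               # (T, T) unnormalized scores
attention_weights = F.softmax(omega / d**0.5, dim=1)   # scale by sqrt(d), then normalize rows
context_vectors = attention_weights @ values           # (T, d)

# Cross-check position 2 (index 1) against the per-element computation
query_2 = U_query @ x[1]
weights_2 = F.softmax((query_2 @ keys.T) / d**0.5, dim=0)
print(torch.allclose(weights_2 @ values, context_vectors[1], rtol=1e-4, atol=1e-4))
```

Dividing by $\sqrt{d}$ keeps the scores in a range where the softmax does not saturate as the embedding dimension grows.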

### Attention is all we need: introducing the original transformer architecture


![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig6.png)

### Encoding context embeddings via multi-head attention

In [None]:
torch.manual_seed(123)

d = embedded_sentence.shape[1]
one_U_query = torch.rand(d, d)

In [None]:
h = 8
multihead_U_query = torch.rand(h, d, d)
multihead_U_key = torch.rand(h, d, d)
multihead_U_value = torch.rand(h, d, d)

In [None]:
multihead_query_2 = multihead_U_query.matmul(x_2)
multihead_query_2.shape

torch.Size([8, 16])

In [None]:
multihead_key_2 = multihead_U_key.matmul(x_2)
multihead_value_2 = multihead_U_value.matmul(x_2)

In [None]:
multihead_key_2[2]

tensor([-1.9619, -0.7701, -0.7280, -1.6840, -1.0801, -1.6778,  0.6763,  0.6547,
         1.4445, -2.7016, -1.1364, -1.1204, -2.4430, -0.5982, -0.8292, -1.4401])

In [None]:
stacked_inputs = embedded_sentence.T.repeat(8, 1, 1)
stacked_inputs.shape

torch.Size([8, 16, 8])

In [None]:
multihead_keys = torch.bmm(multihead_U_key, stacked_inputs)
multihead_keys.shape

torch.Size([8, 16, 8])

In [None]:
multihead_keys = multihead_keys.permute(0, 2, 1)
multihead_keys.shape

torch.Size([8, 8, 16])

In [None]:
multihead_keys[2, 1] # index: [3rd attention head (index 2), 2nd key (index 1)]

tensor([-1.9619, -0.7701, -0.7280, -1.6840, -1.0801, -1.6778,  0.6763,  0.6547,
         1.4445, -2.7016, -1.1364, -1.1204, -2.4430, -0.5982, -0.8292, -1.4401])

In [None]:
multihead_values = torch.matmul(multihead_U_value, stacked_inputs)
multihead_values = multihead_values.permute(0, 2, 1)

In [None]:
# Placeholder: random values stand in for the 8 per-head context vectors of the 2nd input
multihead_z_2 = torch.rand(8, 16)

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig7.png)

In [None]:
linear = torch.nn.Linear(8*16, 16)
context_vector_2 = linear(multihead_z_2.flatten())
context_vector_2.shape

torch.Size([16])
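The cell that defines `multihead_z_2` uses random numbers in place of the per-head context vectors. A sketch of computing them for real is shown below. It follows the notebook's simplification that every head uses full `d x d` projection matrices (actual transformer implementations typically give each head a smaller dimension), and the random input `x` is an assumption for self-containment.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

T, d, h = 8, 16, 8
x = torch.randn(T, d)   # stand-in for the embedded sentence

multihead_U_query = torch.rand(h, d, d)
multihead_U_key = torch.rand(h, d, d)
multihead_U_value = torch.rand(h, d, d)

# (h, d, d) @ (d, T) broadcasts to (h, d, T); permute to (h, T, d)
queries = (multihead_U_query @ x.T).permute(0, 2, 1)
keys = (multihead_U_key @ x.T).permute(0, 2, 1)
values = (multihead_U_value @ x.T).permute(0, 2, 1)

# Scaled dot-product attention, computed independently in each head
scores = queries @ keys.transpose(1, 2) / d**0.5   # (h, T, T)
weights = F.softmax(scores, dim=-1)
z = weights @ values                               # (h, T, d)

# Concatenate the h context vectors of the 2nd input and project back to d dimensions
multihead_z_2 = z[:, 1, :]                         # (h, d)
linear = torch.nn.Linear(h * d, d)
context_vector_2 = linear(multihead_z_2.flatten())
print(context_vector_2.shape)                      # torch.Size([16])
```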

### Learning a language model: decoder and masked multi-head attention

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig8.png)
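As a minimal sketch of the masking idea (single head, random stand-in queries, keys, and values, all of which are assumptions), future positions can be hidden by setting their scores to negative infinity before the softmax, so that position $i$ attends only to positions $j \le i$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)

T, d = 5, 16
queries = torch.randn(T, d)
keys = torch.randn(T, d)
values = torch.randn(T, d)

scores = queries @ keys.T / d**0.5

# True above the diagonal marks "future" positions j > i
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# exp(-inf) = 0, so masked positions receive zero attention weight
masked_scores = scores.masked_fill(future, float("-inf"))
weights = F.softmax(masked_scores, dim=1)
context = weights @ values

print(weights)  # lower-triangular: row i has nonzero entries only for j <= i
```

The first row attends only to itself, and each later row spreads its weight over the current and earlier positions.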

### Implementation details: positional encodings and layer normalization

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig9.png)
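The sketch below illustrates both ideas under the assumption that the figure refers to the sinusoidal positional encodings of the original transformer, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$. The helper name `sinusoidal_positional_encoding` and the tensor sizes are hypothetical.

```python
import math

import torch


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sine/cosine position encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    # 1 / 10000^(2i / d_model) for each even feature index 2i
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
    return pe


torch.manual_seed(123)
seq_len, d_model = 8, 16
embeddings = torch.randn(seq_len, d_model)   # stand-in token embeddings
pe = sinusoidal_positional_encoding(seq_len, d_model)

# Layer normalization standardizes each position's feature vector
layer_norm = torch.nn.LayerNorm(d_model)
normalized = layer_norm(embeddings + pe)

print(pe.shape, normalized.shape)
```

Because self-attention by itself is insensitive to token order, adding `pe` is what lets the model distinguish positions.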

### Building large-scale language models by leveraging unlabeled data
### Pre-training and fine-tuning transformer models

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig10.png)

### Leveraging unlabeled data with GPT

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig11.png)


#### Using GPT-2 to generate new text


In [None]:
from transformers import pipeline, set_seed


generator = pipeline('text-generation', model='gpt2')
set_seed(123)
generator("Hey readers, today is",
          max_length=20,
          num_return_sequences=3)

[{'generated_text': 'Hey readers, today is the third day in a row where I am starting to get a little fed'},
 {'generated_text': 'Hey readers, today is a very important weekend, and thanks to all of you, will be a'},
 {'generated_text': 'Hey readers, today is the third day of the New Year after I posted a series on the Internet'}]

In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Let us encode this sentence"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input

{'input_ids': tensor([[ 5756,   514, 37773,   428,  6827]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

In [None]:
from transformers import GPT2Model
model = GPT2Model.from_pretrained('gpt2')

### Bidirectional pre-training with BERT

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig12.png)


![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig13.png)

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig14.png)


### BART model

![](https://raw.githubusercontent.com/cfteach/NNDL_DATA621/refs/heads/webpage-src/DATA621/DATA621/images/transformers/Fig15.png)

### Finetuning a DistilBERT Classifier Using the Lightning Trainer



In [None]:
!pip install datasets



In [None]:
# 1 Loading the Dataset
from datasets import load_dataset
imdb_data = load_dataset("imdb")
print(imdb_data)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


The IMDB movie review set can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, `cd` into the download directory, and execute

    tar -zxf aclImdb_v1.tar.gz

B) If you are working with Windows, download an archiver such as 7Zip to extract the files from the download archive.

C) Use the following code to download and unzip the dataset via Python


In [None]:
import os
import sys
import tarfile
import time
import urllib.request

source = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
target = "aclImdb_v1.tar.gz"

if os.path.exists(target):
    os.remove(target)


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.0**2 * duration)
    percent = count * block_size * 100.0 / total_size

    sys.stdout.write(
        f"\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB "
        f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()


if not os.path.isdir("aclImdb") and not os.path.isfile("aclImdb_v1.tar.gz"):
    urllib.request.urlretrieve(source, target, reporthook)

100% | 80.23 MB | 3.07 MB/s | 26.16 sec elapsed

In [None]:
if not os.path.isdir("aclImdb"):

    with tarfile.open(target, "r:gz") as tar:
        tar.extractall()

In [None]:
# convert dataframe and save as csv

import os
import sys

import numpy as np
import pandas as pd
from packaging import version
from tqdm import tqdm

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = "aclImdb"

labels = {"pos": 1, "neg": 0}

df = pd.DataFrame()

with tqdm(total=50000) as pbar:
    for s in ("test", "train"):
        for l in ("pos", "neg"):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                    txt = infile.read()

                if version.parse(pd.__version__) >= version.parse("1.3.2"):
                    x = pd.DataFrame(
                        [[txt, labels[l]]], columns=["review", "sentiment"]
                    )
                    # ignore_index=True keeps row labels unique, so the later shuffle via reindex works
                    df = pd.concat([df, x], ignore_index=True)

                else:
                    df = df.append([[txt, labels[l]]], ignore_index=True)
                pbar.update()
df.columns = ["text", "label"]

100%|██████████| 50000/50000 [01:01<00:00, 807.83it/s]


In [None]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

### Basic checks

In [None]:
print("Class distribution:")
np.bincount(df["label"].values)

Class distribution:


array([25000, 25000])

In [None]:
text_len = df["text"].apply(lambda x: len(x.split()))
text_len.min(), text_len.median(), text_len.max()

(4, 173.0, 2470)

### Splitting into training, validation, and test sets

In [None]:
df_shuffled = df.sample(frac=1, random_state=1).reset_index()

df_train = df_shuffled.iloc[:35_000]
df_val = df_shuffled.iloc[35_000:40_000]
df_test = df_shuffled.iloc[40_000:]

df_train.to_csv("train.csv", index=False, encoding="utf-8")
df_val.to_csv("validation.csv", index=False, encoding="utf-8")
df_test.to_csv("test.csv", index=False, encoding="utf-8")
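The same shuffle-then-slice pattern can be tried on a toy DataFrame to confirm that the three partitions are disjoint and cover all rows. The toy data and the 70/10/20 sizes below are assumptions chosen to mirror the 35,000/5,000/10,000 proportions; unlike the cell above, this sketch passes `drop=True` to `reset_index` so the old index is not kept as an extra column.

```python
import pandas as pd

# Toy stand-in for the 50,000-review DataFrame
toy = pd.DataFrame(
    {
        "text": [f"review {i}" for i in range(100)],
        "label": [i % 2 for i in range(100)],
    }
)

# Shuffle all rows reproducibly, then slice by position
shuffled = toy.sample(frac=1, random_state=1).reset_index(drop=True)
train = shuffled.iloc[:70]
val = shuffled.iloc[70:80]
test = shuffled.iloc[80:]

print(len(train), len(val), len(test))  # 70 10 20

# The split is random, not stratified, so class balance varies slightly per partition
for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, part["label"].mean())
```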

### Tokenization and numericalization

In [None]:
imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "validation": "validation.csv",
        "test": "test.csv",
    },
)

print(imdb_dataset)

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 35000
    })
    validation: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 10000
    })
})


Tokenize the dataset

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)

Tokenizer input max length: 512
Tokenizer vocabulary size: 30522


In [None]:
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)
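To make the effect of `truncation=True` and `padding=True` concrete, here is a toy re-implementation on plain lists of token IDs. It is not the Hugging Face implementation: the function name `pad_and_truncate`, the token IDs, and the pad ID of 0 are made up for illustration, and the real tokenizer additionally takes care of special tokens when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy version of truncation + padding for a batch of token-id lists."""
    # Truncation: cut every sequence to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask is 1 for real tokens and 0 for padding
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 7, 8, 9], [5, 7, 8, 9, 10, 11, 12], [5, 6]]
out = pad_and_truncate(batch, max_length=5)

print(out["input_ids"])
# [[5, 7, 8, 9, 0], [5, 7, 8, 9, 10], [5, 6, 0, 0, 0]]
print(out["attention_mask"])
# [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
```

Padding is relative to the longest sequence in the batch being tokenized, so the batch size used when mapping the function over the dataset determines how much padding each example receives.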

In [None]:
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)

In [None]:
#del imdb_dataset

In [None]:
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Setup dataloaders

In [None]:
from torch.utils.data import DataLoader, Dataset


class IMDBDataset(Dataset):
    def __init__(self, dataset_dict, partition_key="train"):
        self.partition = dataset_dict[partition_key]

    def __getitem__(self, index):
        return self.partition[index]

    def __len__(self):
        return self.partition.num_rows

In [None]:
train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=12,
    num_workers=4
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    num_workers=4
)
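Since each dataset item is a dictionary of tensors, the `DataLoader`'s default collation stacks the entries key by key. The sketch below shows this with a hypothetical `ToyDictDataset` of random tensors (10 examples of length 6, both assumed), using `num_workers=0` to keep it lightweight.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDictDataset(Dataset):
    """Hypothetical dataset returning dict items like the tokenized IMDB data."""

    def __init__(self, n=10, seq_len=6):
        self.input_ids = torch.randint(0, 100, (n, seq_len))
        self.attention_mask = torch.ones(n, seq_len, dtype=torch.long)
        self.label = torch.randint(0, 2, (n,))

    def __getitem__(self, index):
        return {
            "input_ids": self.input_ids[index],
            "attention_mask": self.attention_mask[index],
            "label": self.label[index],
        }

    def __len__(self):
        return len(self.label)


torch.manual_seed(123)
loader = DataLoader(ToyDictDataset(), batch_size=4, shuffle=False, num_workers=0)

# The default collate function stacks each dictionary entry along a new batch dimension
batch = next(iter(loader))
print({key: tuple(value.shape) for key, value in batch.items()})
# {'input_ids': (4, 6), 'attention_mask': (4, 6), 'label': (4,)}
```

This is why the training code can index a batch with `batch["input_ids"]`, `batch["attention_mask"]`, and `batch["label"]`.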



### Initializing DistilBERT

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Finetuning with lightning

In [None]:
# wrap in lightning module

import lightning as L
import torch
import torchmetrics


class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("train_loss", outputs["loss"])
        return outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("val_loss", outputs["loss"], prog_bar=True)

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.val_acc(predicted_labels, batch["label"])
        self.log("val_acc", self.val_acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.test_acc(predicted_labels, batch["label"])
        self.log("accuracy", self.test_acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer


lightning_model = LightningModel(model)
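The validation and test steps turn logits into class predictions with `argmax` and then accumulate accuracy. A minimal sketch of that computation in plain PyTorch, with hand-picked logits and labels as assumptions, looks like this; the cross-entropy line illustrates the standard loss for two-class logits and is not taken from the model's internals.

```python
import torch
import torch.nn.functional as F

# Hand-picked logits for 4 examples and 2 classes (illustrative values)
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.7]])
labels = torch.tensor([0, 1, 0, 0])

# The predicted class is the index of the largest logit in each row
predicted_labels = torch.argmax(logits, dim=1)
accuracy = (predicted_labels == labels).float().mean()

# Standard cross-entropy loss on the raw logits
loss = F.cross_entropy(logits, labels)

print(predicted_labels)  # tensor([0, 1, 1, 0])
print(accuracy)          # tensor(0.7500)
print(loss)
```

`torchmetrics.Accuracy` computes the same quantity but accumulates it correctly across batches, which is why the module logs the metric object rather than a per-batch value.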

In [None]:
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger


callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="my-model")

In [None]:
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="cpu",
    devices=1,
    logger=logger,
    log_every_n_steps=10,
)

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)

INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: 
  | Name     | Type                                | Params | Mode 
-------------------------------------------------------------------------
0 | model    | DistilBertForSequenceClassification | 67.0 M | eval 
1 | val_acc  | MulticlassAccuracy                  | 0      | train
2 | test_acc | MulticlassAccuracy                  | 0      | train
-------------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
2         Modules in train mode
96        Modules in eval mode

In [None]:
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")

In [None]:
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")

In [None]:
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")