28. Transformers – Improving Natural Language Processing with Attention Mechanisms#

This lecture is NOT going to build Transformers from scratch. Instead, it covers the evolution from RNNs with attention to the Transformer architecture.

!pip install transformers datasets lightning
Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (4.46.2)
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting lightning
  Downloading lightning-2.4.0-py3-none-any.whl.metadata (38 kB)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.23.2 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.26.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (24.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2024.9.11)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.32.3)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.4.5)
Requirement already satisfied: tokenizers<0.21,>=0.20 in /usr/local/lib/python3.10/dist-packages (from transformers) (0.20.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.66.6)
Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (17.0.0)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (2.2.2)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.11.2)
Collecting lightning-utilities<2.0,>=0.10.0 (from lightning)
  Downloading lightning_utilities-0.11.9-py3-none-any.whl.metadata (5.2 kB)
Requirement already satisfied: torch<4.0,>=2.1.0 in /usr/local/lib/python3.10/dist-packages (from lightning) (2.5.1+cu121)
Collecting torchmetrics<3.0,>=0.7.0 (from lightning)
  Downloading torchmetrics-1.6.0-py3-none-any.whl.metadata (20 kB)
Requirement already satisfied: typing-extensions<6.0,>=4.4.0 in /usr/local/lib/python3.10/dist-packages (from lightning) (4.12.2)
Collecting pytorch-lightning (from lightning)
  Downloading pytorch_lightning-2.4.0-py3-none-any.whl.metadata (21 kB)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (2.4.3)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (24.2.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.5.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.1.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (0.2.0)
Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.17.2)
Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from lightning-utilities<2.0,>=0.10.0->lightning) (75.1.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2024.8.30)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch<4.0,>=2.1.0->lightning) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch<4.0,>=2.1.0->lightning) (3.1.4)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch<4.0,>=2.1.0->lightning) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch<4.0,>=2.1.0->lightning) (1.3.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch<4.0,>=2.1.0->lightning) (3.0.2)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 9.2 MB/s eta 0:00:00
Downloading lightning-2.4.0-py3-none-any.whl (810 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 811.0/811.0 kB 26.1 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 6.3 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl (179 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 179.3/179.3 kB 11.9 MB/s eta 0:00:00
Downloading lightning_utilities-0.11.9-py3-none-any.whl (28 kB)
Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 8.4 MB/s eta 0:00:00
Downloading torchmetrics-1.6.0-py3-none-any.whl (926 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 926.4/926.4 kB 31.9 MB/s eta 0:00:00
Downloading pytorch_lightning-2.4.0-py3-none-any.whl (815 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 815.2/815.2 kB 33.3 MB/s eta 0:00:00
Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 194.1/194.1 kB 12.8 MB/s eta 0:00:00
Installing collected packages: xxhash, lightning-utilities, fsspec, dill, multiprocess, torchmetrics, pytorch-lightning, datasets, lightning
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.10.0
    Uninstalling fsspec-2024.10.0:
      Successfully uninstalled fsspec-2024.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.
Successfully installed datasets-3.1.0 dill-0.3.8 fsspec-2024.9.0 lightning-2.4.0 lightning-utilities-0.11.9 multiprocess-0.70.16 pytorch-lightning-2.4.0 torchmetrics-1.6.0 xxhash-3.5.0
from IPython.display import Image

28.1. Adding an attention mechanism to RNNs#

28.2. Attention helps RNNs with accessing information#

28.3. The original attention mechanism for RNNs#

28.4. Processing the inputs using a bidirectional RNN#

  1. Bidirectional RNN #1:

    • The first RNN (RNN #1) in the attention-based model is bidirectional, generating context vectors \( c_i \).

    • A context vector is an enhanced version of the input vector \( x(i) \), incorporating information from the entire input sequence using an attention mechanism.

  2. Processing the Sequence:

    • RNN #1 processes the input sequence both forward (1 → \( T \)) and backward (\( T \) → 1).

    • This bidirectional approach captures dependencies between current inputs and sequence elements that occur before and after.

  3. Hidden State Construction:

    • Each input element has two hidden states:

      • \( h_F(i) \): from the forward pass.

      • \( h_B(i) \): from the backward pass.

    • These are concatenated to form a composite hidden state \( h(i) \).

      • Example: If \( h_F(i) \) and \( h_B(i) \) are each 128-dimensional, the concatenated \( h(i) \) has 256 dimensions (see the code sketch after this list).

  4. Purpose of Hidden States:

    • The concatenated \( h(i) \) represents an annotation of the source word, containing context from both directions.

  5. Role in Attention Mechanism:

    • RNN #2 uses these context vectors prepared by RNN #1 to generate outputs, which will be explored further in the next section.
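Before moving on, here is a minimal sketch (illustrative sizes, not from the lecture; a GRU stands in for RNN #1) of how a bidirectional RNN yields the concatenated annotations \( h(i) \):

import torch
import torch.nn as nn

# Bidirectional GRU standing in for RNN #1 (hidden size 128 is an assumption)
torch.manual_seed(123)
rnn1 = nn.GRU(input_size=16, hidden_size=128, bidirectional=True, batch_first=True)

x = torch.randn(1, 8, 16)     # 1 sentence, T=8 tokens, 16-dimensional embeddings
annotations, _ = rnn1(x)      # forward and backward hidden states are concatenated

annotations.shape             # torch.Size([1, 8, 256]) -> each h(i) is 256-dimensional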

28.5. Generating outputs from context vectors#

  1. RNN #2 Overview:

    • RNN #2 is the main RNN responsible for generating the outputs.

    • It receives context vectors \( c_i \) and hidden states as inputs.

  2. Context Vector Computation:

    • A context vector \( c_i \) is a weighted sum of the concatenated hidden states \( h(1), h(2), ..., h(T) \) from RNN #1.

    • Formula:
      \( c_i = \sum_{j=1}^T \alpha_{ij} h(j) \)

    • \( \alpha_{ij} \): Attention weights, indicating the importance of input sequence element \( j \) for the current output \( i \).

    • Each output \( i \) has its own set of attention weights (a small numeric sketch follows this list).

  3. Inputs to RNN #2:

    • At each time step \( i \), RNN #2 takes:

      • The context vector \( c_i \).

      • The previous hidden state \( s(i-1) \).

      • The previous target word \( y(i-1) \) (during training).

  4. Training vs. Inference:

    • During training: The correct word \( y(i-1) \) is fed into the next state.

    • During inference: The predicted output \( o(i-1) \) is used instead.

  5. Output Generation:

    • Using the inputs above, RNN #2 generates the predicted output \( o(i) \) for the target word \( y(i) \).

  6. Key Takeaways:

    • RNN #1: Prepares context vectors using attention over the input sequence.

    • RNN #2: Combines context vectors, hidden states, and previous outputs to generate predictions.

    • The computation of attention weights \( \alpha_{ij} \) will be discussed in the next section.
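As a quick numerical illustration before that discussion, the context vector \( c_i \) is just a weighted sum of the annotations, with weights that sum to 1 (random values stand in for RNN #1's annotations and alignment scores here):

import torch
import torch.nn.functional as F

torch.manual_seed(123)
T, hidden = 8, 256                     # illustrative sizes
h = torch.randn(T, hidden)             # concatenated annotations h(1), ..., h(T)
e_i = torch.randn(T)                   # alignment scores for one output position i
alpha_i = F.softmax(e_i, dim=0)        # attention weights for output i

c_i = (alpha_i.unsqueeze(1) * h).sum(dim=0)   # weighted sum over the annotations
c_i.shape, alpha_i.sum()               # shape [256]; the weights sum to 1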

28.6. Computing the attention weights#

  1. Attention Weights:

    • Attention weights \( \alpha_{ij} \) connect inputs (annotations) to outputs (contexts).

    • Subscripts:

      • \( j \): Input index.

      • \( i \): Output index.

    • Alignment Scores:

      • Attention weights are normalized alignment scores \( e_{ij} \).

      • Formula:
        \( \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^T \exp(e_{ik})} \)

        • This is the softmax function, which ensures \( \sum_{j=1}^T \alpha_{ij} = 1 \).

  2. Purpose of Attention Weights:

    • Evaluate how well input at position \( j \) aligns with output at position \( i \).

    • Guide the model in emphasizing relevant input elements when generating outputs.

  3. Three Key Components of the Attention-Based RNN:

    1. Input Annotation (RNN #1):

      • Computes bidirectional annotations of the input sequence.

    2. Recurrent Block (RNN #2):

      • Similar to a regular RNN but uses context vectors instead of raw inputs.

    3. Attention Mechanism:

      • Computes attention weights and context vectors, linking each input-output pair.

  4. Comparison with Transformer Models:

    • Transformers do not use recurrence.

    • Instead, they rely solely on a self-attention mechanism.

    • Process the entire input sequence at once, unlike the step-by-step processing in RNNs.

  5. Next Steps:

    • Explore the self-attention mechanism, a foundation for the transformer architecture.

28.7. Introducing the self-attention mechanism#

28.8. Starting with a basic form of self-attention#

  • Assume we have an input sentence that we encoded via a dictionary, which maps the words to integers as discussed in the RNN chapter:

import torch


# input sequence / sentence:
#  "Can you help me to translate this sentence"

sentence = torch.tensor(
    [0, # can
     7, # you
     1, # help
     2, # me
     5, # to
     6, # translate
     4, # this
     3] # sentence
)

sentence
tensor([0, 7, 1, 2, 5, 6, 4, 3])
  • Next, assume we have an embedding of the words, i.e., the words are represented as real vectors.

  • Since we have 8 words, there will be 8 vectors. Each vector is 16-dimensional:

torch.manual_seed(123)
embed = torch.nn.Embedding(10, 16)
embedded_sentence = embed(sentence).detach()
embedded_sentence
tensor([[ 3.3737e-01, -1.7778e-01, -3.0353e-01, -5.8801e-01,  3.4861e-01,
          6.6034e-01, -2.1964e-01, -3.7917e-01,  7.6711e-01, -1.1925e+00,
          6.9835e-01, -1.4097e+00,  1.7938e-01,  1.8951e+00,  4.9545e-01,
          2.6920e-01],
        [-9.4053e-01, -4.6806e-01,  1.0322e+00, -2.8300e-01,  4.9275e-01,
         -1.4078e-02, -2.7466e-01, -7.6409e-01,  1.3966e+00, -9.9491e-01,
         -1.5822e-03,  1.2471e+00, -7.7105e-02,  1.2774e+00, -1.4596e+00,
         -2.1595e+00],
        [-7.7020e-02, -1.0205e+00, -1.6896e-01,  9.1776e-01,  1.5810e+00,
          1.3010e+00,  1.2753e+00, -2.0095e-01,  4.9647e-01, -1.5723e+00,
          9.6657e-01, -1.1481e+00, -1.1589e+00,  3.2547e-01, -6.3151e-01,
         -2.8400e+00],
        [-1.3250e+00,  1.7843e-01, -2.1338e+00,  1.0524e+00, -3.8848e-01,
         -9.3435e-01, -4.9914e-01, -1.0867e+00,  8.8054e-01,  1.5542e+00,
          6.2662e-01, -1.7549e-01,  9.8284e-02, -9.3507e-02,  2.6621e-01,
         -5.8504e-01],
        [ 2.5529e-01, -5.4963e-01,  1.0042e+00,  8.2723e-01, -3.9481e-01,
          4.8923e-01, -2.1681e-01, -1.7472e+00, -1.6025e+00, -1.0764e+00,
          9.0315e-01, -7.2184e-01, -5.9508e-01, -7.1122e-01,  6.2296e-01,
         -1.3729e+00],
        [-2.2150e+00, -1.3193e+00, -2.0915e+00,  9.6285e-01, -3.1861e-02,
         -4.7896e-01,  7.6681e-01,  2.7468e-02,  1.9929e+00,  1.3708e+00,
         -5.0087e-01, -2.7928e-01, -2.0628e+00,  6.3745e-03, -9.8955e-01,
          7.0161e-01],
        [ 5.1463e-01,  9.9376e-01, -2.5873e-01, -1.0826e+00, -4.4382e-02,
          1.6236e+00, -2.3229e+00,  1.0878e+00,  6.7155e-01,  6.9330e-01,
         -9.4872e-01, -7.6507e-02, -1.5264e-01,  1.1674e-01,  4.4026e-01,
         -1.4465e+00],
        [ 8.7684e-01,  1.6221e+00, -1.4779e+00,  1.1331e+00, -1.2203e+00,
          1.3139e+00,  1.0533e+00,  1.3881e-01,  2.2473e+00, -8.0364e-01,
         -2.8084e-01,  7.6968e-01, -6.5956e-01, -7.9793e-01,  1.8383e-01,
          2.2935e-01]])
  • The goal is to compute the context vectors \(\boldsymbol{z}^{(i)}=\sum_{j=1}^{T} \alpha_{i j} \boldsymbol{x}^{(j)}\), which involve attention weights \(\alpha_{i j}\).

  • In turn, the attention weights \(\alpha_{i j}\) involve the \(\omega_{i j}\) values

  • Let’s start with the \(\omega_{i j}\)’s first, which are computed as dot-products:

\[\omega_{i j}=\boldsymbol{x}^{(i)^{\top}} \boldsymbol{x}^{(j)}\]
omega = torch.empty(8, 8)

for i, x_i in enumerate(embedded_sentence):
    for j, x_j in enumerate(embedded_sentence):
        omega[i, j] = torch.dot(x_i, x_j)
  • Actually, let’s compute this more efficiently by replacing the nested for-loops with a matrix multiplication:

omega_mat = embedded_sentence.matmul(embedded_sentence.T)
torch.allclose(omega_mat, omega)
True
  • Next, let’s compute the attention weights by normalizing the “omega” values so they sum to 1

\[\alpha_{i j}=\frac{\exp \left(\omega_{i j}\right)}{\sum_{k=1}^{T} \exp \left(\omega_{i k}\right)}=\operatorname{softmax}\left(\left[\omega_{i j}\right]_{j=1 \ldots T}\right)\]
\[\sum_{j=1}^{T} \alpha_{i j}=1\]
import torch.nn.functional as F

attention_weights = F.softmax(omega, dim=1)
attention_weights.shape
torch.Size([8, 8])
  • We can confirm that each row of the attention matrix sums up to one:

attention_weights.sum(dim=1)

  • Now that we have the attention weights, we can compute the context vectors \(\boldsymbol{z}^{(i)}=\sum_{j=1}^{T} \alpha_{i j} \boldsymbol{x}^{(j)}\)

  • For instance, to compute the context-vector of the 2nd input element (the element at index 1), we can perform the following computation:

x_2 = embedded_sentence[1, :]
context_vec_2 = torch.zeros(x_2.shape)
for j in range(8):
    x_j = embedded_sentence[j, :]
    context_vec_2 += attention_weights[1, j] * x_j

context_vec_2.shape
torch.Size([16])
  • Or, more efficiently, using linear algebra and matrix multiplication:

context_vectors = torch.matmul(
        attention_weights, embedded_sentence)


torch.allclose(context_vec_2, context_vectors[1])
True

28.9. Parameterizing the self-attention mechanism: scaled dot-product attention#

torch.manual_seed(123)

d = embedded_sentence.shape[1]
U_query = torch.rand(d, d)
U_key = torch.rand(d, d)
U_value = torch.rand(d, d)

U_query.shape, U_key.shape, U_value.shape
(torch.Size([16, 16]), torch.Size([16, 16]), torch.Size([16, 16]))
x_2 = embedded_sentence[1]
query_2 = U_query.matmul(x_2)
key_2 = U_key.matmul(x_2)
value_2 = U_value.matmul(x_2)
keys = U_key.matmul(embedded_sentence.T).T
torch.allclose(key_2, keys[1])
True
values = U_value.matmul(embedded_sentence.T).T
torch.allclose(value_2, values[1])
True
omega_23 = query_2.dot(keys[2])
omega_23
tensor(14.3667)
omega_2 = query_2.matmul(keys.T)
omega_2
tensor([-25.1623,   9.3602,  14.3667,  32.1482,  53.8976,  46.6626,  -1.2131,
        -32.9392])
attention_weights_2 = F.softmax(omega_2 / d**0.5, dim=0)
attention_weights_2
tensor([2.2317e-09, 1.2499e-05, 4.3696e-05, 3.7242e-03, 8.5596e-01, 1.4026e-01,
        8.8897e-07, 3.1935e-10])
#context_vector_2nd = torch.zeros(values[1, :].shape)
#for j in range(8):
#    context_vector_2nd += attention_weights_2[j] * values[j, :]

#context_vector_2nd
context_vector_2 = attention_weights_2.matmul(values)
context_vector_2.shape
torch.Size([16])

28.10. Attention is all we need: introducing the original transformer architecture#

28.10.1. Encoding context embeddings via multi-head attention#

torch.manual_seed(123)

d = embedded_sentence.shape[1]
one_U_query = torch.rand(d, d)
h = 8
multihead_U_query = torch.rand(h, d, d)
multihead_U_key = torch.rand(h, d, d)
multihead_U_value = torch.rand(h, d, d)
multihead_query_2 = multihead_U_query.matmul(x_2)
multihead_query_2.shape
torch.Size([8, 16])
multihead_key_2 = multihead_U_key.matmul(x_2)
multihead_value_2 = multihead_U_value.matmul(x_2)
multihead_key_2[2]
tensor([-0.8076, -0.8896, -0.3299, -2.8832,  1.6619, -0.1659, -2.3260, -0.7933,
         0.4116, -0.5432, -0.7560, -0.6419, -1.6343, -1.5659,  0.7444, -2.5398])
stacked_inputs = embedded_sentence.T.repeat(8, 1, 1)
stacked_inputs.shape
torch.Size([8, 16, 8])
multihead_keys = torch.bmm(multihead_U_key, stacked_inputs)
multihead_keys.shape
torch.Size([8, 16, 8])
multihead_keys = multihead_keys.permute(0, 2, 1)
multihead_keys.shape
torch.Size([8, 8, 16])
multihead_keys[2, 1] # index: [3rd attention head, 2nd key]
tensor([-0.8076, -0.8896, -0.3299, -2.8832,  1.6619, -0.1659, -2.3260, -0.7933,
         0.4116, -0.5432, -0.7560, -0.6419, -1.6343, -1.5659,  0.7444, -2.5398])
multihead_values = torch.matmul(multihead_U_value, stacked_inputs)
multihead_values = multihead_values.permute(0, 2, 1)
# Placeholder for the eight head outputs z_2 (one 16-dim vector per head); in a full
# implementation these would come from scaled dot-product attention in each head
multihead_z_2 = torch.rand(8, 16)

linear = torch.nn.Linear(8*16, 16)
context_vector_2 = linear(multihead_z_2.flatten())
context_vector_2.shape
torch.Size([16])

28.11. Learning a language model: decoder and masked multi-head attention#
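The decoder may only attend to already-generated positions, which is enforced with a causal (look-ahead) mask. As a minimal sketch (not part of the original notebook), PyTorch provides a helper for this mask, which also appears in the training code later in this chapter:

import torch.nn as nn

# Positions marked -inf cannot be attended to; position i sees only positions <= i
causal_mask = nn.Transformer.generate_square_subsequent_mask(5)
causal_mask
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0., 0., -inf, -inf, -inf],
#         [0., 0., 0., -inf, -inf],
#         [0., 0., 0., 0., -inf],
#         [0., 0., 0., 0., 0.]])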

28.12. Implementation details: positional encodings and layer normalization#

28.13. Positional Encoding#

Transformers process sequences in parallel (all tokens at once) and do not have an inherent sense of the order of tokens. To provide information about the position of each token in the sequence, positional encodings are added to the input embeddings.

The equations define the sinusoidal positional encoding scheme used in the original Transformer paper (“Attention Is All You Need”).

The positional encoding for each token \(i\) at each embedding dimension \(k\) is defined as:

  1. For even indices (\(2k\)): \( PE(i, 2k) = \sin\left(\frac{i}{10000^{\frac{2k}{d_{\text{model}}}}}\right) \)

  2. For odd indices (\(2k+1\)): \( PE(i, 2k+1) = \cos\left(\frac{i}{10000^{\frac{2k}{d_{\text{model}}}}}\right) \)

Where:

  • \(i\): The position of the token in the sequence.

  • \(k\): The dimension index within the embedding.

  • \(d_{\text{model}}\): The dimensionality of the embedding space (e.g., 512 or 1024).

  • \(10000\): A scaling factor to spread the values over a wider range.


28.14. Intuition Behind the Sinusoidal Function#

The sinusoidal functions (\(\sin\) and \(\cos\)) encode token positions in a way that preserves the relative order of tokens and allows the model to infer distances between tokens.

28.14.1. Key Properties:#

  1. Unique Representations:

    • Each position \(i\) has a unique encoding across all embedding dimensions.

  2. Relative Position Encoding:

    • The difference between positions \(i\) and \(j\) is encoded in a way that is interpretable by the model.

    • This is because \(\sin(a + b)\) and \(\cos(a + b)\) have well-defined mathematical relationships.
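Concretely, write \( \omega_k = 1/10000^{2k/d_{\text{model}}} \) for the frequency of dimension pair \( k \). The angle-addition identities then give, for any fixed offset \( \delta \):

\[\begin{pmatrix} \sin(\omega_k (i+\delta)) \\ \cos(\omega_k (i+\delta)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_k \delta) & \sin(\omega_k \delta) \\ -\sin(\omega_k \delta) & \cos(\omega_k \delta) \end{pmatrix} \begin{pmatrix} \sin(\omega_k i) \\ \cos(\omega_k i) \end{pmatrix}\]

so the encoding at position \( i + \delta \) is a fixed, position-independent linear transformation (a rotation) of the encoding at position \( i \).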


28.15. Shape of Positional Encoding#

If the input sequence has \(n\) tokens, and the embedding dimension is \(d_{\text{model}}\), the positional encoding matrix has the shape: \( [n \times d_{\text{model}}] \) Each row represents the positional encoding for a specific token.


28.16. Why Use Both \(\sin\) and \(\cos\)?#

  • Alternating between \(\sin\) and \(\cos\) allows the model to encode different aspects of the position at different frequencies.

  • \(\sin\) and \(\cos\) functions with different wavelengths allow the encoding to represent both local and global positions effectively.


28.17. 6. Example Calculation#

Suppose:

  • Position \(i = 2\),

  • Embedding dimension \(d_{\text{model}} = 512\),

  • Dimension index \(k = 0\).

For the even index (\(2k = 0\)): \( PE(2, 0) = \sin\left(\frac{2}{10000^{\frac{0}{512}}}\right) = \sin(2) \)

For the odd index (\(2k+1 = 1\)): \( PE(2, 1) = \cos\left(\frac{2}{10000^{\frac{0}{512}}}\right) = \cos(2) \)

This continues for all dimensions and positions.
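We can verify this small example directly (same values as above):

import math

d_model = 512
i, k = 2, 0
pe_even = math.sin(i / 10000 ** (2 * k / d_model))   # PE(2, 0) = sin(2)
pe_odd  = math.cos(i / 10000 ** (2 * k / d_model))   # PE(2, 1) = cos(2)
round(pe_even, 4), round(pe_odd, 4)
(0.9093, -0.4161)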


28.18. 7. Why Sinusoidal Positional Encoding?#

The authors of the Transformer model chose this approach because:

  • It is deterministic and does not require learning additional parameters (unlike learned positional embeddings).

  • It generalizes well to sequences longer than those seen during training.


28.19. 8. How Positional Encodings Are Used#

The positional encodings are added directly to the token embeddings before they are fed into the Transformer: \( \text{Input to Transformer} = \text{Token Embedding} + \text{Positional Encoding} \)


import torch

# Define sentence and embedding dimensions
sentence = torch.tensor([
    0,  # can
    7,  # you
    1,  # help
    2,  # me
    5,  # to
    6,  # translate
    4,  # this
    3   # sentence
])

sequence_length = len(sentence)  # Number of tokens
embedding_size = 16  # Embedding dimension size

# Generate positional indices for each token in the sequence
positions = torch.arange(sequence_length, dtype=torch.float32).unsqueeze(1)  # Shape: [sequence_length, 1]

# Generate a range of dimensions for the embeddings
dims = torch.arange(embedding_size, dtype=torch.float32).unsqueeze(0)  # Shape: [1, embedding_size]

# Compute the positional encodings using matrix operations
div_term = 10000 ** (dims // 2 * 2 / embedding_size)  # Divisors for scaling
positional_encodings = torch.zeros(sequence_length, embedding_size)  # Placeholder

positional_encodings[:, 0::2] = torch.sin(positions / div_term[:, 0::2])  # Apply sin for even indices
positional_encodings[:, 1::2] = torch.cos(positions / div_term[:, 1::2])  # Apply cos for odd indices

positional_encodings.shape  # Display the positional encodings
torch.Size([8, 16])
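Since embedded_sentence from earlier in this chapter has the same shape, injecting positional information is just an elementwise addition (a minimal illustration, assuming that cell has been run):

# Add the positional encodings to the token embeddings before the Transformer layers
transformer_input = embedded_sentence + positional_encodings
transformer_input.shape
torch.Size([8, 16])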

28.20. Batch and Layer Normalization#
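This section is only outlined here; as a minimal illustration (shapes are assumptions), batch normalization normalizes each feature across the batch, while layer normalization, the variant used in Transformers, normalizes each sample's feature vector independently of the batch:

import torch
import torch.nn as nn

torch.manual_seed(123)
x = torch.randn(4, 16)        # (batch_size, features)

batch_norm = nn.BatchNorm1d(16)
layer_norm = nn.LayerNorm(16)

batch_norm(x).mean(dim=0)     # approximately 0 per feature (statistics over the batch)
layer_norm(x).mean(dim=1)     # approximately 0 per sample (statistics over the features)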

28.21. Building large-scale language models by leveraging unlabeled data#

28.22. Pre-training and fine-tuning transformer models#

# Download the text file
!wget https://www.gutenberg.org/cache/epub/1268/pg1268.txt -O mysterious_island.txt
--2024-12-05 18:52:38--  https://www.gutenberg.org/cache/epub/1268/pg1268.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1171514 (1.1M) [text/plain]
Saving to: ‘mysterious_island.txt’

mysterious_island.t 100%[===================>]   1.12M  2.93MB/s    in 0.4s    

2024-12-05 18:52:39 (2.93 MB/s) - ‘mysterious_island.txt’ saved [1171514/1171514]
# Load the text into memory
with open("mysterious_island.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Display the length of characters
print(f"Length of characters: {len(text)}")
# Display the first 1000 characters
print("Raw text")
print(text[:1000])

print ("\n\n------------------\n\n")

# Extract the main content
start_indx = text.find('THE MYSTERIOUS ISLAND')
end_indx = text.find('End of the Project Gutenberg')
text = text[start_indx:end_indx].lower()

# Display cleaned text
print ("Cleaned Text\n\n")
print(text[:1000])
Length of characters: 1131520
Raw text
The Project Gutenberg eBook of The Mysterious Island
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Mysterious Island

Author: Jules Verne

Release date: April 1, 1998 [eBook #1268]
                Most recently updated: October 29, 2024

Language: English

Credits: Anthony Matonak


*** START OF THE PROJECT GUTENBERG EBOOK THE MYSTERIOUS ISLAND ***

THE MYSTERIOUS ISLAND

by Jules Verne

1874




PART 1--DROPPED FROM THE CLOUDS



Chapter 1

“Are we rising again?” “No. On the contrary.” “Are we descending?”
 “Worse than that, captain! we are falling!” “For Heaven’s sake heave ou


------------------


Cleaned Text


the mysterious island ***

the mysterious island

by jules verne

1874




part 1--dropped from the clouds



chapter 1

“are we rising again?” “no. on the contrary.” “are we descending?”
 “worse than that, captain! we are falling!” “for heaven’s sake heave out
the ballast!” “there! the last sack is empty!” “does the balloon rise?”
 “no!” “i hear a noise like the dashing of waves. the sea is below the
car! it cannot be more than 500 feet from us!” “overboard with every
weight! ... everything!”

such were the loud and startling words which resounded through the air,
above the vast watery desert of the pacific, about four o’clock in the
evening of the 23rd of march, 1865.

few can possibly have forgotten the terrible storm from the northeast,
in the middle of the equinox of that year. the tempest raged without
intermission from the 18th to the 26th of march. its ravages were
terrible in america, europe, and asia, covering a distance of eighteen
hundred miles, and extending obliquely to t

28.23. How can we now tokenize these words?#

The concept of tokenization forms the foundation of all modern natural language processing (NLP) systems, including OpenAI’s models, BERT, GPT, and others. Tokenization involves breaking down text into smaller units (tokens) that the model can process. The choice of tokenization method directly impacts the model’s performance, efficiency, and applicability.

Here’s a detailed explanation of tokenization and the ideas behind it:


28.24. 1. The Base Idea Behind Tokenization#

At its core, tokenization is about representing text in a numerical format suitable for machine learning models. The primary goals are:

  • To encode text into manageable and meaningful units.

  • To balance vocabulary size and sequence length for efficient processing.

  • To preserve linguistic features like word boundaries, subword structure, or characters.


28.25. 2. Tokenization Methods#

Tokenization methods have evolved to address the challenges of representing diverse languages and large vocabularies:

28.25.1. A. Word-Level Tokenization#

  • Definition: Splits text into words based on spaces or punctuation.

  • Example:

    • Text: "The mysterious island"

    • Tokens: ["The", "mysterious", "island"]

  • Advantages:

    • Simple and intuitive.

    • Works well for English and other space-delimited languages.

  • Disadvantages:

    • Large vocabulary size.

    • Poor handling of rare or out-of-vocabulary (OOV) words.

28.25.2. B. Character-Level Tokenization#

  • Definition: Breaks text into individual characters.

  • Example:

    • Text: "The"

    • Tokens: ["T", "h", "e"]

  • Advantages:

    • Very small vocabulary.

    • No OOV issues.

  • Disadvantages:

    • Long sequence lengths.

    • Loses higher-level word meanings.

28.25.3. C. Subword Tokenization#

  • Definition: Breaks text into subword units (e.g., prefixes, roots, suffixes). Commonly used in BERT and GPT models.

  • Example:

    • Text: "mysterious"

    • Tokens: ["myst", "er", "ious"]

  • Advantages:

    • Compact vocabulary.

    • Handles rare words effectively.

    • Preserves linguistic structure.

  • Algorithms:

    • Byte Pair Encoding (BPE): Used in GPT models.

    • WordPiece: Used in BERT.

    • Unigram Language Model: Used in T5 and SentencePiece.

28.25.4. D. Byte-Level Tokenization#

  • Definition: Represents text as raw bytes rather than characters or words. Used in OpenAI’s models.

  • Example:

    • Text: "hello"

    • Tokens: Byte values like [104, 101, 108, 108, 111].

  • Advantages:

    • Universal: Handles any text or language.

    • Compact vocabulary (256 bytes).

  • Disadvantages:

    • Longer sequences for human-readable text.
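The subword and byte-level schemes above can be inspected directly with the pretrained tokenizers shipped with the transformers library installed at the top of this chapter (a quick, optional illustration; it downloads the tokenizer files, and the exact splits depend on the chosen checkpoints):

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")                # byte-level BPE

print(bert_tokenizer.tokenize("The mysterious island"))
print(gpt2_tokenizer.tokenize("The mysterious island"))

# Raw byte-level view of a string: just its UTF-8 byte values
print(list("hello".encode("utf-8")))   # [104, 101, 108, 108, 111]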



28.27. 4. Why Tokenization Matters#

  • Vocabulary Size:

    • Larger vocabularies improve accuracy but increase memory requirements.

    • Smaller vocabularies reduce complexity but may lose linguistic nuances.

  • Sequence Length:

    • Tokenization affects how long the sequences are, directly impacting model efficiency.

  • Handling Rare Words:

    • Subword and byte-level tokenizers address rare and OOV words better than word-level tokenizers.


28.28. 5. Challenges in Tokenization#

  • Multilingual Tokenization:

    • Handling languages with no spaces (e.g., Chinese, Japanese).

    • Combining tokenization strategies for diverse languages.

  • Context Preservation:

    • Breaking words or phrases incorrectly may lose context.

    • Example: "New York" vs. "New" + β€œYork”`.

  • Efficiency vs. Accuracy:

    • Striking a balance between compact representations and linguistic fidelity.


28.29. 6. Quick references for tokenization#

# Let's do a simple regex-based tokenization. This is just a start.
# Main drawback: this word-level method cannot "make" new words (no subword units)

import re

sample_text = "Hello, world! Let's extract words: like 123 or _underscore_"
words = re.findall(r'\b\w+\b', sample_text)

print(words)
['Hello', 'world', 'Let', 's', 'extract', 'words', 'like', '123', 'or', '_underscore_']
# Tokenize text
import re
def word_tokenizer(text):
    words = re.findall(r'\b\w+\b', text)
    word_set = sorted(set(words))
    word2id = {word: i for i, word in enumerate(word_set, start=4)}  # Reserve 0-3 for special tokens
    id2word = {i: word for word, i in word2id.items()}

    # Add special tokens
    word2id["<pad>"], word2id["<sos>"], word2id["<eos>"], word2id["<unk>"] = 0, 1, 2, 3
    id2word[0], id2word[1], id2word[2], id2word[3] = "<pad>", "<sos>", "<eos>", "<unk>"

    return word2id, id2word

# Encode text into sequences of IDs
def encode_text(text, word2id):
    words = re.findall(r'\b\w+\b', text)
    return [word2id.get(word, word2id["<unk>"]) for word in words]

word2id, id2word = word_tokenizer(text)

for k, v in word2id.items():
    if len(k) < 2:
        print (k, v)

print ("VOCABULARY SIZE IS :", len(word2id))
1 5
2 52
3 71
4 84
5 88
6 96
7 103
8 107
9 112
a 115
b 746
c 1225
d 2194
e 2767
f 3242
i 4360
m 5233
o 5873
s 7533
t 8720
u 9204
w 9542
x 9851
VOCABULARY SIZE IS : 9902
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, text_data, word2id, seq_len_input=10, seq_len_target=10):
        self.word2id = word2id
        self.seq_len_input = seq_len_input
        self.seq_len_target = seq_len_target
        self.data = self._prepare_data(text_data)

    def _prepare_data(self, text_data):
        encoded_text = encode_text(text_data.lower(), self.word2id)
        input_target_pairs = []

        for i in range(0, len(encoded_text) - self.seq_len_input - 1, self.seq_len_input):
            # Extract input sequence
            input_seq = encoded_text[i : i + self.seq_len_input]
            # Extract target sequence (add <sos> and <eos>)
            target_seq = encoded_text[i + 1 : i + self.seq_len_target + 2]
            target_seq = [self.word2id["<sos>"]] + target_seq[:self.seq_len_target] + [self.word2id["<eos>"]]
            input_target_pairs.append((input_seq, target_seq))

        return input_target_pairs

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        input_seq, target_seq = self.data[idx]
        # Convert to tensors
        input_seq = torch.tensor(input_seq, dtype=torch.long)
        target_seq = torch.tensor(target_seq, dtype=torch.long)
        return input_seq, target_seq

# Example Usage
dataset = TextDataset(text, word2id, 64, 64)

# Dataloader for batching
dataloader = DataLoader(dataset, batch_size=62, shuffle=True)

# Check the first batch
for batch in dataloader:
    inputs, targets = batch
    print("Input Shape:", inputs.shape)  # (batch_size, seq_len_input)
    print("Target Shape:", targets.shape)  # (batch_size, seq_len_target)
    print ("Input Sequence:", inputs[0])
    print ("Target Sequence:", targets[0])

    for seq in inputs[0]:
        print (id2word[seq.item()])
    print ("---")
    for seq in targets[0]:
        print (id2word[seq.item()])
    break
Input Shape: torch.Size([62, 64])
Target Shape: torch.Size([62, 66])
Input Sequence: tensor([9590, 4434, 8972,  613, 8855, 1431,  932, 8855, 7579,  439, 7091, 8854,
        1191, 5498,  828, 4934, 2673, 7972, 8895, 5141, 2083, 9590, 8972,  828,
        8855, 4070, 8872, 9661, 5842, 5606, 8848, 4028,  115, 5500, 3708, 8855,
        1636,  439, 4775, 9590, 5742, 8972, 8727, 8972,  844,  306, 8855, 9743,
        8855, 1024, 9590, 8861, 3885,  664,  115, 9458, 5562, 6966,  607, 8855,
        1117, 6181, 4675, 1221])
Target Sequence: tensor([   1, 4434, 8972,  613, 8855, 1431,  932, 8855, 7579,  439, 7091, 8854,
        1191, 5498,  828, 4934, 2673, 7972, 8895, 5141, 2083, 9590, 8972,  828,
        8855, 4070, 8872, 9661, 5842, 5606, 8848, 4028,  115, 5500, 3708, 8855,
        1636,  439, 4775, 9590, 5742, 8972, 8727, 8972,  844,  306, 8855, 9743,
        8855, 1024, 9590, 8861, 3885,  664,  115, 9458, 5562, 6966,  607, 8855,
        1117, 6181, 4675, 1221, 8855,    2])
was
important
to
ascertain
the
channels
between
the
sandbanks
and
reefs
that
buoys
might
be
laid
down
since
this
little
creek
was
to
be
the
harbor
they
were
not
more
than
half
a
mile
from
the
coast
and
it
was
necessary
to
tack
to
beat
against
the
wind
the
bonadventure
was
then
going
at
a
very
moderate
rate
as
the
breeze
partly
intercepted
by
---
<sos>
important
to
ascertain
the
channels
between
the
sandbanks
and
reefs
that
buoys
might
be
laid
down
since
this
little
creek
was
to
be
the
harbor
they
were
not
more
than
half
a
mile
from
the
coast
and
it
was
necessary
to
tack
to
beat
against
the
wind
the
bonadventure
was
then
going
at
a
very
moderate
rate
as
the
breeze
partly
intercepted
by
the
<eos>
import torch
import torch.nn as nn
import math

class Seq2SeqTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers, max_len):
        super(Seq2SeqTransformer, self).__init__()
        # Embedding layers for encoder and decoder
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, max_len)

        # Transformer Encoder
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_size,
            nhead=num_heads,
            dim_feedforward=hidden_dim
        )
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers)

        # Transformer Decoder
        self.decoder_layer = nn.TransformerDecoderLayer(
            d_model=embed_size,
            nhead=num_heads,
            dim_feedforward=hidden_dim
        )
        self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers)

        # Final Linear Layer
        self.fc_out = nn.Linear(embed_size, vocab_size)

    def forward(self, src, tgt, src_mask, tgt_mask, src_padding_mask, tgt_padding_mask):
        # Encoder: Embed and add positional encoding
        src_embed = self.embedding(src) * math.sqrt(self.embedding.embedding_dim)
        src_embed = self.positional_encoding(src_embed)
        memory = self.encoder(src_embed, src_key_padding_mask=src_padding_mask)
        #print (f"Memory shape is : {memory.shape}")
        # Decoder: Embed and add positional encoding
        tgt_embed = self.embedding(tgt) * math.sqrt(self.embedding.embedding_dim)
        tgt_embed = self.positional_encoding(tgt_embed)
        output = self.decoder(
            tgt_embed,
            memory,
            tgt_mask=tgt_mask,
            memory_mask=src_mask,
            tgt_key_padding_mask=tgt_padding_mask,
            memory_key_padding_mask=src_padding_mask,
        )

        # Map to vocabulary size
        return self.fc_out(output)


class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len):
        super(PositionalEncoding, self).__init__()
        self.encoding = torch.zeros(max_len, embed_size)
        positions = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2) * -(math.log(10000.0) / embed_size))
        self.encoding[:, 0::2] = torch.sin(positions * div_term)
        self.encoding[:, 1::2] = torch.cos(positions * div_term)
        self.encoding = self.encoding.unsqueeze(0)

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.encoding[:, :seq_len, :].to(x.device)
# Define model parameters
VOCAB_SIZE = len(word2id)  # Example vocab size from tokenization
EMBED_SIZE = 64   # Embedding dimension
NUM_HEADS = 4     # Number of attention heads
HIDDEN_DIM = 256  # Feedforward network hidden size
NUM_LAYERS = 2    # Number of Transformer layers
MAX_LEN = 100     # Maximum sequence length

# Initialize the Transformer model
model = Seq2SeqTransformer(VOCAB_SIZE, EMBED_SIZE, NUM_HEADS, HIDDEN_DIM, NUM_LAYERS, MAX_LEN)

# Print the model summary
print(f"Vocabulary Size: {VOCAB_SIZE}")
print(model)

# Function to calculate the total number of trainable parameters
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example: Calculate for your model
total_params = count_trainable_parameters(model)
print(f"Total Trainable Parameters: {total_params}")
Vocabulary Size: 9902
Seq2SeqTransformer(
  (embedding): Embedding(9902, 64)
  (positional_encoding): PositionalEncoding()
  (encoder_layer): TransformerEncoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
    )
    (linear1): Linear(in_features=64, out_features=256, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=256, out_features=64, bias=True)
    (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
  )
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (linear1): Linear(in_features=64, out_features=256, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=256, out_features=64, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (decoder_layer): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
    )
    (multihead_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
    )
    (linear1): Linear(in_features=64, out_features=256, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=256, out_features=64, bias=True)
    (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (norm3): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
    (dropout3): Dropout(p=0.1, inplace=False)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (multihead_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (linear1): Linear(in_features=64, out_features=256, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=256, out_features=64, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
        (dropout3): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (fc_out): Linear(in_features=64, out_features=9902, bias=True)
)
Total Trainable Parameters: 1627566
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/transformer.py:379: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
  warnings.warn(
# Source and Target Sequences
src = torch.randint(0, VOCAB_SIZE, (20, 10))  # (src_seq_len, batch_size)
tgt = torch.randint(0, VOCAB_SIZE, (100, 10))  # (tgt_seq_len, batch_size)

# Padding Masks
src_padding_mask = (src == 0).T  # Transpose to match shape (batch_size, src_seq_len)
tgt_padding_mask = (tgt == 0).T  # Transpose to match shape (batch_size, tgt_seq_len)

# Causal Mask for Decoder
tgt_seq_len = tgt.size(0)
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_seq_len).to(tgt.device).bool()

# Forward Pass
output = model(src, tgt, None, tgt_mask, src_padding_mask, tgt_padding_mask)

print("Output shape:", output.shape)  # Should be (tgt_seq_len, batch_size, VOCAB_SIZE)
Output shape: torch.Size([100, 10, 9902])
def generate_text(model, prompt, max_len, word2id, id2word, device = torch.device("cpu")):
    model.eval()
    model.to(device)
    # Encode the prompt
    encoded_prompt = encode_text(prompt.lower(), word2id)
    src = torch.tensor(encoded_prompt, dtype=torch.long).unsqueeze(1).to(device)
    src_padding_mask = (src == word2id["<pad>"]).T.bool()

    # Initialize target sequence with <sos>
    tgt = torch.tensor([word2id["<sos>"]], dtype=torch.long).unsqueeze(1).to(device)
    generated = []

    for _ in range(max_len):
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(0)).to(device)
        output = model(src, tgt, None, tgt_mask, src_padding_mask, None)
        next_token = output[-1, 0].argmax(-1).item()
        if next_token == word2id["<eos>"]:
            break
        generated.append(next_token)
        tgt = torch.cat([tgt, torch.tensor([[next_token]], device=device)], dim=0)

    return " ".join([id2word[token] for token in generated])

prompt = "Once upon a time there lived a "
generated_text = generate_text(model, prompt, max_len=128, word2id=word2id, id2word=id2word)
print("Generated Text: \n", generated_text)
generated_text = generate_text(model, prompt, max_len=128, word2id=word2id, id2word=id2word)
print("Generated Text: \n", generated_text)
Generated Text: 
 passes strait portrait returning hair furnaces vibration estates deductions chemicals suffered murdering pieces waistband flights wherever hastily indeed abilities disapppearing precincts irritate added successively endurance ulysses insignificant paws smart nice liked blue marks lit thistles shared cause damage gates remedies transforming wearied smart nice liked blue lockers piston terms wonted cave sink unluckily disapppearing precincts necessaries event other doses concocted occasional unprotected cases immovable frozen unlashed sugar unable beginning hypertext sanctify field respite limpid whirling william chlorine joyously imprudent jupiter review raft remains halliard cetacean cleave illness attacks later folds estates up review raft remains halliard cetacean nitrogen hinge gorges smaller as franklin acquisition those exclamation pictures postman unluckily disapppearing precincts greedily vinegar vertical outer grotto fowling prints cautiously pieces waistband fronting indicated yours silk descend harvey wooden
Generated Text: 
 passes strait portrait returning hair furnaces vibration estates deductions chemicals suffered murdering pieces waistband flights wherever hastily indeed abilities disapppearing precincts irritate added successively endurance ulysses insignificant paws smart nice liked blue marks lit thistles shared cause damage gates remedies transforming wearied smart nice liked blue lockers piston terms wonted cave sink unluckily disapppearing precincts necessaries event other doses concocted occasional unprotected cases immovable frozen unlashed sugar unable beginning hypertext sanctify field respite limpid whirling william chlorine joyously imprudent jupiter review raft remains halliard cetacean cleave illness attacks later folds estates up review raft remains halliard cetacean nitrogen hinge gorges smaller as franklin acquisition those exclamation pictures postman unluckily disapppearing precincts greedily vinegar vertical outer grotto fowling prints cautiously pieces waistband fronting indicated yours silk descend harvey wooden
import torch, torch.nn as nn

# Define optimizer and criterion
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=word2id["<pad>"])  # Ignore padding tokens

import torch.nn.functional as F
ignored_indices = [word2id["<pad>"]]
def entropy_loss(output, ignored_indices):
    """
    Computes the mean entropy of the output distribution, excluding the specified token indices.
    Args:
        output: Logits with the vocabulary as the last dimension
        ignored_indices: List of token indices to ignore (e.g., <pad>)
    """
    # Build a mask over the vocabulary and zero out the logits of the ignored indices
    # (an approximation: zeroed logits still receive some probability after softmax)
    mask = torch.ones(output.size(-1), device=output.device)
    for idx in ignored_indices:
        mask[idx] = 0
    mask = mask.unsqueeze(0)  # Expand for broadcasting

    # Apply mask and compute probabilities
    output = output * mask
    probs = F.softmax(output, dim=-1)
    log_probs = F.log_softmax(output, dim=-1)

    # Calculate entropy
    entropy = -torch.sum(probs * log_probs, dim=-1)

    # Exclude ignored tokens from averaging
    valid_mask = probs.sum(dim=-1) > 0  # Ensure no ignored tokens contribute
    return entropy[valid_mask].mean()


alpha = 0.001
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print ("Training is running on : ", device)
# Training loop
epochs = 500
start = 0
end = 500
assert(epochs == end - start)

model.train()

for epoch in range(start, end):
    total_loss = 0
    for idx, (src, tgt) in enumerate(dataloader):

        if (idx % 1000 == 0):
            print (f"done with {idx + 1} batches")
        # Move data to device
        src = src.to(device)
        tgt = tgt.to(device)

        # Adjust tensor shapes for Transformer
        src = src.T  # Shape: (src_seq_len, batch_size)
        tgt = tgt.T  # Shape: (tgt_seq_len + 2, batch_size) (including <sos> and <eos>)

        tgt_input = tgt[:-1, :]  # Input to the decoder excludes <eos>
        tgt_output = tgt[1:, :]  # Target output excludes <sos>

        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_input.size(0)).to(device)
        src_padding_mask = (src == word2id["<pad>"]).T.bool()
        tgt_padding_mask = (tgt_input == word2id["<pad>"]).T.bool()

        # Forward pass
        output = model(src, tgt_input, None, tgt_mask, src_padding_mask, tgt_padding_mask)

        # Compute loss
        loss = criterion(output.reshape(-1, VOCAB_SIZE), tgt_output.reshape(-1)) + alpha * entropy_loss(output, ignored_indices)
        total_loss += loss.item()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}/{epochs}, Loss: {total_loss / len(dataloader):.4f}")

    # Sentence generation at the end of each epoch
    model.eval()  # Switch to evaluation mode
    prompt = "Once upon a time, there lived a"  # Example prompt
    generated_text = generate_text(model, prompt.lower(), max_len=50, word2id=word2id, id2word=id2word, device=device)
    print(f"Epoch {epoch + 1} Generated Text: {generated_text}")
    model.train()  # Switch back to training mode
Training is running on :  cuda
done with 1 batches
Epoch 1/500, Loss: 4.3661
Epoch 1 Generated Text: a great means of a time there a great a great a great a great a great a great use there was a great a great a great use off there there a great a great a great use off there a great a great use there a great use
done with 1 batches
Epoch 2/500, Loss: 4.3135
Epoch 2 Generated Text: a time there there there there there a very evident that there there there a very evident that there there there a very evident that there there there a very evident that there there a great a very evident that there there there there there a very evident that there
done with 1 batches
Epoch 3/500, Loss: 4.2596
Epoch 3 Generated Text: a time there a great a very evident there a very evident that there a very evident there a very evident there a very evident there a very evident there a very evident that there a very evident there a very evident there a very evident there a very evident
done with 1 batches
Epoch 4/500, Loss: 4.2083
Epoch 4 Generated Text: a great means of a very evident there a very evident there a very evident there a very evident there a very evident there a very evident that there a great time there a great time there a great time there a great time there a great time there a
done with 1 batches
Epoch 5/500, Loss: 4.1630
Epoch 5 Generated Text: a great means of a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very
done with 1 batches
Epoch 6/500, Loss: 4.1166
Epoch 6 Generated Text: a time there there there there there a copy a copy a copy a copy there there there there there there a copy a copy a copy a copy a copy a copy there there there there there there there there there there there there there a copy a copy
done with 1 batches
Epoch 7/500, Loss: 4.0731
Epoch 7 Generated Text: a time there there there there there a very evident there a very there a very there a very there a very there a very there a very there a very there a very there a very there a very there a very there a very there a very there
done with 1 batches
Epoch 8/500, Loss: 4.0297
Epoch 8 Generated Text: a time there a copy a copy there a copy a copy there a copy a copy there there a copy a copy a any access there there a reasonable there a reasonable there a reasonable there a reasonable there a reasonable there a reasonable there a reasonable there a
done with 1 batches
Epoch 9/500, Loss: 3.9907
Epoch 9 Generated Text: a time there there there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a very evident there a
done with 1 batches
Epoch 10/500, Loss: 3.9505
Epoch 10 Generated Text: a time there a copy there a copy there a copy there a copy there a copy there a copy there there a copy there a copy there a copy there a copy there a copy there a copy there a copy there a copy there a copy there a
done with 1 batches
Epoch 11/500, Loss: 3.9119
Epoch 11 Generated Text: a time there there a copy there a copy there a copy there a copy there a copy there a great time there a copy a copy a great time there a copy a great time there a copy a great time there a copy a great time there a
done with 1 batches
Epoch 12/500, Loss: 3.8765
Epoch 12 Generated Text: a means of a copy there there there there a copy there there a copy there there a copy there there a copy there there a copy there there a copy there there a copy there there a copy there there a copy there there a copy there there a
done with 1 batches
Epoch 13/500, Loss: 3.8414
Epoch 13 Generated Text: a burial there there there a burial there there a burial there a burial there there a burial there a burial there a burial there there a burial there a burial there a burial there there a burial there a burial there a burial there there a burial there a
done with 1 batches
Epoch 14/500, Loss: 3.8070
Epoch 14 Generated Text: a burial there there a burial there there a burial there a burial there there a burial there a burial there there a burial there a reasonable there there there a copy there a copy there a copy there a copy there a copy there a copy there a reasonable
done with 1 batches
Epoch 15/500, Loss: 3.7708
Epoch 15 Generated Text: a burial there there there there there there a reasonable there there there there there a reasonable there there there there there a reasonable there there there there there a reasonable there there there there there a reasonable there there there there a reasonable there there there there there a
done with 1 batches
Epoch 16/500, Loss: 3.7443
Epoch 16 Generated Text: a burial there there there there a burial there a burial there there a copy there a copy there a copy there there a copy there a copy there there a copy there there a copy there there a copy there there a copy there there a copy there there
done with 1 batches
Epoch 17/500, Loss: 3.7124
Epoch 17 Generated Text: a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial there a burial
done with 1 batches
Epoch 18/500, Loss: 3.6849
Epoch 18 Generated Text: there a burial there there a reasonable there there there there there there a reasonable there there there there a reasonable there there there there there a reasonable there there there there a reasonable there there there there there a reasonable there there there there a reasonable there there there
done with 1 batches
Epoch 19/500, Loss: 3.6533
Epoch 19 Generated Text: there a burial there there there a burial there there a burial there there a burial there there a burial there a burial there that there a reasonable there there there there there there there there a reasonable there there there there there a reasonable there there there there a
done with 1 batches
Epoch 20/500, Loss: 3.6304
Epoch 20 Generated Text: there a waves been inhabited there there a reasonable there there a reasonable there there there a reasonable there there a reasonable there there there a reasonable there there a reasonable there there there a reasonable there there there a reasonable there there a reasonable there there there a reasonable
done with 1 batches
Epoch 21/500, Loss: 3.6049
Epoch 21 Generated Text: a burial there there there there a burial there a burial there that there a waves been a waves been a waves been a waves been inhabited or dashing there a waves been a waves been there a waves been there a waves been there a reasonable there a reasonable
done with 1 batches
Epoch 22/500, Loss: 3.5789
Epoch 22 Generated Text: a burial there there there there there a burial there there imagine there there imagine there there there imagine a waves been a burial there that there a waves been thrown on the rocks there a reasonable there a reasonable there a reasonable there a reasonable there a reasonable there
done with 1 batches
Epoch 23/500, Loss: 3.5588
Epoch 23 Generated Text: a burial there there there a understand there imagine there imagine there a john there imagine there imagine there imagine there imagine there a waves no no no no no no no no no no no no no no no no no no no no no no no no no
done with 1 batches
Epoch 24/500, Loss: 3.5346
Epoch 24 Generated Text: there a waves been inhabited there a waves been inhabited there a reasonable there a waves been inhabited unless there there a waves been inhabited unless there there imagine there a waves been inhabited unless a waves been inhabited there there there imagine there imagine there there a waves been
done with 1 batches
Epoch 25/500, Loss: 3.5115
Epoch 25 Generated Text: there a burial there a burial there imagine there that there imagine there imagine there imagine there a waves no no no no no no no no no no no no no no no no no no no no that there there there there there a waves there a waves
done with 1 batches
Epoch 26/500, Loss: 3.4894
Epoch 26 Generated Text: there a burial there a burial there a burial there a burial there a burial there a burial there a burial there imagine there a burial there imagine there there imagine there a waves there imagine there a waves been a waves there a waves there a waves there a
done with 1 batches
Epoch 27/500, Loss: 3.4686
Epoch 27 Generated Text: there a thorough the rocks there a thorough there a remark there a remark there there imagine there a reasonable there there there a reasonable there there there a reasonable there there a reasonable there there there a reasonable there there a reasonable there there there a reasonable there there
done with 1 batches
Epoch 28/500, Loss: 3.4492
Epoch 28 Generated Text: there a john mangles no no no no no that there there imagine there imagine there there imagine there there imagine there there there imagine there there imagine there there imagine there there imagine there there there imagine there there imagine there there imagine there there there imagine there there
done with 1 batches
Epoch 29/500, Loss: 3.4336
Epoch 29 Generated Text: there imagine a waves been inhabited there there imagine there there imagine there there there imagine there there imagine there there imagine there there imagine there there there imagine there there imagine there there imagine there there imagine there there there imagine there there imagine there there imagine there there
done with 1 batches
Epoch 30/500, Loss: 3.4145
Epoch 30 Generated Text: there a burial there imagine there there imagine there there there imagine there there imagine there there there imagine a waves been blown there there there imagine there imagine there there imagine there there imagine there imagine there there imagine there there imagine there there imagine there there imagine there
done with 1 batches
Epoch 31/500, Loss: 3.3959
Epoch 31 Generated Text: there imagine there there imagine there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there imagine there there imagine there there imagine there there
done with 1 batches
Epoch 32/500, Loss: 3.3775
Epoch 32 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 33/500, Loss: 3.3605
Epoch 33 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 34/500, Loss: 3.3466
Epoch 34 Generated Text: there imagine there a waves been inhabited there imagine there imagine there imagine there imagine there imagine there imagine there a waves been thrown there imagine there imagine there imagine there a waves there imagine there imagine there imagine there imagine there imagine there imagine there a waves there imagine
done with 1 batches
Epoch 35/500, Loss: 3.3252
Epoch 35 Generated Text: there imagine there a waves no understand there there there imagine there there there imagine there there there imagine there there there there imagine there there there imagine there there there imagine there there there there imagine there there there there imagine there there there imagine there there there there
done with 1 batches
Epoch 36/500, Loss: 3.3130
Epoch 36 Generated Text: there imagine there a burial there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there
done with 1 batches
Epoch 37/500, Loss: 3.2986
Epoch 37 Generated Text: there a understand there there there a understand there there imagine there there there imagine there there imagine there there imagine there there there imagine there there a waves no no no no no no no no no no no no no no no no no no no no no
done with 1 batches
Epoch 38/500, Loss: 3.2823
Epoch 38 Generated Text: there imagine there a thorough the rocks there imagine there imagine there there imagine there imagine there there imagine there imagine there there imagine there imagine there there imagine there there imagine there imagine there there imagine there imagine there there imagine there there imagine there imagine there there imagine
done with 1 batches
Epoch 39/500, Loss: 3.2700
Epoch 39 Generated Text: there imagine there a waves been blown there there there imagine there imagine there there imagine there imagine there there imagine there imagine there there imagine there there imagine there imagine there there imagine there imagine there there imagine there there imagine there imagine there there imagine there there imagine
done with 1 batches
Epoch 40/500, Loss: 3.2529
Epoch 40 Generated Text: there imagine there a burial there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there imagine there there there imagine there there imagine there there imagine there there imagine there there imagine there there there imagine there there imagine there there
done with 1 batches
Epoch 41/500, Loss: 3.2390
Epoch 41 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 42/500, Loss: 3.2282
Epoch 42 Generated Text: there a understand there imagine there there imagine there imagine there there imagine there there there imagine there there imagine there there imagine there there there imagine there there imagine there there there imagine there there imagine there there there imagine there there that there imagine there imagine there imagine
done with 1 batches
Epoch 43/500, Loss: 3.2183
Epoch 43 Generated Text: there imagine there a waves imagine there imagine there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there there imagine there imagine there imagine there there imagine there
done with 1 batches
Epoch 44/500, Loss: 3.2020
Epoch 44 Generated Text: there imagine there a waves imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 45/500, Loss: 3.1893
Epoch 45 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 46/500, Loss: 3.1782
Epoch 46 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 47/500, Loss: 3.1688
Epoch 47 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 48/500, Loss: 3.1572
Epoch 48 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 49/500, Loss: 3.1495
Epoch 49 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 50/500, Loss: 3.1325
Epoch 50 Generated Text: there imagine there that there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 51/500, Loss: 3.1230
Epoch 51 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 52/500, Loss: 3.1166
Epoch 52 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 53/500, Loss: 3.1007
Epoch 53 Generated Text: a thorough but there imagine this time there imagine this imagine this imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 54/500, Loss: 3.0929
Epoch 54 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 55/500, Loss: 3.0832
Epoch 55 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 56/500, Loss: 3.0788
Epoch 56 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 57/500, Loss: 3.0645
Epoch 57 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there that there imagine there imagine there imagine there
done with 1 batches
Epoch 58/500, Loss: 3.0546
Epoch 58 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 59/500, Loss: 3.0463
Epoch 59 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 60/500, Loss: 3.0388
Epoch 60 Generated Text: a thorough the rocks there imagine there imagine this imagine there imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 61/500, Loss: 3.0260
Epoch 61 Generated Text: a thorough the rocks there imagine there imagine there imagine this imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 62/500, Loss: 3.0185
Epoch 62 Generated Text: a thorough the rocks there imagine there imagine there imagine that there imagine this imagine the rocks there imagine there imagine there imagine there imagine that imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 63/500, Loss: 3.0090
Epoch 63 Generated Text: a thorough the rocks there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 64/500, Loss: 3.0059
Epoch 64 Generated Text: a access there imagine this imagine this imagine this imagine this imagine there imagine this imagine this imagine there imagine this imagine there imagine this imagine there imagine there imagine this imagine there imagine there imagine this imagine there imagine that there imagine this imagine this imagine there imagine there
done with 1 batches
Epoch 65/500, Loss: 2.9957
Epoch 65 Generated Text: there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 66/500, Loss: 2.9868
Epoch 66 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 67/500, Loss: 2.9802
Epoch 67 Generated Text: a thorough time there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 68/500, Loss: 2.9714
Epoch 68 Generated Text: a burial there imagine this imagine this imagine this imagine a burial there imagine this imagine this imagine there imagine this imagine a burial there imagine there imagine that there imagine this imagine this imagine a once once imagine this imagine this imagine this imagine this imagine this imagine that
done with 1 batches
Epoch 69/500, Loss: 2.9652
Epoch 69 Generated Text: once there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there imagine there imagine there there imagine there imagine there imagine there imagine there imagine there there
done with 1 batches
Epoch 70/500, Loss: 2.9551
Epoch 70 Generated Text: a access there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 71/500, Loss: 2.9464
Epoch 71 Generated Text: there imagine a thorough the rocks there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 72/500, Loss: 2.9388
Epoch 72 Generated Text: a thorough time there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 73/500, Loss: 2.9338
Epoch 73 Generated Text: once there imagine a thorough the rocks there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 74/500, Loss: 2.9239
Epoch 74 Generated Text: a thorough time there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine a reason there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 75/500, Loss: 2.9226
Epoch 75 Generated Text: a thorough time there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine a thorough the rocks there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 76/500, Loss: 2.9136
Epoch 76 Generated Text: once once once once once there imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 77/500, Loss: 2.9097
Epoch 77 Generated Text: once there imagine a burial there imagine them there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 78/500, Loss: 2.8998
Epoch 78 Generated Text: once there imagine a thorough the lift imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine that imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 79/500, Loss: 2.8929
Epoch 79 Generated Text: a visit extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there extinguished there
done with 1 batches
Epoch 80/500, Loss: 2.8874
Epoch 80 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 81/500, Loss: 2.8810
Epoch 81 Generated Text: a visit extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 82/500, Loss: 2.8766
Epoch 82 Generated Text: a visit extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 83/500, Loss: 2.8726
Epoch 83 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 84/500, Loss: 2.8629
Epoch 84 Generated Text: there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 85/500, Loss: 2.8595
Epoch 85 Generated Text: once imagine there imagine them once imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 86/500, Loss: 2.8502
Epoch 86 Generated Text: once once there imagine a burial there imagine them there imagine there imagine there imagine there imagine them there imagine a door there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 87/500, Loss: 2.8439
Epoch 87 Generated Text: a access there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 88/500, Loss: 2.8384
Epoch 88 Generated Text: a access there imagine there imagine extinguished there imagine extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 89/500, Loss: 2.8317
Epoch 89 Generated Text: once there imagine there imagine there imagine there imagine there extinguished there imagine there imagine there imagine there extinguished there imagine there extinguished there imagine there extinguished there imagine there imagine there extinguished there imagine there extinguished there imagine there extinguished there imagine there imagine there extinguished there imagine there
done with 1 batches
Epoch 90/500, Loss: 2.8252
Epoch 90 Generated Text: a access there imagine this time there imagine this imagine this imagine inevitable once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once
done with 1 batches
Epoch 91/500, Loss: 2.8211
Epoch 91 Generated Text: a access there imagine this imagine this imagine extinguished there imagine extinguished there imagine extinguished there imagine this imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine supposition imagine extinguished there imagine supposition imagine this imagine extinguished there imagine extinguished there imagine
done with 1 batches
Epoch 92/500, Loss: 2.8176
Epoch 92 Generated Text: a access there imagine this imagine extinguished there imagine this imagine extinguished there imagine extinguished there imagine extinguished there imagine supposition imagine this imagine extinguished there imagine extinguished there imagine extinguished there imagine supposition imagine this imagine extinguished there imagine extinguished there imagine supposition imagine extinguished there imagine supposition imagine
done with 1 batches
Epoch 93/500, Loss: 2.8162
Epoch 93 Generated Text: once imagine there imagine this imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there extinguished there
done with 1 batches
Epoch 94/500, Loss: 2.8073
Epoch 94 Generated Text: a door extinguished there extinguished access there imagine them imagine extinguished there imagine them imagine extinguished there imagine extinguished there imagine them imagine extinguished there imagine extinguished there imagine them imagine there imagine there imagine
done with 1 batches
Epoch 95/500, Loss: 2.8019
Epoch 95 Generated Text: once a access there imagine this imagine this imagine them imagine this imagine this imagine them imagine this imagine a door there imagine this imagine them imagine there imagine there imagine a door there imagine extinguished there imagine extinguished there imagine
done with 1 batches
Epoch 96/500, Loss: 2.7950
Epoch 96 Generated Text: a access there imagine extinguished there imagine extinguished there imagine extinguished there imagine them imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there
done with 1 batches
Epoch 97/500, Loss: 2.7893
Epoch 97 Generated Text: once imagine them imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 98/500, Loss: 2.7851
Epoch 98 Generated Text: a thorough time there imagine them once imagine this imagine them once imagine this imagine this imagine this imagine there imagine them once imagine there imagine there imagine there imagine there imagine there imagine them once imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 99/500, Loss: 2.7808
Epoch 99 Generated Text: once imagine this imagine there imagine this imagine there imagine this imagine there imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there extinguished there imagine there imagine there imagine there imagine there extinguished there imagine
done with 1 batches
Epoch 100/500, Loss: 2.7749
Epoch 100 Generated Text: once once imagine this imagine this imagine this imagine this imagine them once imagine this imagine this imagine there imagine this imagine the lift there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine this imagine there imagine there imagine
done with 1 batches
Epoch 101/500, Loss: 2.7717
Epoch 101 Generated Text: a access there imagine supposition imagine supposition imagine this imagine there imagine there imagine this imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 102/500, Loss: 2.7661
Epoch 102 Generated Text: a access there imagine this time there imagine inevitable time there imagine them once imagine this time there imagine them once imagine this time there imagine this time there imagine this time there imagine them imagine supposition imagine supposition imagine this time there imagine them imagine supposition imagine supposition imagine
done with 1 batches
Epoch 103/500, Loss: 2.7599
Epoch 103 Generated Text: a granite rocks there imagine this time there imagine inevitable time there imagine inevitable time there imagine them once imagine them once imagine supposition unless a sheltered extinguished there imagine supposition unless a smile there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 104/500, Loss: 2.7525
Epoch 104 Generated Text: a access there imagine them imagine this imagine them imagine this imagine them imagine supposition imagine this imagine them imagine there imagine there imagine there imagine there imagine them imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 105/500, Loss: 2.7477
Epoch 105 Generated Text: a access there imagine this imagine this imagine this imagine this imagine this imagine this imagine this imagine them imagine this imagine this imagine this imagine supposition imagine this imagine them imagine supposition imagine this imagine supposition imagine this imagine supposition imagine this imagine supposition imagine this imagine there imagine
done with 1 batches
Epoch 106/500, Loss: 2.7474
Epoch 106 Generated Text: a granite rocks there imagine this imagine them imagine this imagine this imagine them imagine this imagine this imagine this imagine them imagine there imagine there imagine this imagine there imagine them imagine there imagine there imagine there imagine this imagine there imagine there imagine there imagine this imagine them
done with 1 batches
Epoch 107/500, Loss: 2.7393
Epoch 107 Generated Text: a access there imagine this imagine this imagine this imagine them imagine this imagine supposition imagine this imagine this imagine them imagine supposition imagine there imagine supposition imagine this imagine there imagine supposition imagine this imagine there imagine supposition imagine there imagine this imagine supposition imagine there imagine there imagine
done with 1 batches
Epoch 108/500, Loss: 2.7372
Epoch 108 Generated Text: a access there imagine them me there imagine this imagine them imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 109/500, Loss: 2.7301
Epoch 109 Generated Text: a visit extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine them there imagine them there imagine them there imagine them there imagine this time there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 110/500, Loss: 2.7275
Epoch 110 Generated Text: them a burial supposition unless there imagine supposition unless a burial supposition unless a burial supposition unless a burial supposition unless a burial supposition unless a burial supposition unless a burial supposition unless a burial supposition unless a burial supposition unless a burial supposition unless understand there imagine extinguished there
done with 1 batches
Epoch 111/500, Loss: 2.7223
Epoch 111 Generated Text: a burial supposition imagine inevitable supposition imagine inevitable supposition imagine them there imagine supposition imagine them there imagine supposition imagine supposition imagine them there imagine supposition imagine supposition imagine them there imagine supposition imagine supposition imagine a once imagine them there imagine them there imagine a gigantic imagine there imagine
done with 1 batches
Epoch 112/500, Loss: 2.7159
Epoch 112 Generated Text: a thorough time there imagine extinguished there imagine extinguished there imagine them once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once
done with 1 batches
Epoch 113/500, Loss: 2.7132
Epoch 113 Generated Text: a lake there extinguished there extinguished there extinguished there extinguished there imagine them once imagine them once imagine them once imagine them once imagine them once imagine them there imagine a smile there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 114/500, Loss: 2.7091
Epoch 114 Generated Text: a granite rocks there imagine them once once once there imagine upon our reason sunk there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 115/500, Loss: 2.7010
Epoch 115 Generated Text: a lake there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there
done with 1 batches
Epoch 116/500, Loss: 2.7015
Epoch 116 Generated Text: a lake there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished
done with 1 batches
Epoch 117/500, Loss: 2.6992
Epoch 117 Generated Text: them a lake there imagine this imagine inevitable cape there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine
done with 1 batches
Epoch 118/500, Loss: 2.6907
Epoch 118 Generated Text: once once once once once once once there imagine this imagine dakkar inhabited there extinguished there extinguished there extinguished there extinguished there imagine dakkar imagine there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there imagine there extinguished there extinguished there
done with 1 batches
Epoch 119/500, Loss: 2.6883
Epoch 119 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me that there
done with 1 batches
Epoch 120/500, Loss: 2.6853
Epoch 120 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me me me that there extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 121/500, Loss: 2.6798
Epoch 121 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me that there extinguished there extinguished there extinguished there extinguished there extinguished there imagine myself there
done with 1 batches
Epoch 122/500, Loss: 2.6758
Epoch 122 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me your inhabited there extinguished there extinguished there extinguished there extinguished there extinguished there imagine sentences extinguished there imagine sentences extinguished there imagine sentences extinguished there imagine sentences extinguished
done with 1 batches
Epoch 123/500, Loss: 2.6684
Epoch 123 Generated Text: a granite rocks there extinguished there imagine supposition unless a supposition unless a supposition unless a supposition unless once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once
done with 1 batches
Epoch 124/500, Loss: 2.6675
Epoch 124 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me that there extinguished there extinguished there
done with 1 batches
Epoch 125/500, Loss: 2.6662
Epoch 125 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me me me me me me that there extinguished there extinguished there extinguished there imagine sentences extinguished there imagine sentences extinguished there imagine there imagine there imagine there imagine
done with 1 batches
Epoch 126/500, Loss: 2.6637
Epoch 126 Generated Text: once once once once once once once once once once once once once there imagine dakkar inhabited there extinguished there extinguished there extinguished there extinguished of a poultry supposition our time once once once once once once once once once once once once once once once once once once once
done with 1 batches
Epoch 127/500, Loss: 2.6553
Epoch 127 Generated Text: a lake there imagine them me me me me me me me me me that there extinguished there extinguished there imagine sentences extinguished there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 128/500, Loss: 2.6555
Epoch 128 Generated Text: them me a lake there extinguished there extinguished there extinguished there imagine myself once once once once once once once a smile there extinguished there imagine dakkar inhabited there imagine dakkar inhabited there imagine myself there extinguished there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 129/500, Loss: 2.6460
Epoch 129 Generated Text: a lake there extinguished there extinguished there extinguished there imagine upon there extinguished there imagine dakkar inhabited there extinguished there extinguished there extinguished there extinguished there imagine there extinguished there extinguished there imagine there extinguished there imagine there extinguished there imagine there extinguished there extinguished there imagine there extinguished there
done with 1 batches
Epoch 130/500, Loss: 2.6433
Epoch 130 Generated Text: a thorough time there imagine supposition unless there imagine supposition unless there imagine supposition unless there imagine supposition unless there imagine supposition unless there imagine supposition unless there imagine supposition unless there imagine supposition unless there imagine them once once once once once once once once once once once once
done with 1 batches
Epoch 131/500, Loss: 2.6422
Epoch 131 Generated Text: upon a clear there extinguished there extinguished there extinguished there extinguished there extinguished extinguished there extinguished there extinguished extinguished there extinguished extinguished there extinguished extinguished there extinguished there extinguished extinguished there extinguished extinguished there extinguished there extinguished extinguished there extinguished there extinguished extinguished there extinguished extinguished there extinguished there extinguished
done with 1 batches
Epoch 132/500, Loss: 2.6421
Epoch 132 Generated Text: a burial once once once once once once once once once once once there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished extinguished there extinguished extinguished extinguished
done with 1 batches
Epoch 133/500, Loss: 2.6343
Epoch 133 Generated Text: a door extinguished there extinguished there extinguished think there extinguished there extinguished there extinguished there imagine them once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once
done with 1 batches
Epoch 134/500, Loss: 2.6316
Epoch 134 Generated Text: a lake there extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there
done with 1 batches
Epoch 135/500, Loss: 2.6266
Epoch 135 Generated Text: a lake there imagine them me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me that there extinguished there extinguished there extinguished there extinguished
done with 1 batches
Epoch 136/500, Loss: 2.6231
Epoch 136 Generated Text: upon a lake there extinguished there extinguished there imagine dakkar inhabited there extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished there extinguished there imagine extinguished there extinguished there
done with 1 batches
Epoch 137/500, Loss: 2.6185
Epoch 137 Generated Text: upon a smile there extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished
done with 1 batches
Epoch 138/500, Loss: 2.6199
Epoch 138 Generated Text: upon a burial supposition perceived there extinguished there imagine them me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me me
done with 1 batches
Epoch 139/500, Loss: 2.6094
Epoch 139 Generated Text: a burial supposition a burial supposition unless once once once once once once once once once once once once once once there extinguished there imagine supposition imagine them there imagine supposition imagine them there imagine supposition imagine them there imagine supposition imagine them there imagine a supposition imagine supposition imagine
done with 1 batches
Epoch 140/500, Loss: 2.6097
Epoch 140 Generated Text: once once once a lake there imagine upon there imagine this imagine this imagine them there imagine this imagine a sea there imagine there imagine there imagine there extinguished there imagine there extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine
done with 1 batches
Epoch 141/500, Loss: 2.6071
Epoch 141 Generated Text: a lake there extinguished there extinguished there extinguished extinguished extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished
done with 1 batches
Epoch 142/500, Loss: 2.6040
Epoch 142 Generated Text: a lake there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished there imagine extinguished
done with 1 batches
Epoch 143/500, Loss: 2.5977
Epoch 143 Generated Text: upon a burial supposition a lake there extinguished there extinguished there imagine supposition unless there imagine supposition unless there imagine them me hazard there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 144/500, Loss: 2.5945
Epoch 144 Generated Text: upon a smile there extinguished there extinguished there imagine them there imagine this imagine a smile there extinguished there extinguished there imagine them once imagine a smile there extinguished there imagine this imagine a smile there extinguished there extinguished there imagine
done with 1 batches
Epoch 145/500, Loss: 2.5962
Epoch 145 Generated Text: upon a smile there extinguished there extinguished there imagine them there imagine them there imagine this imagine this imagine them once imagine a smile there imagine this imagine this imagine them there imagine a smile there imagine there imagine there imagine there imagine there imagine there imagine there imagine there
done with 1 batches
Epoch 146/500, Loss: 2.5894
Epoch 146 Generated Text: upon a lake there extinguished there extinguished there imagine them once imagine them once imagine this imagine this imagine upon imagine upon imagine there imagine there imagine there imagine there imagine them once imagine there imagine there imagine there imagine there imagine there imagine there extinguished there extinguished there extinguished
done with 1 batches
Epoch 147/500, Loss: 2.5860
Epoch 147 Generated Text: upon a lake there extinguished there imagine this imagine this imagine them once imagine this imagine dakkar inhabited there imagine there imagine there imagine there imagine there extinguished there extinguished there extinguished there extinguished there imagine there extinguished there extinguished there extinguished there imagine there extinguished there extinguished there extinguished
done with 1 batches
Epoch 148/500, Loss: 2.5837
Epoch 148 Generated Text: upon a lake there extinguished there extinguished there extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished
done with 1 batches
Epoch 149/500, Loss: 2.5791
Epoch 149 Generated Text: upon a lake there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there imagine a supposition imagine extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished
done with 1 batches
Epoch 150/500, Loss: 2.5723
Epoch 150 Generated Text: a lake there extinguished there extinguished there imagine extinguished there extinguished there imagine extinguished there imagine extinguished there extinguished there imagine extinguished there extinguished extinguished extinguished extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished there extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished extinguished
done with 1 batches
[Output for epochs 151-197 omitted: the loss creeps down from about 2.57 to 2.45 while the generated text keeps looping over the same few tokens ("once", "there extinguished", "imagine supposition").]
Epoch 198/500, Loss: 2.4473
Epoch 198 Generated Text: once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once once
done with 1 batches
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-20-793430ae7928> in <cell line: 13>()
     34         # Forward pass
---> 35         output = model(src, tgt_input, None, tgt_mask, src_padding_mask, tgt_padding_mask)
[... stack frames through torch/nn/modules/transformer.py omitted ...]
KeyboardInterrupt: 

The training run was interrupted manually here rather than run for the full 500 epochs.

28.30. Generation strategies.#

import torch.nn.functional as F

def generate_text_with_sampling(model, prompt, max_len, word2id, id2word, device, top_k=5):
    model.eval()
    model.to(device)
    prompt = prompt.lower()
    encoded_prompt = encode_text(prompt, word2id)
    print (encoded_prompt)
    # Encode the prompt
    src = torch.tensor(encoded_prompt, dtype=torch.long).unsqueeze(1).to(device)
    src_padding_mask = (src == word2id["<pad>"]).T.bool()

    # Initialize target sequence with <sos>
    tgt = torch.tensor([word2id["<sos>"]], dtype=torch.long).unsqueeze(1).to(device)
    generated = []

    for _ in range(max_len):
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(0)).to(device)
        output = model(src, tgt, None, tgt_mask, src_padding_mask, None)
        logits = output[-1, 0]

        # Set logits for <pad> and <unk> to a very low value
        logits[word2id["<pad>"]] = -float('inf')
        logits[word2id["<unk>"]] = -float('inf')
        # Apply top-k sampling
        top_k_logits, top_k_indices = torch.topk(logits, top_k)
        probabilities = F.softmax(top_k_logits, dim=-1)
        _next = torch.multinomial(probabilities, num_samples=1).item()
        next_token = top_k_indices[_next].item()
        if next_token == word2id["<eos>"]:
            break
        generated.append(next_token)
        tgt = torch.cat([tgt, torch.tensor([[next_token]], device=device)], dim=0)

    return " ".join([id2word[token] for token in generated])

prompt = "Hey This story is all about"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generated_text = generate_text_with_sampling(model, prompt, 128, word2id, id2word, device)
print (generated_text)
generated_text = generate_text_with_sampling(model, prompt, 128, word2id, id2word, device)
print (generated_text)
[4205, 8895, 8414, 4759, 363, 138]
polished jupiter occupant typographical dining questionable personal diligently consumed squeeze jupiter famous jupiter handed elaborate vegetate shock exasperated winged consoled inaccurate favor unable incisions great exempt fear steamer voluntarily rings interference inexplicable batteries caryophyllus whose eighteen paling belonged walnut noted banks rubbing chart asking accomplices link melting cart kentucky congratulate causes fissures ferreted split wilderness hemispheric fissures ferreted subsisted troubled descending subjects footmarks setting ground archipelago pulse hooded d immense strangers very sowed intractable phenomena contingency james armor worked orientation substituting attentively sloth great total whispers instantly undulation helter jutted gentleman mason mustaches stem sewing inestimable organ glandulous tabor greedily couple incognito plated undulation landings kindled streaks ore signifies redistributing reinstating whispers texture corridor bewailed weakness superintendence condiments coat buenos desolate there thoughtfully erections eager pendant defect voracious
[4205, 8895, 8414, 4759, 363, 138]
executed jupiter evaporation dining questionable scattered impressed caryophyllus screws fainting girl picked hardest steered caryophyllus frightfully paling norway impressions female lending fail effervescence osier transported stupor revived mischievous stretching drown soft unprotected ficoide gorge penetrates scotch miners decanted telling dynamite gestures liked thyme fish renewing undulation weakness groans claim remade whitish telling estimation ingenious cone spruce concurrently plumage provoking necessities ontario floes extorting cascade pendants night husbandry vinegar audacious nitric father begin electrical worn sharpened provoking midship thrown inaccurate ferreted tighten compliance prompt superintendence channels lain bricks cushions abbreviation hairs twofold scorpion language into glandulous flabby casing questionable back graze cited diminished naming unavoidably estates up counted eighteen whispers meantime reporter downwards extremity kauries petersburg top mineral certainly try melting cart guns sloth awakened fermented massacring tables polluted
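The helper above restricts sampling to the top-k most likely tokens. A common alternative is nucleus (top-p) sampling combined with a temperature. The sketch below shows only the logits-to-token step and could replace the top-k block inside generate_text_with_sampling; the function name sample_next_token and its default arguments are illustrative, not part of the original notebook.

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_p=0.9):
    # Nucleus (top-p) sampling with temperature on a 1-D logits tensor.
    logits = logits / temperature                           # <1 sharpens, >1 flattens the distribution
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    keep = cumulative - probs < top_p                        # smallest set of tokens covering top_p mass
    keep[0] = True                                           # always keep the single most likely token
    filtered = sorted_logits.masked_fill(~keep, float("-inf"))
    choice = torch.multinomial(F.softmax(filtered, dim=-1), num_samples=1)
    return sorted_idx[choice].item()

Compared with a fixed top-k cutoff, top-p adapts the candidate set to the shape of the distribution: peaked distributions yield few candidates, flat ones yield many.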

28.31. Large Language models using Decoder Transformer architecture.#

28.32. Leveraging unlabeled data with GPT#

Key Changes for GPT

  1. Remove the Encoder: GPT does not have an encoder; it uses only the decoder-like blocks.

  2. Causal Masking: Ensure the self-attention mechanism is masked to prevent tokens from attending to future tokens.

  3. Inputs and Outputs: GPT-style models typically use a single input sequence (not src and tgt) and generate predictions autoregressively.

  4. Simplify the Positional Encoding and Masking Logic: GPT applies positional encoding and self-attention masking on the input sequence directly.

Putting the changes above together, a minimal decoder-only (GPT-style) model looks like this:

import torch
import torch.nn as nn
import math

class GPTTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers, max_len):
        super(GPTTransformer, self).__init__()
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, max_len)

        # Decoder-like blocks: a GPT block is a Transformer decoder layer *without*
        # cross-attention, so in PyTorch it is most easily built from encoder layers
        # plus a causal self-attention mask (nn.TransformerDecoder would require an
        # encoder `memory`, which GPT does not have).
        self.decoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_size,
            nhead=num_heads,
            dim_feedforward=hidden_dim,
            batch_first=True,  # inputs are (batch, seq_len)
        )
        self.decoder = nn.TransformerEncoder(self.decoder_layer, num_layers)

        # Final Linear Layer maps hidden states back to vocabulary logits
        self.fc_out = nn.Linear(embed_size, vocab_size)

    def forward(self, x, tgt_mask=None, tgt_padding_mask=None):
        # Embed and add positional encoding
        x_embed = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)
        x_embed = self.positional_encoding(x_embed)

        # Pass through the stacked self-attention blocks
        output = self.decoder(
            x_embed,
            mask=tgt_mask,                         # causal mask: no attending to future tokens
            src_key_padding_mask=tgt_padding_mask,
        )

        # Map to vocabulary size
        return self.fc_out(output)


class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len):
        super(PositionalEncoding, self).__init__()
        self.encoding = torch.zeros(max_len, embed_size)
        positions = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2) * -(math.log(10000.0) / embed_size))
        self.encoding[:, 0::2] = torch.sin(positions * div_term)
        self.encoding[:, 1::2] = torch.cos(positions * div_term)
        self.encoding = self.encoding.unsqueeze(0)

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.encoding[:, :seq_len, :].to(x.device)


# Utility function for causal mask
def generate_causal_mask(size):
    mask = torch.triu(torch.ones(size, size), diagonal=1)  # Upper triangular mask
    mask = mask.masked_fill(mask == 1, float('-inf'))  # Fill with -inf
    return mask


# Example usage
vocab_size = 10000
embed_size = 512
num_heads = 8
hidden_dim = 2048
num_layers = 6
max_len = 512

model = GPTTransformer(vocab_size, embed_size, num_heads, hidden_dim, num_layers, max_len)
x = torch.randint(0, vocab_size, (32, 50))  # Batch of 32, sequence length 50

# Generate causal mask
tgt_mask = generate_causal_mask(x.size(1)).to(x.device)

# Forward pass
output = model(x, tgt_mask=tgt_mask)
print(output.shape)  # Output: (batch_size, seq_len, vocab_size)
28.33. Using GPT-2 to generate new text#

from transformers import pipeline, set_seed


generator = pipeline('text-generation', model='gpt2')
set_seed(123)
generator("Hey readers, today is",
          max_length=20,
          num_return_sequences=3)
[{'generated_text': 'Hey readers, today is the third day in a row where I am starting to get a little fed'},
 {'generated_text': 'Hey readers, today is a very important weekend, and thanks to all of you, will be a'},
 {'generated_text': 'Hey readers, today is the third day of the New Year after I posted a series on the Internet'}]
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Let us encode this sentence"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input
{'input_ids': tensor([[ 5756,   514, 37773,   428,  6827]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
from transformers import GPT2Model
model = GPT2Model.from_pretrained('gpt2')
print (model)
GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2SdpaAttention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
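To actually run the model, the tokenized input from above can be passed straight in. A quick sanity check of the output shape (the 5 tokens and the hidden size of 768 match the tokenizer output and the architecture printed above):

import torch

with torch.no_grad():
    output = model(**encoded_input)            # GPT2Model returns hidden states, not logits

print(output.last_hidden_state.shape)           # torch.Size([1, 5, 768]): (batch, tokens, hidden_dim)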

28.34. Bidirectional pre-training with BERT#

  • Full Name: Bidirectional Encoder Representations from Transformers

  • Created by: Google Research, 2018 (Devlin et al., https://arxiv.org/abs/1810.04805)

  • Model Size: BERT-Large has about 340M parameters (roughly three times GPT-1's 117M and about 1/5 the size of GPT-2's largest variant); BERT-Base has about 110M

  • Architecture: Transformer encoder-based, utilizing bidirectional (nondirectional) training

    • Encodes context from both preceding and succeeding words

    • Strength: Produces high-quality input encodings for tasks like classification

    • Limitation: Not suited for generative tasks (e.g., sentence generation)

  • Key Components of BERT Encoder:

    1. Token Embedding

    2. Positional Encoding

    3. Segment Embedding (indicates token segment association)

28.35. Training Stages:#

  1. Pre-Training: Focuses on unsupervised tasks.

  2. Fine-Tuning: Adapts the model for specific downstream tasks.

28.36. Pre-Training Tasks#

  1. Masked Language Modeling (MLM):

    • Objective: Predict randomly masked tokens ([MASK]) in the input.

    • Strategy:

      • 15% of tokens selected for masking.

      • Breakdown of handling these tokens:

        • 80% replaced with [MASK].

        • 10% replaced with a random word.

        • 10% left unchanged.

    • Purpose of Adjustments:

      • Avoid inconsistencies between pre-training ([MASK] tokens) and real-world fine-tuning.

      • Preserve token information and prevent “lazy” predictions.

    • Example: the word “fox” might be replaced with [MASK], swapped for a random word such as “coffee,” or left unchanged; in all three cases the model is trained to predict “fox” (see the masking sketch after this list).

  2. Next-Sentence Prediction (NSP):

    • Objective: Classify if sentence B follows sentence A logically.

    • Input Format:

      • [CLS] A [SEP] B [SEP]

        • [CLS]: Denotes start of input and placeholder for classification.

        • [SEP]: Separates sentences.

    • Dataset Balance:

      • 50% of pairs are logically connected (“IsNext”).

      • 50% are random pairs (“NotNext”).
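
A minimal sketch of the 80/10/10 masking scheme described above, written in plain PyTorch. The function mask_tokens and its arguments are illustrative names; for real training, the Hugging Face DataCollatorForLanguageModeling implements the same logic.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    # Select ~15% of (non-special) positions as prediction targets.
    labels = input_ids.clone()
    prob = torch.full(input_ids.shape, mlm_prob)
    prob.masked_fill_(torch.isin(input_ids, torch.tensor(special_ids)), 0.0)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100                      # ignore_index for nn.CrossEntropyLoss

    corrupted = input_ids.clone()
    # 80% of the selected positions become [MASK] ...
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    corrupted[to_mask] = mask_token_id
    # ... 10% become a random token (half of the remaining 20%) ...
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    corrupted[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]
    # ... and the final 10% are left unchanged.
    return corrupted, labels

The model is then trained to predict labels only at the selected positions.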

28.37. Main objective#

  • MLM focuses on contextual understanding within sentences.

  • NSP enables BERT to learn sentence relationships, essential for tasks like question answering and sentence classification.

  • BERT is trained on the combined MLM and NSP objectives; this combined training stage is what is meant by pre-training.

Key Characteristics of BERT

  1. Encoder-only Architecture:

    • BERT uses the stack of encoder layers from the original Transformer model.

    • It is designed for bidirectional attention, meaning each token can attend to all other tokens in the input sequence, both before and after itself.

  2. No Decoder:

    • The decoder is not included in BERT because it is not designed for generative tasks (like language modeling or text generation).

    • Instead, BERT is optimized for tasks like classification, token labeling, and question-answering, where contextual understanding of the input sequence is required.

  3. Input and Output:

    • BERT processes the input sequence as a whole (e.g., [CLS] and [SEP] tokens are used for classification tasks); see the tokenizer example after this list.

    • It does not generate new sequences or predictions token by token.

  4. Pretraining Objective:

    • Masked Language Modeling (MLM): Random tokens in the input are masked, and BERT is trained to predict them using the surrounding context.

    • Next Sentence Prediction (NSP): BERT is trained to determine whether one sentence follows another in a given text.
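
The [CLS]/[SEP] format and the segment (token-type) embedding can be seen directly from the tokenizer. A small illustration using the standard transformers API; the two example sentences are arbitrary:

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = bert_tok("He reads a book.", "She writes a letter.")

# [CLS] comes first, with a [SEP] between and after the two sentences
print(bert_tok.convert_ids_to_tokens(enc["input_ids"]))
# segment embedding ids: 0 for tokens of sentence A, 1 for tokens of sentence B
print(enc["token_type_ids"])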

28.38. BART model#

28.39. Overview#

  • Full Name: Bidirectional and Auto-Regressive Transformer (BART)

  • Developed by: Facebook AI Research, 2019 (Lewis et al., https://arxiv.org/abs/1910.13461)

  • Purpose: Combines strengths of BERT (bidirectional encoder) and GPT (autoregressive decoder).

    • Handles both generation (e.g., summarization, translation) and classification tasks.


28.40. Key Features#

  1. Architecture:

    • Bidirectional Encoder: Context-aware input understanding.

    • Autoregressive Decoder: Sequential output generation.

  2. Corruption-Based Pre-Training:

    • Input text is corrupted (e.g., masking, deletion) before encoding; the infilling example after this list shows the idea.

    • Corruption Techniques:

      • Token masking

      • Token deletion

      • Text infilling

      • Sentence permutation

      • Document rotation

  3. Training Process:

    • Encoder: Processes corrupted input.

    • Decoder: Reconstructs original text autoregressively.

    • Loss: Cross-entropy between predicted and original text.
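
The denoising objective can be tried directly with a pre-trained checkpoint: corrupt a sentence with BART's <mask> token and let the model reconstruct it. A sketch; facebook/bart-base is one publicly available checkpoint, and the exact completion will vary.

from transformers import BartTokenizer, BartForConditionalGeneration

bart_tok = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

corrupted = "The lake was <mask> by granite rocks."        # text-infilling style corruption
batch = bart_tok(corrupted, return_tensors="pt")
ids = bart.generate(**batch, max_length=20)
print(bart_tok.batch_decode(ids, skip_special_tokens=True))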


28.41. Key Applications:#

  1. Sequence Classification: Adds a classification token, similar to [CLS] in BERT.

  2. Token Classification: Directly uses token representations for classification.

  3. Sequence Generation: Generates summaries or answers from context.

  4. Machine Translation: Fine-tunes for translation tasks with added encoder layers.


28.42. Why BART is Unique:#

  • Combines pre-training strategies of BERT (contextual encoding) and GPT (generative capabilities).

  • State-of-the-art results in:

    • Abstractive Summarization

    • Dialogue Response Generation

    • Question Answering

Several notable models use both encoder and decoder architectures, similar to BART, leveraging the strengths of both components for various natural language processing tasks. Here is a list of some key models:

28.43. 1. T5 (Text-to-Text Transfer Transformer)#

  • Developed by: Google Research (2020)

  • Architecture: Unified encoder-decoder transformer.

  • Key Idea: Converts all NLP tasks into a text-to-text format (e.g., summarization, translation, classification).

  • Applications:

    • Text summarization

    • Machine translation

    • Question answering

    • Sentiment analysis

  • Unique Features:

    • Task prefixes (e.g., “translate English to German:”) guide the model for specific tasks.
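
A quick illustration of the task-prefix idea via the text2text-generation pipeline. This is a sketch: t5-small is a small public checkpoint, it needs the sentencepiece package installed, and the outputs are only indicative.

from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The house is wonderful."))
print(t5("summarize: " + "A long news article would go here ..."))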


28.44. 2. mBART (Multilingual BART)#

  • Developed by: Facebook AI Research (2020)

  • Architecture: Extension of BART for multilingual tasks.

  • Key Idea: Pre-trained with denoising tasks in multiple languages, supporting language generation and translation across languages.

  • Applications:

    • Multilingual machine translation

    • Cross-lingual summarization


28.45. 3. PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization)#

  • Developed by: Google Research (2020)

  • Architecture: Encoder-decoder transformer.

  • Key Idea: Pre-trained using a gap-sentence generation task, which focuses on summarization-specific learning.

  • Applications:

    • Abstractive text summarization

    • Question answering


28.46. 4. ProphetNet#

  • Developed by: Microsoft Research (2020)

  • Architecture: Encoder-decoder transformer with a focus on sequence prediction.

  • Key Idea: Predicts future tokens and n-grams to enhance generative capabilities.

  • Applications:

    • Text generation

    • Summarization

    • Translation


28.47. 5. MARGE (Multilingual Autoencoder for Retrieval-Generated Texts)#

  • Developed by: Facebook AI Research (2020)

  • Architecture: Encoder-decoder transformer.

  • Key Idea: Learns from retrieved documents to improve generative and understanding tasks.

  • Applications:

    • Multilingual summarization

    • Language generation


28.48. 6. UNILM (Unified Language Model)#

  • Developed by: Microsoft Research (2019)

  • Architecture: Combines encoder-decoder principles for unidirectional, bidirectional, and sequence-to-sequence tasks.

  • Key Idea: Unified training for both understanding (classification) and generation tasks.

  • Applications:

    • Document summarization

    • Machine translation

    • Question answering


28.49. 7. BARTpho#

  • Developed by: VinAI Research (2021)

  • Architecture: Encoder-decoder pre-trained transformer for the Vietnamese language.

  • Key Idea: Adapts BART architecture for low-resource language processing.

  • Applications:

    • Summarization

    • Translation

    • Language understanding


28.50. 8. MASS (Masked Sequence-to-Sequence Pre-training)#

  • Developed by: Microsoft Research Asia (2019)

  • Architecture: Encoder-decoder transformer pre-trained with a masked language modeling task.

  • Key Idea: Predicts missing segments in sequences to enhance translation and generation tasks.

  • Applications:

    • Machine translation

    • Text generation


28.51. 9. BlenderBot#

  • Developed by: Facebook AI Research (2020)

  • Architecture: Encoder-decoder transformer fine-tuned for dialogue.

  • Key Idea: Pre-trained on large conversational datasets for human-like interaction.

  • Applications:

    • Open-domain dialogue systems

    • Chatbots


28.52. 10. TUNiC (Transformer for Unified Neural Instruction-based Classification)#

  • Developed by: Academia and industry collaborations (recent models).

  • Architecture: Encoder-decoder with task adaptation layers for generalization.

  • Applications:

    • Task-specific text classification

    • Few-shot and zero-shot learning

28.53. Finetuning a DistilBERT Classifier Using the Lightning Trainer#

!pip install datasets
# 1 Loading the Dataset
from datasets import load_dataset
imdb_data = load_dataset("imdb")
print(imdb_data)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
imdb_data = load_dataset("imdb")
print(imdb_data)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

The IMDB movie review set can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset, decompress the files.

A) If you are working with Linux or macOS, open a new terminal window, cd into the download directory, and execute

tar -zxf aclImdb_v1.tar.gz

B) If you are working with Windows, download an archiver such as 7Zip to extract the files from the download archive.

C) Use the following code to download and unzip the dataset via Python

import os
import sys
import tarfile
import time
import urllib.request

source = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
target = "aclImdb_v1.tar.gz"

if os.path.exists(target):
    os.remove(target)


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.0**2 * duration)
    percent = count * block_size * 100.0 / total_size

    sys.stdout.write(
        f"\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB "
        f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()


if not os.path.isdir("aclImdb") and not os.path.isfile("aclImdb_v1.tar.gz"):
    urllib.request.urlretrieve(source, target, reporthook)
if not os.path.isdir("aclImdb"):

    with tarfile.open(target, "r:gz") as tar:
        tar.extractall()
# convert dataframe and save as csv

import os
import sys

import numpy as np
import pandas as pd
from packaging import version
from tqdm import tqdm

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = "aclImdb"

labels = {"pos": 1, "neg": 0}

df = pd.DataFrame()

with tqdm(total=50000) as pbar:
    for s in ("test", "train"):
        for l in ("pos", "neg"):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                    txt = infile.read()

                if version.parse(pd.__version__) >= version.parse("1.3.2"):
                    x = pd.DataFrame(
                        [[txt, labels[l]]], columns=["review", "sentiment"]
                    )
                    df = pd.concat([df, x], ignore_index=True)  # a unique index is needed for the shuffle below

                else:
                    df = df.append([[txt, labels[l]]], ignore_index=True)
                pbar.update()
df.columns = ["text", "label"]
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

28.54. Basic checks#

print("Class distribution:")
np.bincount(df["label"].values)
text_len = df["text"].apply(lambda x: len(x.split()))
text_len.min(), text_len.median(), text_len.max()

28.55. Splitting into training, validation, and test sets#

df_shuffled = df.sample(frac=1, random_state=1).reset_index()

df_train = df_shuffled.iloc[:35_000]
df_val = df_shuffled.iloc[35_000:40_000]
df_test = df_shuffled.iloc[40_000:]

df_train.to_csv("train.csv", index=False, encoding="utf-8")
df_val.to_csv("validation.csv", index=False, encoding="utf-8")
df_test.to_csv("test.csv", index=False, encoding="utf-8")

28.56. Tokenization and numericalization#

imdb_dataset = load_dataset(
    "csv",
    data_files={
        "train": "train.csv",
        "validation": "validation.csv",
        "test": "test.csv",
    },
)

print(imdb_dataset)

Tokenize the dataset

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
print("Tokenizer input max length:", tokenizer.model_max_length)
print("Tokenizer vocabulary size:", tokenizer.vocab_size)
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True)
imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
#del imdb_dataset
imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

28.57. Setup dataloaders#

from torch.utils.data import DataLoader, Dataset


class IMDBDataset(Dataset):
    def __init__(self, dataset_dict, partition_key="train"):
        self.partition = dataset_dict[partition_key]

    def __getitem__(self, index):
        return self.partition[index]

    def __len__(self):
        return self.partition.num_rows
train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=12,
    shuffle=True,
    num_workers=4
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=12,
    num_workers=4
)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=12,
    num_workers=4
)
Initializing DistilBERT
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

28.58. Finetuning with Lightning#

# wrap in lightning module

import lightning as L
import torch
import torchmetrics


class LightningModel(L.LightningModule):
    def __init__(self, model, learning_rate=5e-5):
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model

        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)
        self.test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def forward(self, input_ids, attention_mask, labels):
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("train_loss", outputs["loss"])
        return outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])
        self.log("val_loss", outputs["loss"], prog_bar=True)

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.val_acc(predicted_labels, batch["label"])
        self.log("val_acc", self.val_acc, prog_bar=True)

    def test_step(self, batch, batch_idx):
        outputs = self(batch["input_ids"], attention_mask=batch["attention_mask"],
                       labels=batch["label"])

        logits = outputs["logits"]
        predicted_labels = torch.argmax(logits, 1)
        self.test_acc(predicted_labels, batch["label"])
        self.log("accuracy", self.test_acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer


lightning_model = LightningModel(model)
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.loggers import CSVLogger


callbacks = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]
logger = CSVLogger(save_dir="logs/", name="my-model")
trainer = L.Trainer(
    max_epochs=3,
    callbacks=callbacks,
    accelerator="cpu",
    devices=1,
    logger=logger,
    log_every_n_steps=10,
)

trainer.fit(model=lightning_model,
            train_dataloaders=train_loader,
            val_dataloaders=val_loader)
INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: 
  | Name     | Type                                | Params | Mode 
-------------------------------------------------------------------------
0 | model    | DistilBertForSequenceClassification | 67.0 M | eval 
1 | val_acc  | MulticlassAccuracy                  | 0      | train
2 | test_acc | MulticlassAccuracy                  | 0      | train
-------------------------------------------------------------------------
67.0 M    Trainable params
0         Non-trainable params
67.0 M    Total params
267.820   Total estimated model params size (MB)
2         Modules in train mode
96        Modules in eval mode
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:617: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(
trainer.test(lightning_model, dataloaders=train_loader, ckpt_path="best")
trainer.test(lightning_model, dataloaders=val_loader, ckpt_path="best")
trainer.test(lightning_model, dataloaders=test_loader, ckpt_path="best")
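
Finally, the finetuned classifier can be applied to new text. A minimal inference sketch reusing the tokenizer and lightning_model defined above; the example sentence is arbitrary, and the 0/1 mapping follows the IMDB labels used earlier (1 = positive, 0 = negative).

import torch

text = "This movie was absolutely wonderful."
inputs = tokenizer(text, truncation=True, return_tensors="pt")

lightning_model.model.eval()
with torch.no_grad():
    logits = lightning_model.model(**inputs).logits

print("Predicted label:", torch.argmax(logits, dim=1).item())   # 1 = positive, 0 = negative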