Credits: inspired by Raschka et al., Chapter 15, with substantial changes applied

22. Sequential Data Using Recurrent Neural Networks#


22.1. Warming up#

from IPython.display import Image

# URL of the image from GitHub
img_url2 = "https://github.com/cfteach/NNDL_DATA621/blob/10b57c7be9d3f31989a4f8cca7d21cccbae7754c/DATA621/DATA621/images/Hidden_vs_Output_Recurrence.png"
# Display the image
Image(url=img_url2, width=600)
import torch
import torch.nn as nn

torch.manual_seed(2)

rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

w_xh = rnn_layer.weight_ih_l0  # by default, the weights are initialized randomly
w_hh = rnn_layer.weight_hh_l0
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0

print('W_xh shape:', w_xh.shape)
print('W_hh shape:', w_hh.shape)
print('b_xh shape:', b_xh.shape)
print('b_hh shape:', b_hh.shape)
W_xh shape: torch.Size([2, 5])
W_hh shape: torch.Size([2, 2])
b_xh shape: torch.Size([2])
b_hh shape: torch.Size([2])
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()
print(x_seq.shape)
torch.Size([3, 5])
w_xh
Parameter containing:
tensor([[ 0.1622, -0.1683,  0.1939, -0.0361,  0.3021],
        [ 0.1683, -0.0813, -0.5717,  0.1614, -0.6260]], requires_grad=True)
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()

## output of the simple RNN:
output, hn = rnn_layer(torch.reshape(x_seq, (1, 3, 5)))

# hn corresponds to the hidden state after the last time step

print('\n\n')
print('Output shape:', output.shape)
print('Output tensor:')
print(output)
print('\nHidden shape:', hn.shape)
print('Hidden tensor:')
print(hn)
print('\n\n')

## manually computing the output:
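# The loop below reproduces the tanh update that nn.RNN applies at each time step:
#     h_t = tanh(x_t @ W_xh.T + b_xh + h_{t-1} @ W_hh.T + b_hh)
# with h_{-1} initialized to zeros. The transposes are needed because PyTorch stores
# weight_ih_l0 as (hidden_size, input_size) and weight_hh_l0 as (hidden_size, hidden_size).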
out_man = []
for t in range(3):
    xt = torch.reshape(x_seq[t], (1, 5))
    print(f'Time step {t} =>')
    print('   Input           :', xt.numpy())

    ht = torch.matmul(xt, torch.transpose(w_xh, 0, 1)) + b_xh
    print('   Hidden (manual)           :', ht.detach().numpy())
    #print('   Hidden (PyTorch)       :', hn[:, t].detach().numpy())


    if t > 0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros(ht.shape)

    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)
    print('   \033[1mOutput (manual)           :\033[0m', ot.detach().numpy())
    print('   \033[1mRNN Output  (PyTorch)     :\033[0m', output[:, t].detach().numpy())
    print()
Output shape: torch.Size([1, 3, 2])
Output tensor:
tensor([[[ 0.6642, -0.7906],
         [ 0.8561, -0.9886],
         [ 0.9403, -0.9987]]], grad_fn=<TransposeBackward1>)

Hidden shape: torch.Size([1, 1, 2])
Hidden tensor:
tensor([[[ 0.9403, -0.9987]]], grad_fn=<StackBackward0>)



Time step 0 =>
   Input           : [[1. 1. 1. 1. 1.]]
   Hidden (manual)           : [[ 0.50097895 -0.6559663 ]]
   Output (manual)           : [[ 0.6641613 -0.7906304]]
   RNN Output  (PyTorch)     : [[ 0.6641613 -0.7906304]]

Time step 1 =>
   Input           : [[2. 2. 2. 2. 2.]]
   Hidden (manual)           : [[ 0.9547678 -1.6051545]]
   Output (manual)           : [[ 0.8561063 -0.9886436]]
   RNN Output  (PyTorch)     : [[ 0.8561063 -0.9886436]]

Time step 2 =>
   Input           : [[3. 3. 3. 3. 3.]]
   Hidden (manual)           : [[ 1.4085568 -2.5543427]]
   Output (manual)           : [[ 0.9403312 -0.9987188]]
   RNN Output  (PyTorch)     : [[ 0.9403312 -0.9987188]]
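
As the printouts confirm, the manual recurrence reproduces the nn.RNN outputs step by step, and the returned hn is simply the output at the final time step. A quick check of that last point (an added sketch, assuming the variables from the cells above are still defined):

# For a single-layer, unidirectional RNN, hn[0] equals the last time step of `output`
print(torch.allclose(output[:, -1], hn[0]))   # True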

22.2. Working with the Internet Movie Database (IMDb)#

from IPython.display import Image
%matplotlib inline
import torch
import torch.nn as nn
pip show torch
Name: torch
Version: 2.5.0+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, fastai, timm, torchaudio, torchvision
!pip install datasets
Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
Collecting multiprocess<0.70.17 (from datasets)
Collecting xxhash (from datasets)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-3.0.2 dill-0.3.8 multiprocess-0.70.16 xxhash-3.5.0
from datasets import load_dataset
from torch.utils.data import Dataset, random_split, DataLoader
import torch

# Step 1: Load the IMDB dataset from Hugging Face
imdb = load_dataset("imdb")

# Step 2: Create a PyTorch Dataset wrapper
class IMDBDataset(Dataset):
    def __init__(self, hf_dataset):
        # Hugging Face dataset is passed as an argument
        self.data = hf_dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return the text and the label for each item
        return {
            'text': self.data[idx]['text'],
            'label': self.data[idx]['label']
        }

# Step 3: Wrap Hugging Face datasets into PyTorch-compatible Datasets
train_dataset = IMDBDataset(imdb['train'])
test_dataset = IMDBDataset(imdb['test'])

# Step 4: Split the train dataset into train and validation sets using random_split
torch.manual_seed(1)  # For reproducibility

# Split into 20,000 reviews for training and 5,000 for validation
train_dataset, valid_dataset = random_split(train_dataset, [20000, 5000])

# Step 5: Create DataLoader for each dataset (including the test dataset)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)  # No shuffling for test data

# Step 6: Example of iterating over a test_loader batch
for batch in test_loader:
    print(batch['text'], batch['label'])  # Access text and labels from the test set
    break  # Print just one batch as an example
['I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichΓ©d and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they have to always say "Gene Roddenberry\'s Earth..." otherwise people would not continue watching. Roddenberry\'s ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.', "Worth the entertainment value of a rental, especially if you like action movies. This one features the usual car chases, fights with the great Van Damme kick style, shooting battles with the 40 shell load shotgun, and even terrorist style bombs. All of this is entertaining and competently handled but there is nothing that really blows you away if you've seen your share before.<br /><br />The plot is made interesting by the inclusion of a rabbit, which is clever but hardly profound. Many of the characters are heavily stereotyped -- the angry veterans, the terrified illegal aliens, the crooked cops, the indifferent feds, the bitchy tough lady station head, the crooked politician, the fat federale who looks like he was typecast as the Mexican in a Hollywood movie from the 1940s. All passably acted but again nothing special.<br /><br />I thought the main villains were pretty well done and fairly well acted. By the end of the movie you certainly knew who the good guys were and weren't. There was an emotional lift as the really bad ones got their just deserts. Very simplistic, but then you weren't expecting Hamlet, right? The only thing I found really annoying was the constant cuts to VDs daughter during the last fight scene.<br /><br />Not bad. Not good. Passable 4.", "its a totally average film with a few semi-alright action sequences that make the plot seem a little better and remind the viewer of the classic van dam films. parts of the plot don't make sense and seem to be added in to use up time. the end plot is that of a very basic type that doesn't leave the viewer guessing and any twists are obvious from the beginning. the end scene with the flask backs don't make sense as they are added in and seem to have little relevance to the history of van dam's character. 
not really worth watching again, bit disappointed in the end production, even though it is apparent it was shot on a low budget certain shots and sections in the film are of poor directed quality", "STAR RATING: ***** Saturday Night **** Friday Night *** Friday Morning ** Sunday Night * Monday Morning <br /><br />Former New Orleans homicide cop Jack Robideaux (Jean Claude Van Damme) is re-assigned to Columbus, a small but violent town in Mexico to help the police there with their efforts to stop a major heroin smuggling operation into their town. The culprits turn out to be ex-military, lead by former commander Benjamin Meyers (Stephen Lord, otherwise known as Jase from East Enders) who is using a special method he learned in Afghanistan to fight off his opponents. But Jack has a more personal reason for taking him down, that draws the two men into an explosive final showdown where only one will walk away alive.<br /><br />After Until Death, Van Damme appeared to be on a high, showing he could make the best straight to video films in the action market. While that was a far more drama oriented film, with The Shepherd he has returned to the high-kicking, no brainer action that first made him famous and has sadly produced his worst film since Derailed. It's nowhere near as bad as that film, but what I said still stands.<br /><br />A dull, predictable film, with very little in the way of any exciting action. What little there is mainly consists of some limp fight scenes, trying to look cool and trendy with some cheap slo-mo/sped up effects added to them that sadly instead make them look more desperate. Being a Mexican set film, director Isaac Florentine has tried to give the film a Robert Rodriguez/Desperado sort of feel, but this only adds to the desperation.<br /><br />VD gives a particularly uninspired performance and given he's never been a Robert De Niro sort of actor, that can't be good. As the villain, Lord shouldn't expect to leave the beeb anytime soon. He gets little dialogue at the beginning as he struggles to muster an American accent but gets mysteriously better towards the end. All the supporting cast are equally bland, and do nothing to raise the films spirits at all.<br /><br />This is one shepherd that's strayed right from the flock. *", "First off let me say, If you haven't enjoyed a Van Damme movie since bloodsport, you probably will not like this movie. Most of these movies may not have the best plots or best actors but I enjoy these kinds of movies for what they are. This movie is much better than any of the movies the other action guys (Segal and Dolph) have thought about putting out the past few years. Van Damme is good in the movie, the movie is only worth watching to Van Damme fans. It is not as good as Wake of Death (which i highly recommend to anyone of likes Van Damme) or In hell but, in my opinion it's worth watching. It has the same type of feel to it as Nowhere to Run. Good fun stuff!", "I had high hopes for this one until they changed the name to 'The Shepherd : Border Patrol, the lamest movie name ever, what was wrong with just 'The Shepherd'. This is a by the numbers action flick that tips its hat at many classic Van Damme films. There is a nice bit of action in a bar which reminded me of hard target and universal soldier but directed with no intensity or flair which is a shame. There is one great line about 'being p*ss drunk and carrying a rabbit' and some OK action scenes let down by the cheapness of it all. 
A lot of the times the dialogue doesn't match the characters mouth and the stunt men fall down dead a split second before even being shot. The end fight is one of the better Van Damme fights except the Director tries to go a bit too John Woo and fails also introducing flashbacks which no one really cares about just gets in the way of the action which is the whole point of a van Damme film.<br /><br />Not good, not bad, just average generic action.", "Isaac Florentine has made some of the best western Martial Arts action movies ever produced. In particular US Seals 2, Cold Harvest, Special Forces and Undisputed 2 are all action classics. You can tell Isaac has a real passion for the genre and his films are always eventful, creative and sharp affairs, with some of the best fight sequences an action fan could hope for. In particular he has found a muse with Scott Adkins, as talented an actor and action performer as you could hope for. This is borne out with Special Forces and Undisputed 2, but unfortunately The Shepherd just doesn't live up to their abilities.<br /><br />There is no doubt that JCVD looks better here fight-wise than he has done in years, especially in the fight he has (for pretty much no reason) in a prison cell, and in the final showdown with Scott, but look in his eyes. JCVD seems to be dead inside. There's nothing in his eyes at all. It's like he just doesn't care about anything throughout the whole film. And this is the leading man.<br /><br />There are other dodgy aspects to the film, script-wise and visually, but the main problem is that you are utterly unable to empathise with the hero of the film. A genuine shame as I know we all wanted this film to be as special as it genuinely could have been. There are some good bits, mostly the action scenes themselves. This film had a terrific director and action choreographer, and an awesome opponent for JCVD to face down. This could have been the one to bring the veteran action star back up to scratch in the balls-out action movie stakes.<br /><br />Sincerely a shame that this didn't happen.", "It actually pains me to say it, but this movie was horrible on every level. The blame does not lie entirely with Van Damme as you can see he tried his best, but let's face it, he's almost fifty, how much more can you ask of him? I find it so hard to believe that the same people who put together Undisputed 2; arguably the best (western) martial arts movie in years, created this. Everything from the plot, to the dialog, to the editing, to the overall acting was just horribly put together and in many cases outright boring and nonsensical. Scott Adkins who's fight scenes seemed more like a demo reel, was also terribly underused and not even the main villain which is such a shame because 1) He is more than capable of playing that role and 2) The actual main villain was not only not intimidating at all but also quite annoying. Again, not blaming Van Damme. I will always be a fan, but avoid this one.", "Technically I'am a Van Damme Fan, or I was. this movie is so bad that I hated myself for wasting those 90 minutes. Do not let the name Isaac Florentine (Undisputed II) fool you, I had big hopes for this one, depending on what I saw in (Undisputed II), man.. was I wrong ??! 
all action fans wanted a big comeback for the classic action hero, but i guess we wont be able to see that soon, as our hero keep coming with those (going -to-a-border - far-away-town-and -kill -the-bad-guys- than-comeback- home) movies I mean for God's sake, we are in 2008, and they insist on doing those disappointing movies on every level. Why ??!!! Do your self a favor, skip it.. seriously.", 'Honestly awful film, bad editing, awful lighting, dire dialog and scrappy screenplay.<br /><br />The lighting at is so bad there\'s moments you can\'t even see what\'s going on, I even tried to playing with the contrast and brightness so I could see something but that didn\'t help.<br /><br />They must have found the script in a bin, the character development is just as awful and while you hardly expect much from a Jean-Claude Van Damme film this one manages to hit an all time low. You can\'t even laugh at the cheesy\'ness.<br /><br />The directing and editing are also terrible, the whole film follows an extremely tired routine and fails at every turn as it bumbles through the plot that is so weak it\'s just unreal.<br /><br />There\'s not a lot else to say other than it\'s really bad and nothing like Jean-Claude Van Damme\'s earlier work which you could enjoy.<br /><br />Avoid like the plaque, frankly words fail me in condemning this "film".', 'This flick is a waste of time.I expect from an action movie to have more than 2 explosions and some shooting.Van Damme\'s acting is awful. He never was much of an actor, but here it is worse.He was definitely better in his earlier movies. His screenplay part for the whole movie was probably not more than one page of stupid nonsense one liners.The whole dialog in the film is a disaster, same as the plot.The title "The Shepherd" makes no sense. Why didn\'t they just call it "Border patrol"? The fighting scenes could have been better, but either they weren\'t able to afford it, or the fighting choreographer was suffering from lack of ideas.This is a cheap low type of action cinema.', "Blind Date (Columbia Pictures, 1934), was a decent film, but I have a few issues with this film. First of all, I don't fault the actors in this film at all, but more or less, I have a problem with the script. Also, I understand that this film was made in the 1930's and people were looking to escape reality, but the script made Ann Sothern's character look weak. She kept going back and forth between suitors and I felt as though she should have stayed with Paul Kelly's character in the end. He truly did care about her and her family and would have done anything for her and he did by giving her up in the end to fickle Neil Hamilton who in my opinion was only out for a good time. Paul Kelly's character, although a workaholic was a man of integrity and truly loved Kitty (Ann Sothern) as opposed to Neil Hamilton, while he did like her a lot, I didn't see the depth of love that he had for her character. The production values were great, but the script could have used a little work.", 'I first watched this movie back in the mid/late 80\'s, when I was a kid. We couldn\'t even get all the way through it. The dialog, the acting, everything about it was just beyond lame.<br /><br />Here are a few examples... imagine these spoken real dramatically, way over-acted: "Oreegon? You\'re going to Oreegon? Why would anyone want to go to Oreegon?"<br /><br />"Survivalists? 
Nobody ever told us about any survivalists!"<br /><br />This movie was SO bad, my sister and I rented it again for her 16th birthday party, just so our friends could sit around and laugh at how awful it was. I don\'t think we were able to finish it then either!', 'I saw the Mogul Video VHS of this. That\'s another one of those old 1980s distributors whose catalog I wish I had!<br /><br />This movie was pretty poor. Though retitled "Don\'t Look in the Attic," the main admonition that is repeated in this is "Don\'t go to the villa." Just getting on the grounds of the villa is a bad idea. A character doesn\'t go into the attic until an hour into the movie, and actually should have done it earlier because of what is learned there.<br /><br />The movie starts in Turin, Italy in the 1950s. Two men are fighting, and a woman is telling them the villa is making them do it. One man kills the other, then regrets it, and the woman pulls out the knife and stabs him with it. She flees the villa, and after she\'s left a chair moves by itself (what\'s the point of that?), but when in the garden a hand comes up through the ground and drags he into the earth.<br /><br />From there, it\'s the present day, thirty years later. There\'s a sΓ©ance that appears suddenly and doesn\'t appear to have anything to do with the movie. The children of the woman from the prologue are inheriting the house. The main daughter is played by the same actress who played her mother. At least one of the two men from the prologue seems to reoccur as another character too. She\'s haunted by some warnings not to go to the villa, but they all do, since if they do not use it, they forfeit it. People die. A lawyer who has won all his cases tries to investigate a little. The ending is pretty poor. Why was the family cursed? An unfortunately boring movie.<br /><br />There\'s an amusing small-print disclaimer on the back of the video box that reads "The scenes depicted on this packaging may be an artist\'s impression and may not necessarily represent actual scenes from the film." In this case, the cover of the box is an illustration that does more or less accurately depict the aforementioned woman dragged underground scene, although there are two hands, and the woman is different. It\'s true, sometimes the cover art has nothing to do with the movie. I also recall seeing a reviewer who had a bad movie predictor scale, in which movies with illustrations on the cover instead of photos got at least one point for that.', "A group of heirs to a mysterious old mansion find out that they have to live in it as part of a clause in the will or be disinherited, but they soon find out of its history of everybody whom had lived there before them having either died in weird accidents or having had killed each other.<br /><br />You've seen it all before, and this one is too low-budget and slow paced to be scary, and doesn't have any real surprises in the climax. No special effects or gore to speak of, in fact the only really amusing thing about the whole film is the quality of the English dubbing, which at times is as bad as a cheap martial arts movie.<br /><br />3 out of 10, pretty low in the pecking order of 80's haunted house movies.", 'Now, I LOVE Italian horror films. The cheesier they are, the better. However, this is not cheesy Italian. This is week-old spaghetti sauce with rotting meatballs. It is amateur hour on every level. 
There is no suspense, no horror, with just a few drops of blood scattered around to remind you that you are in fact watching a horror film. The "special effects" consist of the lights changing to red whenever the ghost (or whatever it was supposed to be) is around, and a string pulling bed sheets up and down. Oooh, can you feel the chills? The DVD quality is that of a VHS transfer (which actually helps the film more than hurts it). The dubbing is below even the lowest "bad Italian movie" standards and I gave it one star just because the dialogue is so hilarious! And what do we discover when she finally DOES look in the attic (in a scene that is daytime one minute and night the next)...well, I won\'t spoil it for anyone who really wants to see, but let\'s just say that it isn\'t very "novel"!', "This cheap, grainy-filmed Italian flick is about a couple of inheritors of a manor in the Italian countryside who head up to the house to stay, and then find themselves getting killed off by ghosts of people killed in that house.<br /><br />I wasn't impressed by this. It wasn't really that scary, mostly just the way a cheap Italian film should be. A girl, her two cousins, and one cousin's girlfriend, head to this huge house for some reason (I couldn't figure out why) and are staying there, cleaning up and checking out the place. Characters come in and out of the film, and it's quite boring at points, and the majority of deaths are quite rushed. The girlfriend is hit by a car when fleeing the house after having a dream of her death, and the scene is quite good, but then things get slow again, until a confusing end, when the male cousins are killed together in some weird way, and this weirdo guy (I couldn't figure out who he was during the movie, or maybe I just don't remember) goes after this one girl, attacking her, until finally this other girl kills him off. Hate to give away the ending, but oh well. The female cousin decides to stay at the house and watch over it, and they show scenes of her living there years later. The end. You really aren't missing anything, and anyway, you probably won't find this anywhere, so lucky you.", "I just finished watching this movie and am disappointed to say that I didn't enjoy it a bit. It is so slow Slow and uninteresting. This kid from Harry Potter plays a shy teenager with an rude mother, and then one day the rude mother tells the kid to find a job so that they could accommodate an old guy apparently having no place to live has started to live with his family and therefore the kid goes to work for a old lady. And this old lady who is living all alone teaches him about girls, driving car and life! I couldn't get how an 18 year old guy enjoy spending time with an awful lady in her 80s. Sorry if my comments on this movie has bothered people who might have enjoyed it, I could be wrong as I am not British and may not understand the social and their family structure and way of life. Mostly the movie is made for the British audience.", "Ben, (Rupert Grint), is a deeply unhappy adolescent, the son of his unhappily married parents. His father, (Nicholas Farrell), is a vicar and his mother, (Laura Linney), is ... well, let's just say she's a somewhat hypocritical soldier in Jesus' army. It's only when he takes a summer job as an assistant to a foul-mouthed, eccentric, once-famous and now-forgotten actress Evie Walton, (Julie Walters), that he finally finds himself in true 'Harold and Maude' fashion. 
Of course, Evie is deeply unhappy herself and it's only when these two sad sacks find each other that they can put their mutual misery aside and hit the road to happiness.<br /><br />Of course it's corny and sentimental and very predictable but it has a hard side to it, too and Walters, who could sleep-walk her way through this sort of thing if she wanted, is excellent. It's when she puts the craziness to one side and finds the pathos in the character, (like hitting the bottle and throwing up in the sink), that she's at her best. The problem is she's the only interesting character in the film (and it's not because of the script which doesn't do anybody any favours). Grint, on the other hand, isn't just unhappy; he's a bit of a bore as well while Linney's starched bitch is completely one-dimensional. (Still, she's got the English accent off pat). The best that can be said for it is that it's mildly enjoyable - with the emphasis on the mildly.", 'Every movie I have PPV\'d because Leonard Maltin praised it to the skies has blown chunks! Every single one! When will I ever learn?<br /><br />Evie is a raving Old Bag who thinks nothing of saying she\'s dying of breast cancer to get her way! Laura is an insufferable Medusa filled with The Holy Spirit (and her hubby\'s protΓ©gΓ©)! Caught between these harpies is Medusa\'s dumb-as-a-rock boy who has been pressed into weed-pulling servitude by The Old Bag!<br /><br />As I said, when will I ever learn?<br /><br />I was temporarily lifted out of my malaise when The Old Bag stuck her head in a sink, but, unfortunately, she did not die. I was temporarily lifted out of my malaise again when Medusa got mowed down, but, unfortunately, she did not die. It should be a capital offense to torture audiences like this!<br /><br />Without Harry Potter to kick him around, Rupert Grint is just a pair of big blue eyes that practically bulge out of its sockets. Julie Walters\'s scenery-chewing (especially the scene when she "plays" God) is even more shameless than her character.<br /><br />At least this Harold bangs some bimbo instead of Maude. For that, I am truly grateful. And if you\'re reading this Mr. Maltin, you owe me $3.99!', "Low budget horror movie. If you don't raise your expectations too high, you'll probably enjoy this little flick. Beginning and end are pretty good, middle drags at times and seems to go nowhere for long periods as we watch the goings on of the insane that add atmosphere but do not advance the plot. Quite a bit of gore. I enjoyed Bill McGhee's performance which he made quite believable for such a low budget picture, he managed to carry the movie at times when nothing much seemed to be happening. Nurse Charlotte Beale, played by Jesse Lee, played her character well so be prepared to want to slap her toward the end! She makes some really stupid mistakes but then, that's what makes these low budget movies so good! I would have been out of that place and five states away long before she even considered that it might be a good idea to leave! If you enjoy this movie, try Committed from 1988 which is basically a rip off of this movie.", "Dr Stephens (Micheal Harvey) runs a mental asylum. He has a different approach to the insane. He conducts unorthodox methods of treatment. He treats everyone like family, there are no locks on the patients doors and he lets some of the inmates act out their twisted fantasies. 
He lets Sergeant Jaffee (Hugh Feagin) dress and act as a soldier and Harriet (Camilla Carr) be a mother to a doll, including letting her put it to bed in a cot. Dr. Stevens is outside letting Judge Oliver W. Cameron (Gene Ross) chop a log up with an axe, it turns out to be a bad move as once Dr. Stevens back is turned the Judge plants the axe in his shoulder. Soon after Nurse Charlotte Beale (Rosie Holotik) arrives at the Sanitarium having arranged an interview with Dr. Stevens about a possible job. She is met by the head Nurse, Geraldine Masters (Annabelle Weenick as Anne McAdams) and is offered a trail position. She gets to know and becomes well liked among the patients. However things eventually start to turn sour, the phone lines are cut, an old lady named Mrs. Callingham (Rhea MacAdams) has her tongue cut out and she starts to get a strange feeling that things just aren't right somehow. Then, one night all the Sanitariums dark secrets are violently revealed. Produced and directed by S.f. Brownrigg this film has a great central idea which builds into a cool twist ending, but ultimately is a bit of a chore to sit through because of it's low budget restrictions and a rather slow script by Tim Pope. There are just too many long boring stretches of dialogue by the inmates, not a lot really happens until the final twenty odd minutes. The film has no real visual quality as it's set entirely in the Sanitarium and it's grounds which is basically just a big bland house in the middle of nowhere. There's no graphic gore in it, a few splashes of blood here and there and thats yer lot. There's a bit of nudity, but like the gore not much. The acting is pretty strong, especially Holotik and Weenick. The photography is flat and unexciting and I can't even remember what the music was like. The twist ending is great, but it just takes far too long to get to it. A film that had a lot of potential that was probably held back by it's budget. OK I guess, but I think it would have worked a lot better if the story had been turned into a half an hour 'Tales form the Crypt' episode.", "The Forgotten (AKA: Don't Look In The Basement) is a very cheaply made and very old looking horror movie.<br /><br />The story is very slow and never really reaches anything worth getting excited about.<br /><br />The patients at the asylum are embarrassingly funny especially Sam and the old woman who always quotes an old saying to everyone. (Look out for the bit when she gets close to the camera, tell me you can watch without laughing!).<br /><br />Now the gore is very poor looking, with the blood looking pink in many scenes so it doesn't really deserve its place on the video nasties list!.<br /><br />Overall if you aren't looking for a fantastic horror film and have some time to spare then it's worth a watch.", 'This movie had a very unique effect on me: it stalled my realization that this movie REALLY sucks! It is disguised as a "thinker\'s film" in the likes of Memento and other jewels like that, but at the end, and even after a few minutes, you come to realize that this is nothing but utter pretentious cr4p. Probably written by some collage student with friends to compassionate to tell him that his writing sucks. 
The whole idea is \x85 I don\'t even know if it tried to scratch on the supernatural, or they want us to believe that because someone fills your mind (a very weak one, btw) with stupid "riddles", the kind you learn on elementary school recess, you suddenly come to the "one truth" about everything, then you have to kill someone and confess\x85. !!! What? How, what, why, WHY? Is just like saying that to make a cake, just throw a bunch of ingredients, and add water\x85 forgot about cooking it? I guess these guys forgot to, not explain, but present the mechanism of WHY was this happening? You have to do that when you present a story which normal, everyday acts (lie solving riddle rhymes) start to have an abnormal effect on people. Acting was horrible, with that girl always trying to look cute at the camera, and the guy from Highlanders, the series, acting up like the though heavy metal record store (yeah, they\'re all real though s-o-b\'s). The "menacing" atmosphere, with the "oh-so-clever" riddles (enter the 60\'s series of Batman and Robin, with guest appearance of The Riddle) and the crazies who claim to have "the knowledge" behind that smirk on their faces\x85 just horrible, HORRIBLE.<br /><br />I\'m usually very partial about low budget movies, and tend to root for the underdog by giving them more praise than they may deserve, in lieu of their constrictions, you know, but this is just an ugly excuse for a movie that will keep you wanting to be good for an hour and a half, and at the end you will just lament that you fell for it.', 'too bad this movie isn\'t. While "Nemesis Game" is mildly entertaining, I found it hard to suspend my disbelief the whole length of the movie, especially the situations that Sara was putting herself into. Are we supposed to believe that:<br /><br />1) this hot chick is going to go slumming unarmed around abandoned buildings and dark subway tunnels in the middle of the night just to solve some riddles?<br /><br />2) the protagonists are supposedly such experts that they play riddle games for fun, but don\'t put the whole "I Never Sinned" riddle together until the very end...and then...and then...get this...she has to do the whole mirror thing to finally put the pieces together?? I know it was the filmmaker\'s device to show the audience what was going on, but do they really think we\'re that stupid?<br /><br />3) when Vern and Sara go to the Chez M to question the blonde, there is not ONE topless chick in the whole building. Nada. C\'mon. I know it\'s Canada, but I would expect more from a country that gave us Shannon Tweed.<br /><br />And anyone else notice that when Vern was surfing the Web and found that riddlezone site, that when he moused over the link the cursor stayed an arrow, and didn\'t turn into a little hand (LIKE ALL CURSORS DO WHEN YOU CLICK ON A HYPERLINK)?!? I mean, if you\'re gonna have the internet play such a prominent role in your movie, at least get the little things right. Geez.', 'I of course saw the previews for this at the beginning of some other Lion\'s Gate extravaganza, so of course it was only the best parts and therefore looked intriguing. And it is, to a point. A young college student (Sarah)is finding riddles all over the place and is becoming obsessed with answering them, and in doing so she\'s unwittingly becoming involved in some game. Now that\'s fairly intriguing right there but unfortunately it all gets rather muddled and becomes so complicated that the viewer (like myself) will most likely become frustrated. 
Characters appear with little introduction and you\'re not really sure who they are or why Sarah knows them or is hanging out with them. All of this has something to do with this woman who tried to drown a young boy years ago and her reason for that was that it\'s "all part of the design". In reality, it\'s all part of the "very sketchy script" and when the film is over you\'ll find yourself feeling that you\'ve lost about an hour and a half of your life that you want back for more productive uses of your time, like cleaning the bathroom, for instance. 4 out of 10.', "I gave this a 3 out of a possible 10 stars.<br /><br />Unless you like wasting your time watching an anorexic actress, in this film it's Carly Pope, behaving like a ditz, don't bother.<br /><br />Carly Pope plays Sara Novak, a young college student, who becomes intrigued with a game of riddles, that leads her down into subway tunnels underneath the city - a dangerous thing for even a well-armed man to go in alone.<br /><br />There are various intrigues in the film -- a weirdo classmate who is apparently stalking Sara, a cynical shopkeeper who runs some kind of offbeat hole-in-the-wall establishment that appears to be located in the back alley of a ghetto, a nerdish dim-wit that hangs around the cynic's shop, and a woman named Emily Gray, who is back in prison.<br /><br />Sara's father is a lawyer who is handling Emily Gray's case. <br /><br />A few years back, Emily Gray attempted to drown a 12 year old boy. Emily was put in a mental hospital for 5 years, and for some cockeyed reason they let her out again, even though it is obvious she is still dangerously deranged.<br /><br />The only explanation Emily has ever given for her crime is: I never sinned.<br /><br />It's all part of the design.<br /><br />Well, my friend, don't expect to ever get any better explanation than that, because you won't.", "I was looking forward to this movie. Trustworthy actors, interesting plot. Great atmosphere then ????? IF you are going to attempt something that is meant to encapsulate the meaning of life. First. Know it. OK I did not expect the directors or writers to actually know the meaning but I thought they may have offered crumbs to peck at and treats to add fuel to the fire-Which! they almost did. Things I didn't get. A woman wandering around in dark places and lonely car parks alone-oblivious to the consequences. Great riddles that fell by the wayside. The promise of the knowledge therein contained by the original so-called criminal. I had no problem with the budget and enjoyed the suspense. I understood and can wax lyrical about the fool and found Adrian Pauls role crucial and penetrating and then ????? Basically the story line and the script where good up to a point and that point was the last 10 minutes or so. What? Run out of ideas! Such a pity that this movie had to let us down so badly. It may not comprehend the meaning and I really did not expect the writers to understand it but I was hoping for an intellectual, if not spiritual ride and got a bump in the road", 'Four things intrigued me as to this film - firstly, it stars Carly Pope (of "Popular" fame), who is always a pleasure to watch. Secdonly, it features brilliant New Zealand actress Rena Owen. Thirdly, it is filmed in association with the New Zealand Film Commission. Fourthly, a friend recommended it to me. However, I was utterly disappointed. The whole storyline is absurd and complicated, with very little resolution. 
Pope\'s acting is fine, but Owen is unfortunately under-used. The other actors and actresses are all okay, but I am unfamiliar with them all. Aside from the nice riddles which are littered throughout the movie (and Pope and Owen), this film isn\'t very good. So the moral of the story is...don\'t watch it unless you really want to.', '<br /><br />Never ever take a film just for its good looking title.<br /><br />Although it all starts well, the film suffers the same imperfections you see in B-films. Its like at a certain moment the writer does not any more how to end the film, so he ends it in a way nobody suspects it thinking this way he is ingenious.<br /><br />A film to be listed on top of the garbage list.<br /><br />', "Lowe returns to the nest after, yet another, failed relationship, to find he's been assigned to jury duty. It's in the plans to, somehow, get out of it, when he realizes the defendant is the girl he's had a serious crush on since the first grade.<br /><br />Through living in the past by telling other people about his feelings towards this girl (played by Camp), Lowe remembers those feelings and does everything in his power to clear Camp of attempted murder, while staying away from the real bad guys at the same time, and succeeding in creating a successful film at the same time.<br /><br />I've heard that St Augustine is the oldest city in the US, and I also know it has some ties to Ponce de Leon, so the backdrop is a good place to start. Unfortunately, it's the only thing good about this movie. The local police are inept, the judge is an idiot, and the defense counsel does everything in her power to make herself look like Joanie Cunningham! I don't know whether to blame the director for poor direction, or for just letting the cast put in such a hapless effort.<br /><br />In short, this movie was so boring, I could not even sleep through it! 1 out of 10 stars!", 'Seriously, I can\'t imagine how anyone could find a single flattering thing to say about this movie, much less find it in themselves to write the glowing compliments contained in this comment section. How many methamphetamines was Bogdonovitch on during the filming of this movie? Was he giving a bonus to the actor that spat his lines out with the most speed and least inflection or thought? The dialogue is bad, the plot atrocious, even for a "screwball" comedy, and claims that the movie is an homage to classic film comedy is about the most inane thing I\'ve ever heard. The cinematography is below the quality and innovation of that exhibited by the worst made-for-TV movies, the acting is awful (although I get the feeling that the fault for that lies squarely in the lap of the director), and speaking of which, did I mention the direction is so haphazard and inscrutable that it defies the definition of the word? The whole thing is a terribly unfunny (even in the much-beleaguered world of so-bad-it\'s-funny clunkers), soul-sucking, waste of two hours of your life that you\'ll never get back. Be afraid, be very afraid...'] tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
# Counter: subclass of Python's dictionary used for counting hashable objects, in this case, tokens (words).
# OrderedDict: subclass of Python's dictionary that remembers the insertion order of keys. It is used to store tokens in a specific order based on frequency.
from collections import Counter, OrderedDict
# re: A module for working with regular expressions, used to manipulate and clean text.
import re

# Step 1: Token counts and vocab creation
# Initializes an empty Counter object to hold the frequency of each token in the dataset.
token_counts = Counter()

# Define tokenizer
def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)  # Remove HTML tags

    # Extract emoticons such as :), ;-( or =D from the lowercased text
    # (lowercasing means uppercase variants such as ':D' are not matched by the pattern below)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())

    # \W is regex shorthand for any non-word character; replace runs of them with a space,
    # lowercase the text, and append the emoticons (with their '-' "noses" removed)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')

    #  creates a list of words (tokens)
    tokenized = text.split()

    return tokenized
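
# Illustrative sanity check (not in the original notebook): the tokenizer strips the HTML tag,
# lowercases the words, and appends the emoticon at the end of the token list.
print(tokenizer("This movie was GREAT :) <br />Loved it!"))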

# Step 2: Tokenize the training data and populate token_counts
for entry in train_dataset:  # each entry is a dict with 'text' and 'label' (see IMDBDataset above)
    line = entry['text']
    tokens = tokenizer(line)
    token_counts.update(tokens)

# Step 3: Sort tokens by frequency
# token_counts.items() returns the tokens and their respective counts as a list of tuples (e.g., [(token1, count1), (token2, count2), ...])
# key=lambda x: x[1] means that the sorting is based on the count (x[1]), which is the second element of each tuple
# reverse=True means that the most frequent tokens appear first in the sorted list.
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)

# Step 4: Limit the vocabulary to the top 69,023 tokens (including the two special tokens)
# The padding token (<pad>) is used to ensure that all sequences in a batch have the same length.
# The unknown token (<unk>) represents words that are not in the vocabulary (here, the top 69,021 words).
# Any word that does not appear in the vocabulary is replaced by the <unk> token during encoding.
# This is critical for handling unseen words at inference time, when the model encounters words
# that were not present in the training data.
limited_sorted_by_freq_tuples = sorted_by_freq_tuples[:69021]  # Top 69,021 + <pad> and <unk> = 69,023

# Step 5: Create an ordered dictionary for the vocab
ordered_dict = OrderedDict(limited_sorted_by_freq_tuples)

# Step 6: Create vocab dictionary with special tokens
# Initializes the vocab dictionary with two special tokens
vocab = {"<pad>": 0, "<unk>": 1}

for idx, (token, count) in enumerate(ordered_dict.items(), start=2):  # Start from 2 to skip the special tokens
    vocab[token] = idx


# Print the vocabulary size (should be 69023)
print('Vocab-size:', len(vocab))

# --- Rationale:
#
# By assigning frequent words lower indices, we can optimize memory and computational efficiency.
# Words that appear infrequently can either be assigned higher indices (in case we want to keep them) or omitted from the vocabulary entirely.
Vocab-size: 69023
"""
from collections import OrderedDict

# Step 1: Sort the tokens by frequency and create an ordered dictionary
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

# Step 2: Create the vocab dictionary manually
vocab = {"<pad>": 0, "<unk>": 1}
for idx, (token, count) in enumerate(ordered_dict.items(), start=2):
    vocab[token] = idx
"""
# Use the vocab to encode a list of tokens
def encode(tokens):
    #If the token does not exist in the vocab, the function returns the index of the <unk>
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]
# Example usage
print(encode(['this', 'is', 'an', 'example']))  # Should output something like [11, 7, 35, 457]
[11, 7, 35, 457]
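For debugging it can also help to map indices back to tokens. A minimal sketch (the inverse mapping and decode helper below are additions, not part of the original notebook):

# Invert the vocab so tokens can be looked up by index
inv_vocab = {idx: token for token, idx in vocab.items()}

def decode(indices):
    return [inv_vocab.get(i, "<unk>") for i in indices]

print(decode(encode(['this', 'is', 'an', 'example'])))  # ['this', 'is', 'an', 'example']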
if not torch.cuda.is_available():
    print("Warning: this code may be very slow on CPU")
import torch
import torch.nn as nn
import re

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use the manual vocab creation process from earlier
# Assuming `vocab` and `tokenizer` are already defined

# text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# Updated text pipeline: fall back to <unk> for out-of-vocabulary tokens
text_pipeline = lambda x: [vocab.get(token, vocab["<unk>"]) for token in tokenizer(x)]

# Label processing: IMDb labels are already 0/1 integers
label_pipeline = lambda x: float(x)  # Convert to float so the labels match the model output


# Batch collation function
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []
    for entry in batch:  # Each 'entry' is a dictionary with 'text' and 'label'
        _label = entry['label']
        _text = entry['text']

        # Process labels and text
        label_list.append(label_pipeline(_label))  # Convert labels using label_pipeline
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)  # Convert text to indices

        # Store processed text and its length
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))

    # Convert lists to tensors and pad sequences
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)

    return padded_text_list.to(device), label_list.to(device), lengths.to(device)
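
# Note (addition, not in the original notebook): the per-sequence lengths returned here are what
# you would later pass to nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(),
# batch_first=True, enforce_sorted=False) so the RNN skips the <pad> positions.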
#-----  Example usage with DataLoader -----#
## Take a small batch

dataloader = DataLoader(train_dataset, batch_size=4, shuffle=False, collate_fn=collate_batch)
text_batch, label_batch, length_batch = next(iter(dataloader))

# Print the output batch
print("Text batch:", text_batch)
print("Label batch:", label_batch)
print("Length batch:", length_batch)
print("Text batch shape:", text_batch.shape)
Text batch: tensor([[   35,  1739,     7,   449,   721,     6,   301,     4,   787,     9,
             4,    18,    44,     2,  1705,  2460,   186,    25,     7,    24,
           100,  1874,  1739,    25,     7, 34415,  3568,  1103,  7517,   787,
             5,     2,  4991, 12401,    36,     7,   148,   111,   939,     6,
         11598,     2,   172,   135,    62,    25,  3199,  1602,     3,   928,
          1500,     9,     6,  4601,     2,   155,    36,    14,   274,     4,
         42945,     9,  4991,     3,    14, 10296,    34,  3568,     8,    51,
           148,    30,     2,    58,    16,    11,  1893,   125,     6,   420,
          1214,    27, 14542,   940,    11,     7,    29,   951,    18,    17,
         15994,   459,    34,  2480, 15211,  3713,     2,   840,  3200,     9,
          3568,    13,   107,     9,   175,    94,    25,    51, 10297,  1796,
            27,   712,    16,     2,   220,    17,     4,    54,   722,   238,
           395,     2,   787,    32,    27,  5236,     3,    32,    27,  7252,
          5118,  2461,  6390,     4,  2873,  1495,    15,     2,  1054,  2874,
           155,     3,  7015,     7,   409,     9,    41,   220,    17,    41,
           390,     3,  3925,   807,    37,    74,  2858,    15, 10297,   115,
            31,   189,  3506,   667,   163,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [  216,   175,   724,     5,    11,    18,    10,   226,   110,    14,
           182,    78,     8,    13,    24,   182,    78,     8,    13,   166,
           182,    50,   150,    24,    85,     2,  4031,  5935,   107,    96,
            28,  1867,   602,    19,    52,   162,    21,  1698,     8,     6,
          1181,   367,     2,   351,    10,   140,   419,     4,   333,     5,
          6022,  7136,  5055,  1209, 10892,    32,   219,     9,     2,   405,
          1413,    13,  4031,    13,  1099,     7,    85,    19,     2,    20,
          1018,     4,    85,   565,    34,    24,   807,    55,     5,    68,
           658,    10,   507,     8,     4,   668,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0],
        [   10,   121,    24,    28,    98,    74,   589,     9,   149,     2,
          7372,  3030, 14543,  1012,   520,     2,   985,  2327,     5, 16847,
          5479,    19,    25,    67,    76,  3478,    38,     2,  7372,     3,
            25,    67,    76,  2951,    34,    35, 10893,   155,   449, 29495,
         23725,    10,    67,     2,   554,    12, 14543,    67,    91,     4,
            50,    20,    19,     8,    67,    24,  4228,     2,  2142,    37,
            33,  3478,    87,     3,  2564,   160,   155,    11,   634,   126,
            24,   158,    72,   286,    13,   373,     2,  4804,    19,     2,
          7372,  6794,     6,    30,   128,    73,    48,    10,   886,     8,
            13,    24,     4,    85,    20,    19,     8,    13,    35,   218,
             3,   428,   710,     2,   107,   936,     7,    54,    72,   223,
             3,    10,    96,   122,     2,   103,    54,    72,    82,     2,
           658,   202,     2,   106,   293,   103,     7,  1193,     3,  3031,
           708,  5760,     3,  2918,  3991,   706,  3327,   349,   148,   286,
            13,   139,     6,     2,  1501,   750,    29,  1407,    62,    65,
          2612,    71,    40,    14,     4,   547,     9,    62,     8,  7943,
            71,    14,     2,  5687,     5,  4868,  3111,     6,   205,     2,
            18,    55,  2075,     3,   403,    12,  3111,   231,    45,     5,
           271,     3,    68,  1400,     7,  9774,   932,    10,   102,     2,
            20,   143,    28,    76,    55,  3810,     9,  2723,     5,    12,
            10,   379,     2,  7372,    15,     4,    50,   710,     8,    13,
            24,   887,    32,    31,    19,     8,    13,   428],
        [18923,     7,     4,  4753,  1669,    12,  3019,     6,     4, 13906,
           502,    40,    25,    77,  1588,     9,   115,     6, 21713,     2,
            90,   305,   237,     9,   502,    33,    77,   376,     4, 16848,
           847,    62,    77,   131,     9,     2,  1580,   338,     5, 18923,
            32,     2,  1980,    49,   157,   306, 21713,    46,   981,     6,
         10298,     2, 18924,   125,     9,   502,     3,   453,     4,  1852,
           630,   407,  3407,    34,   277,    29,   242,     2, 20200,     5,
         18923,    77,    95,    41,  1833,     6,  2105,    56,     3,   495,
           214,   528,     2,  3479,     2,   112,     7,   181,  1813,     3,
           597,     5,     2,   156,   294,     4,   543,   173,     9,  1562,
           289, 10038,     5,     2,    20,    26,   841,  1392,    62,   130,
           111,    72,   832,    26,   181, 12402,    15,    69,   183,     6,
            66,    55,   936,     5,     2,    63,     8,     7,    43,     4,
            78, 23726, 15995,    13,    20,    17,   800,     5,   392,    59,
          3992,     3,   371,   103,  2596,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0]],
       device='cuda:0')
Label batch: tensor([1., 1., 1., 0.], device='cuda:0')
Length batch: tensor([165,  86, 218, 145], device='cuda:0')
Text batch shape: torch.Size([4, 218])
#------------------------------------- TESTS -------------------------------------
# Check the first few labels in the dataset
for i in range(4):
    print(f"Label {i}: {train_dataset[i]['label']}")
Label 0: 1
Label 1: 1
Label 2: 1
Label 3: 0
# Debug tokenization and vocab lookup
sample_text = "This is an example sentence."
tokens = tokenizer(sample_text)
print("Tokens:", tokens)  # Should print tokenized text

token_indices = [vocab.get(token, vocab["<unk>"]) for token in tokens]
print("Token indices:", token_indices)  # Should print the vocab indices for each token
Tokens: ['this', 'is', 'an', 'example', 'sentence']
Token indices: [11, 7, 35, 457, 4063]
# Access both text and label for the first 5 entries
for i in range(5):
    entry = train_dataset[i]
    print(f"Text: {entry['text']}, Label: {entry['label']}")
Text: An extra is called upon to play a general in a movie about the Russian Revolution. However, he is not any ordinary extra. He is Serguis Alexander, former commanding general of the Russia armies who is now being forced to relive the same scene, which he suffered professional and personal tragedy in, to satisfy the director who was once a revolutionist in Russia and was humiliated by Alexander. It can now be the time for this broken man to finally "win" his penultimate battle. This is one powerful movie with meticulous direction by Von Sternberg, providing the greatest irony in Alexander's character in every way he can. Jannings deserved his Oscar for the role with a very moving performance playing the general at his peak and at his deepest valley. Powell lends a sinister support as the revenge minded director and Brent is perfect in her role with her face and movements showing so much expression as Jannings' love. All around brilliance. Rating, 10., Label: 1
Text: almost every review of this movie I'd seen was pretty bad. It's not pretty bad, it's actually pretty good, though not great. The Judy Garland character could have gotten annoying, but she didn't allow it to. Somewhere along the line, i've become a fan of brooding, overbearing, overacting Van Heflin, at least in the early 40's. Judy's singing is great, but the film missed a great chance by not showing more of their relationship. I gave it a 7., Label: 1
Text: I did not have too much interest in watching The Flock.Andrew Lau co-directed the masterpiece trilogy of Infernal Affairs but he had been fired from The Flock and he had been replaced by an emergency director called Niels Mueller.I had the feeling that Lau had made a good film but it had not satisfied the study,so they fired him and hired another director.This usually does not work well (let's remember The Invasion).But The Flock resulted to be better than what I expected.It's not a great film but it's an interesting and entertaining thriller.The character development is very well done and I could know the characters very well.Also,the relationship between the two main characters is natural and credible.Richard Gere and Claire Danes bring competent performances.Now,let's go to the negative points.One element which really bothered me (there was a moment in which it irritated me) was the excess of edition tricks to give the movie more "attitude" and style.That tricks feel out of place and their presence is arbitrary.Plus,I think the film should have been more ambitious.In spite of that,I recommend The Flock as a good thriller.It's not memorable at all,but it's entertaining., Label: 1
Text: Ulises is a literature teacher that arrives to a coastal town. There, he will fell in love to Martina, the most beautiful girl in town. They will start a torrid romance which will end in the tragic death of Ulises at the sea. Some years later, Martina has married to Sierra, the richest man in town and lives a quiet happy live surrounded by money. One day, the apparition of Ulises will make her passion to rise up and act without thinking the consequences. The plot is quite absurd and none of the actors plays a decent part. IN addition, three quarters of the film are sexual acts, which, still being well filmed, are quite tiring, as we want to see More development of the story. It is just a bad Bigas Luna's film, with lots of sex, no argument and stupid characters everywhere., Label: 0
Text: I found The FBI Story considerably entertaining and suitably upbeat for my New Years Day holiday viewing. Its drama and action-packed episodes were thrilling. The Hardesty character was well drawn and admirable. Overall the photography, script and direction was perfectly creditable. Rather than taking the film to be a repugnant piece of propaganda, as some might, I enjoyed it as a well mounted portrayal of the necessity of ingenious minds and brave bodies in the fight against crime. Again, the depiction of a family holding together even under the strain of the husband's commitment to his (arguably) important work, I did not find to be a twee representation but an ideal and exemplary one., Label: 1
#------------------------------------------------------------------------------
## Batching the datasets

batch_size = 32

train_dl = DataLoader(train_dataset, batch_size=batch_size,
                      shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(valid_dataset, batch_size=batch_size,
                      shuffle=False, collate_fn=collate_batch)
test_dl = DataLoader(test_dataset, batch_size=batch_size,
                     shuffle=False, collate_fn=collate_batch)

22.3. Embedding layers for sentence encoding#

  • input_dim: number of words in the vocabulary, i.e. maximum integer index + 1
    (num_embeddings in nn.Embedding).

  • output_dim: size of the dense embedding vector produced for each token
    (embedding_dim in nn.Embedding).

  • input_length: the length of the (padded) sequences

    • for example, 'This is an example' -> [0, 0, 0, 0, 0, 0, 3, 1, 8, 9]
      => input_length is 10

  • When calling the layer, it takes integer values as input, and the embedding
    layer converts each integer into a float vector of size [output_dim]
    (see the shape check below):

    • If input shape is [BATCH_SIZE], output shape will be [BATCH_SIZE, output_dim]

    • If input shape is [BATCH_SIZE, 10], output shape will be [BATCH_SIZE, 10, output_dim]
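The shape behaviour in the last two bullets can be checked directly. A minimal sketch with illustrative sizes only (vocabulary of 10 tokens, output_dim = 3), independent of the IMDb vocabulary built earlier:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)

flat_input   = torch.LongTensor([3, 1, 8, 9])                   # shape [4]
padded_batch = torch.LongTensor([[0, 0, 3, 1], [5, 2, 4, 0]])   # shape [2, 4]

print(emb(flat_input).shape)    # torch.Size([4, 3])    -> [BATCH_SIZE, output_dim]
print(emb(padded_batch).shape)  # torch.Size([2, 4, 3]) -> [BATCH_SIZE, seq_len, output_dim]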

# URL of the image from GitHub
img_url = "https://raw.githubusercontent.com/cfteach/NNDL_DATA621/860d5f694ae8642fad21fba91e1a233e44a74881/DATA621/DATA621/images/Indexing_comments_IMDB.png"

# Display the image
Image(url=img_url, width = 600)
# The embedding layer in the code is used for converting integer indices
# (which represent words or tokens in the vocabulary) into dense vector representations,
# also known as embeddings

# ---> words with similar meanings or roles (based on the task) end up with similar vector representations

# e.g.,
embedding = nn.Embedding(num_embeddings=10, # This specifies the size of the vocabulary. In this example, it has 10 indices (from 0 to 9)
                         embedding_dim=3, # Each token in the vocabulary will be represented by a 3-dimensional vector.
                         padding_idx=0) # This is used to specify that index 0 is reserved for padding.
                                        #The embedding corresponding to the index 0 will be all zeros, and it will not be updated during training.
                                        # Padding is typically used when input sequences of varying lengths are padded with zeros
                                        # to make them of the same length.

# a batch of 2 samples of 4 indices each
text_encoded_input = torch.LongTensor([[1,2,4,5],[4,3,2,0]])
print(embedding(text_encoded_input))

# The embedding layer can be thought of as a parameterized layer in a neural network.
#  Unlike fully connected layers where you often apply activation functions, it is purely a lookup operation followed by gradient-based learning.
tensor([[[ 0.3430, -0.5329, -0.7423],
         [-0.3842,  0.4307, -0.5028],
         [ 0.5857, -0.2052,  2.7972],
         [ 1.0885,  0.5652,  0.2847]],

        [[ 0.5857, -0.2052,  2.7972],
         [ 0.4700,  1.9600, -0.3665],
         [-0.3842,  0.4307, -0.5028],
         [ 0.0000,  0.0000,  0.0000]]], grad_fn=<EmbeddingBackward0>)
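Because the layer is purely a lookup, calling it is equivalent to indexing its weight matrix. A small check, reusing the embedding layer defined in the cell above:

# Calling the layer and indexing its weight matrix give the same vectors
idx = torch.LongTensor([1, 4])
print(torch.allclose(embedding(idx), embedding.weight[idx]))  # True
print(embedding.weight.shape)  # torch.Size([10, 3]): one row per vocabulary index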

22.4. Building an RNN model#

  • RNN layers (the output and hidden-state shapes they return are illustrated in the sketch after this list):

    • nn.RNN(input_size, hidden_size, num_layers=1)

    • nn.LSTM(..)

    • nn.GRU(..)

    • nn.RNN(input_size, hidden_size, num_layers=1, bidirectional=True)
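All of these layers return (output, hidden); for nn.LSTM the second element is a tuple (hidden, cell). A quick shape reference with illustrative sizes only (batch of 5 sequences, 3 time steps, 8 input features, hidden size 16):

x = torch.randn(5, 3, 8)  # [batch, time steps, features]

rnn    = nn.RNN(8, 16, num_layers=2, batch_first=True)
lstm   = nn.LSTM(8, 16, batch_first=True)
bi_rnn = nn.RNN(8, 16, batch_first=True, bidirectional=True)

out, h = rnn(x)
print(out.shape, h.shape)           # [5, 3, 16] [2, 5, 16]  (one hidden state per layer)

out, (h, c) = lstm(x)
print(out.shape, h.shape, c.shape)  # [5, 3, 16] [1, 5, 16] [1, 5, 16]

out, h = bi_rnn(x)
print(out.shape, h.shape)           # [5, 3, 32] [2, 5, 16]  (one hidden state per direction)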

"""
## EXAMPLE
## with simple RNN layer

# Fully connected neural network with one hidden layer
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.RNN(input_size, # specifies the number of features
                          hidden_size,
                          num_layers=2,
                          batch_first=True)
        #self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        #self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        _, hidden = self.rnn(x)
        out = hidden[-1, :, :] # select the hidden state of the last layer, for all (:) sequences in the batch and all (:) features of the hidden state
        out = self.fc(out)
        return out

model = RNN(64, 32)

print(model)


model(torch.randn(5, 3, 64)) # batch of 5 sequences, 3 time steps, 64 features
"""
'\n## An example of building a RNN model\n## with simple RNN layer\n\n# Fully connected neural network with one hidden layer\nclass RNN(nn.Module):\n    def __init__(self, input_size, hidden_size):\n        super().__init__()\n        self.rnn = nn.RNN(input_size, # specifies the number of features\n                          hidden_size, \n                          num_layers=2, \n                          batch_first=True)\n        #self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)\n        #self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)\n        self.fc = nn.Linear(hidden_size, 1)\n        \n    def forward(self, x):\n        _, hidden = self.rnn(x)\n        out = hidden[-1, :, :]\n        out = self.fc(out)\n        return out\n\nmodel = RNN(64, 32) \n\nprint(model) \n \n\nmodel(torch.randn(5, 3, 64)) # 5 batches, 3 times steps, 64 features\n'

22.5. Building an RNN model for the sentiment analysis task#

# In the following model, the embedding layer produces the feature vectors (of size embed_dim) that are fed into the recurrent layer

class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size,
                                      embed_dim,
                                      padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size,
                           batch_first=True)
        self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, lengths.cpu().numpy(), enforce_sorted=False, batch_first=True)
        out, (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out

vocab_size = len(vocab)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64

torch.manual_seed(1)
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size)
model = model.to(device)
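In forward(), pack_padded_sequence uses the true sequence lengths so the LSTM iterates only over real time steps and is not influenced by the padded zeros. A minimal sketch of the mechanism with illustrative values, independent of the model above:

# Two padded sequences with true lengths 3 and 2 (illustrative values only)
seqs = torch.randn(2, 3, 4)                  # [batch, max_len, features]
seqs[1, 2, :] = 0.0                          # second sequence is padded at its last step
lengths = torch.tensor([3, 2])

packed = nn.utils.rnn.pack_padded_sequence(seqs, lengths,
                                            batch_first=True, enforce_sorted=False)
print(packed.data.shape)                     # torch.Size([5, 4]): only the 3 + 2 real steps are kept

lstm = nn.LSTM(4, 6, batch_first=True)
packed_out, (h, c) = lstm(packed)            # the LSTM processes the real steps only
out, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, out_lengths)                # torch.Size([2, 3, 6]) tensor([3, 2])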
def train(dataloader):
    model.train()
    total_acc, total_loss = 0, 0
    for text_batch, label_batch, lengths in dataloader:
        optimizer.zero_grad()
        pred = model(text_batch, lengths)[:, 0]
        loss = loss_fn(pred, label_batch)
        loss.backward()
        optimizer.step()
        total_acc += ((pred>=0.5).float() == label_batch).float().sum().item()
        total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)

def evaluate(dataloader):
    model.eval()
    total_acc, total_loss = 0, 0
    with torch.no_grad():
        for text_batch, label_batch, lengths in dataloader:
            pred = model(text_batch, lengths)[:, 0]
            loss = loss_fn(pred, label_batch)
            total_acc += ((pred>=0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item()*label_batch.size(0)
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10

torch.manual_seed(1)

for epoch in range(num_epochs):
    acc_train, loss_train = train(train_dl)
    acc_valid, loss_valid = evaluate(valid_dl)
    print(f'Epoch {epoch} accuracy: {acc_train:.4f} val_accuracy: {acc_valid:.4f}')
Epoch 0 accuracy: 0.6120 val_accuracy: 0.6610
Epoch 1 accuracy: 0.7511 val_accuracy: 0.7520
Epoch 2 accuracy: 0.8004 val_accuracy: 0.7974
Epoch 3 accuracy: 0.8266 val_accuracy: 0.8144
Epoch 4 accuracy: 0.8768 val_accuracy: 0.8392
Epoch 5 accuracy: 0.8703 val_accuracy: 0.8274
Epoch 6 accuracy: 0.8986 val_accuracy: 0.8372
Epoch 7 accuracy: 0.9111 val_accuracy: 0.8424
Epoch 8 accuracy: 0.9399 val_accuracy: 0.8444
Epoch 9 accuracy: 0.9513 val_accuracy: 0.8568
acc_test, _ = evaluate(test_dl)
print(f'test_accuracy: {acc_test:.4f}')
test_accuracy: 0.8474

22.6. Select one random comment in the test dataset and compute prediction#

import random

# Function to preprocess and predict a single comment
def predict_comment(text, model, vocab):
    model.eval()

    # Tokenize and encode the input text using the same tokenizer and vocab as used during training
    tokens = tokenizer(text)
    encoded_text = [vocab.get(token, vocab["<unk>"]) for token in tokens]

    # Convert the tokens to tensor and add batch dimension
    text_tensor = torch.tensor(encoded_text).unsqueeze(0).to(device)  # Add batch dimension

    # Length tensor (since we are processing a single comment, length is just the length of the sequence)
    lengths_tensor = torch.tensor([len(encoded_text)]).to(device)

    with torch.no_grad():
        prediction = model(text_tensor, lengths_tensor)[:, 0]

    # Apply threshold of 0.5 for binary classification (as the model uses Sigmoid)
    prediction_label = 1 if prediction >= 0.5 else 0

    return prediction.item(), prediction_label

# Select a random sample from the test dataset
random_index = random.randint(0, len(test_dataset) - 1)
random_comment = test_dataset[random_index]['text']
random_label = test_dataset[random_index]['label']

# Make a prediction on this random comment
predicted_value, predicted_label = predict_comment(random_comment, model, vocab)

# Output the result
print(f"Random comment: {random_comment}")
print(f"True label: {random_label}")
print(f"Predicted value: {predicted_value:.4f}")
print(f"Predicted label: {predicted_label}")
Random comment: Los Angeles, 1976. Indie film brat John Carpenter, fresh out of film school and with one film - his class project's no-budget spoof of 2001 called Dark Star - under his belt, finishes a gritty actioner called Assault On Precinct 13. The story of an almost deserted police station under siege by an unseen LA gang, it was a minor hit on the drive-in circuit and garnered small praise from the few critics who cared, but it hardly set the film world on fire, unlike Carpenter's follow-up smash Halloween (1978). On Precinct, Carpenter was still learning how to exploit his almost non-existent budget by using lower-shelf actors, keeping the action to the one hellishly small location, and moving the film along at a tight pace with a combination of editing, intelligent camera work and switched-on genre savvy.<br /><br />No-one wants or needs to be hungry in Hollywood anymore, particularly if the week's catering bill on the 2005 version of Assault On Precinct 13 is more than the entire cost of the original. It does translate into a certain kind of laziness on a filmmaker's part - you have a stupidly large union crew, a studio and a marketing firm all doing your thinking for you. Which is why twenty years after watching Carpenter's film I can still see every glorious moment, from the small girl gunned down in cold blood while buying an ice cream, to the relentless pounding synth score. A week after Assault 2005, I remember Larry Fishburne's unmoving ping pong ball eyes and little else.<br /><br />"Forgettable popcorn actioner" fits the top of the poster perfectly. It's New Years Eve at Precinct 13, a station closing down with a skeleton staff to see in its final hours. On call is Jake Roenick (Ethan Hawke), an ex-narc now deeply troubled and hopped up on Jack Daniels and Seconol after his partners were iced in the opening scene; Iris (The Sopranos' Drea de Matteo), a nympho with a thing for criminal types, and Jasper (Brian Dennehy), a crusty old timer one scotch away from retirement. As in Carpenter's Assault..., a bus with four heavy-duty criminals is rerouted to the Precinct. All boozy eyes are on gangster kingpin Bishop (Fishburne, still beefed-up from his time in the Matrix) who has narrowly survived an assassination attempt from an undercover cop and plans to blow the lid on the endemic corruption in the organized crime unit led by Marcus Duvall (a tired-looking Gabriel Byrne). Soon the phones are out, the power lines are down, and both crims and police find themselves heavily armed with a serious police arsenal and consumed with paranoia while waging war against a task force of Duvall's corrupt cops sporting white balaclavas, bullet vests, infra-red bazookas and more high-tech gear than the Skywalker Ranch. This, we're expected to believe as the helicopters buzz around the top of the police station shooting rockets into windows, is a clandestine operation to cover Duvall's tracks. He may as well have taken out billboards on Hollywood Boulevard.<br /><br />As with the recent Seventies genre reworking Dawn Of The Dead, Assault 2005 takes the barest plot essentials of John Carpenter's original and, to quote the Seventies, "does it's own thing, man". The main question is - why bother? John Carpenter's 1976 is a cult favorite among genre buffs, but is hardly branded in the public's collective consciousness. Carpenter himself was busy reworking Howard Hawks' classic western Rio Bravo into a tight, claustrophobic urban thriller for only $20,000. 
French wunderkind director and rap producer Jean-Francois Richet, a self-professed fan of John Carpenter's work, seems less concerned with making an homage to either Hawks or JC - although the script is peppered with references to cowboys and injuns - and seems intent on squeezing in as much flash and firepower as the multi-million dollar budget can withstand. The result: some tense moments with hand-held POV cameras, an unexpectedly high (and bloody) body count, a few neat plot twists, but essentially a B-grade urban actioner with a much inflated price tag. As for name-checking Carpenter, it's pure conceit on the part of the filmmakers that doesn't pay off.<br /><br />To Monsieur Richet, I say bon voyage, and I wish you luck on your music career.
True label: 0
Predicted value: 0.0203
Predicted label: 0

22.7. Build a bi-directional LSTM#

Recall:

  • nn.RNN(input_size, hidden_size, num_layers=1)

  • nn.LSTM(..)

  • nn.GRU(..)

  • nn.RNN(input_size, hidden_size, num_layers=1, bidirectional=True)

Do you expect this to perform better?
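A minimal sketch of one way to build it, mirroring the sentiment model above (the class name BiRNN and the reuse of the same hyperparameter names are illustrative choices, not a fixed recipe): pass bidirectional=True to nn.LSTM and concatenate the final forward and backward hidden states, which doubles the input size of the first fully connected layer.

class BiRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size,
                           batch_first=True, bidirectional=True)
        # forward and backward final states are concatenated -> 2 * rnn_hidden_size
        self.fc1 = nn.Linear(2 * rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, lengths.cpu().numpy(),
                                                enforce_sorted=False, batch_first=True)
        _, (hidden, cell) = self.rnn(out)
        # hidden has shape [num_layers * 2, batch, rnn_hidden_size];
        # hidden[-2] is the last forward state, hidden[-1] the last backward state
        out = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        out = self.relu(self.fc1(out))
        return self.sigmoid(self.fc2(out))

Training it with the same train/evaluate loop only requires instantiating BiRNN instead of RNN; whether the backward pass over each review actually improves validation accuracy is the question to test.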