Creating IMDb Movie Dataset With Python Implementation

Internet Movie Database (IMDb) is an online database devoted to information about a wide range of film content, such as movies, TV shows, and web-based streaming series. The data presented on the IMDb portal includes cast, production crew, directors, personal biographies, plot summaries, trivia, ratings, and fan and critic reviews.

The IMDb dataset contains 50,000 reviews, with no more than 30 reviews per movie. It was compiled in 2011 by the researchers Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts of Stanford University. The dataset is evenly divided into training and test sets of 25,000 reviews each.

A negative review has a score of ≤ 4 out of 10, and a positive review has a score of ≥ 7 out of 10. Neutral reviews were excluded from this dataset.
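As a quick illustration, the labelling rule can be written as a small helper function (a sketch; the score_to_label name is ours, not part of the dataset's tooling):

def score_to_label(score):
    # scores of 1-4 are negative, 7-10 are positive; 5-6 were excluded as neutral
    if score <= 4:
        return 'neg'
    if score >= 7:
        return 'pos'
    return None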

Here, we will examine the information contained in this dataset and how it was gathered, and mention some benchmark models that achieve high accuracy on it. Further, we will implement a classifier on the IMDb dataset using the Keras library.

Data Collection

The raw data was collected by the researchers from the IMDb website. They examined the text of each review and identified features indicative of whether the review was positive or negative. The reviews were then evenly divided into training and test sets and uploaded to their website. Each set contains two directories, pos and neg, which partition the data by label. Each of these folders holds numerous .txt files with the text of the movie reviews, one review per file.
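After extracting the archive, the directory layout looks roughly like this:

aclImdb/
    train/
        pos/    12,500 positive reviews, one .txt file per review
        neg/    12,500 negative reviews
    test/
        pos/
        neg/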

Loading the Dataset Using PyTorch

import os
import glob
from torchnlp.download import download_file_maybe_extract

Define the parameters that need to be passed to the function. The list x defined below will collect the reviews along with their polarity.

def imdb_dataset(directory='data/',
                 train=False,
                 test=False,
                 train_directory='train',
                 test_directory='test',
                 extracted_name='aclImdb',
                 check_files=['aclImdb/README'],
                 url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
                 sentiments=['pos', 'neg']):
    # download and extract the archive if it is not already present
    download_file_maybe_extract(url=url, directory=directory, check_files=check_files)
    x = []
    splits = [
        dir_ for (requested, dir_) in [(train, train_directory), (test, test_directory)]
        if requested
    ]
    for split_directory in splits:
        full_path = os.path.join(directory, extracted_name, split_directory)
        examples = []
        for sentiment in sentiments:
            for filename in glob.iglob(os.path.join(full_path, sentiment, '*.txt')):
                with open(filename, 'r', encoding="utf-8") as f:
                    text = f.readline()  # each file holds a single review on one line
                examples.append({
                    'text': text,
                    'sentiment': sentiment,
                })
        x.append(examples)  # one list of examples per requested split
    if len(x) == 1:
        return x[0]
    else:
        return tuple(x)
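A minimal usage sketch, assuming the function above is in scope and the archive downloads successfully:

train = imdb_dataset(train=True)
print(len(train))                                    # 25000 reviews in the training split
print(train[0]['sentiment'], train[0]['text'][:80])  # label plus the start of the review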

Code Implementation Using the Keras Library

The dataset is also available directly through Keras, which downloads it automatically on first use.

Import all the libraries required for this project.

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence

Load the data from the IMDb dataset and split it into train and test sets. Setting num_words=5000 keeps only the 5,000 most frequent words; rarer words are replaced with an out-of-vocabulary token.

maximum_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=maximum_words)
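Each review is loaded as a list of word indices rather than raw text. As a quick sanity check, the encoding can be reversed with imdb.get_word_index(); Keras reserves the first three indices for padding, start-of-sequence, and unknown tokens, hence the offset of 3 below (a minimal sketch):

word_index = imdb.get_word_index()
# shift by 3 because indices 0-2 are reserved for padding, start and unknown tokens
index_word = {index + 3: word for word, index in word_index.items()}
decoded = ' '.join(index_word.get(i, '?') for i in X_train[0])
print(decoded[:200])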

Let’s define the maximum length of a review as 500 tokens. If a review is longer than 500, pad_sequences truncates it to that length; if it is shorter, pad_sequences fills the remaining positions with “0”. Note that by default the zeros are added at the front of the sequence.

For example, “Bangalore” padded to a length of five becomes “0 0 0 0 Bangalore”.

max_review = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review)
X_test = sequence.pad_sequences(X_test, maxlen=max_review)
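A toy example makes this behaviour concrete (using the sequence module imported above): shorter sequences are pre-padded with zeros, and longer ones are truncated from the front down to maxlen:

print(sequence.pad_sequences([[7, 42], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[ 0  0  7 42]
#  [ 3  4  5  6]]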

We use model = Sequential() so that the data flows from input to output one layer at a time. The Embedding layer turns each word index into a 32-dimensional vector.

The LSTM layer decides which words in a review are important and lets that information flow through it. We then add a Dense layer at the end of the model with a sigmoid activation. The sigmoid squashes the output to a value between 0 and 1, which is interpreted as positive (close to 1) or negative (close to 0).

embedding_vector_length = 32
model = Sequential()
model.add(Embedding(maximum_words, embedding_vector_length, input_length=max_review))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

The next step is to train the model for 5 epochs with a batch size of 64. Our model gave an accuracy of 92.88% on the training data.

model.fit(X_train, y_train, epochs=5, batch_size=64)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Model accuracy on the IMDb dataset: {0:.2f}%".format(scores[1]*100))

We finished with an accuracy of 87.25% on the test dataset.
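Once trained, the model can score an individual padded review; the sigmoid output is the estimated probability that the review is positive (a minimal sketch):

probability = model.predict(X_test[:1])[0][0]
print('positive' if probability >= 0.5 else 'negative', probability)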

State of the art

The current state of the art on the IMDb dataset is NB-weighted-BON + dv-cosine, which achieves an accuracy of 97.4%. Graph Star and BERT-large finetuned with UDA are close contenders, with accuracies of around 96%.

Conclusion

In this article, we discussed the details of the IMDb dataset and implemented a sentiment classifier on it using the Keras library. The model achieved a decent accuracy of around 87% on the test data. The accuracy could likely be improved further by training the model for more epochs.
