The Internet Movie Database (IMDb) is an online database devoted to a wide range of information about film content such as movies, TV shows, and web-streaming series. The information presented on the IMDb portal includes cast, production crew, directors, personal biographies, plot summaries, trivia, ratings, and fan and critic reviews.
The IMDb dataset contains 50,000 reviews, with no more than 30 reviews per movie. It was developed in 2011 by the researchers Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts of Stanford University. The dataset is evenly divided into training and test sets of 25,000 reviews each.
A negative review has a score of ≤ 4 out of 10, and a positive review has a score of ≥ 7 out of 10. Neutral reviews were excluded from this dataset.
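To make the rule concrete, here is a small sketch of the mapping from an IMDb score to a dataset label (the function and its name are our illustration, not part of the dataset tooling):

def label_from_score(score):
    # Scores of 4 or less are negative, 7 or more are positive;
    # neutral scores (5 and 6) were excluded from the dataset entirely.
    if score <= 4:
        return 'neg'
    if score >= 7:
        return 'pos'
    return None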
Here, we will examine the information contained in this dataset and how it was collected, and point to some benchmark models that achieve high accuracy on it. We will then load and model the IMDb dataset using the Keras library.
Data Collection
The raw data was collected by the researchers from the IMDb website. They examined the text of each review and identified features that indicate whether a review is positive or negative. The reviews were then evenly divided into training and test sets and uploaded to their website. Each split directory contains two subdirectories, pos and neg, which partition the reviews by label. Each of these folders holds many .txt files with the content of the movie reviews, one review document per file.
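Here is a quick sanity check of that layout (a sketch assuming the archive has been extracted to data/aclImdb, the default used by the loader in the next section); since both splits are balanced, each of the four leaf folders should hold 12,500 files:

import os

for split in ('train', 'test'):
    for sentiment in ('pos', 'neg'):
        path = os.path.join('data', 'aclImdb', split, sentiment)
        print(split, sentiment, len(os.listdir(path)))  # expect 12500 each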
Loading the Dataset Using PyTorch
import os
import glob
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torchnlp.download import download_file_maybe_extract
Define the parameters that need to be passed to the function. The list x defined below will collect the reviews together with their polarity labels.
def imdb_dataset(directory='data/', train=False, test=False,
                 train_directory='train', test_directory='test',
                 extracted_name='aclImdb', check_files=['aclImdb/README'],
                 url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
                 sentiments=['pos', 'neg']):
    # Download and extract the archive unless the check files already exist
    download_file_maybe_extract(url=url, directory=directory, check_files=check_files)

    x = []
    splits = [
        dir_ for (requested, dir_) in [(train, train_directory), (test, test_directory)]
        if requested
    ]
    for split_directory in splits:
        full_path = os.path.join(directory, extracted_name, split_directory)
        examples = []
        for sentiment in sentiments:
            for filename in glob.iglob(os.path.join(full_path, sentiment, '*.txt')):
                with open(filename, 'r', encoding="utf-8") as f:
                    text = f.readline()  # each file holds a single one-line review
                examples.append({
                    'text': text,
                    'sentiment': sentiment,
                })
        x.append(examples)

    if len(x) == 1:
        return x[0]
    else:
        return tuple(x)
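Example usage, assuming the default data/ directory: the first call downloads and extracts the archive, requesting a single split returns that split directly, and requesting both returns a (train, test) tuple.

train = imdb_dataset(train=True)
print(len(train))              # 25000 examples
print(train[0]['sentiment'])   # 'pos' or 'neg'
print(train[0]['text'][:60])   # first characters of the review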
Code Implementation Using the Keras Library
The raw dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz, although keras.datasets.imdb downloads a preprocessed, integer-encoded version automatically when load_data is called.
Import all the libraries required for this project.

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence
Load the data from the IMDb dataset and split it into training and test sets, restricting the vocabulary to the 5,000 most frequent words.
maximum_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=maximum_words)
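Since load_data returns each review as a list of integer word indices, it can be useful to decode one back into words. A small sketch: Keras reserves indices 0 (padding), 1 (start), and 2 (unknown), so the indices from get_word_index are offset by 3.

word_index = imdb.get_word_index()
index_word = {index + 3: word for word, index in word_index.items()}
decoded = ' '.join(index_word.get(i, '?') for i in X_train[0])
print(y_train[0], decoded[:100])  # label (1 = positive, 0 = negative) and text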
Let's define the maximum length of a review. Reviews longer than 500 tokens are truncated to that length, while reviews shorter than 500 tokens are padded with zeros by pad_sequences. For example, with a maximum length of five, a one-word review "Bangalore" becomes "0 0 0 0 Bangalore", since pad_sequences pads at the beginning by default.
max_review = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review)
X_test = sequence.pad_sequences(X_test, maxlen=max_review)
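A quick check that padding produced fixed-length inputs:

print(X_train.shape)  # (25000, 500)
print(X_test.shape)   # (25000, 500)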
We add the model = Sequential() line so that data flows through the layers in sequence, from input to output. The Embedding layer turns each word index into a dense vector of 32 dimensions.
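Below is a minimal sketch of the full model; the text above specifies only the Sequential container and the 32-dimensional Embedding, so the remaining choices (100 LSTM units, a sigmoid output, the adam optimizer, and the epoch/batch settings) are our assumptions for a working example.

model = Sequential()
model.add(Embedding(maximum_words, 32, input_length=max_review))
model.add(LSTM(100))                        # assumption: 100 recurrent units
model.add(Dense(1, activation='sigmoid'))   # binary sentiment output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=3, batch_size=64)          # assumption: training settings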