IMDB Movie Reviews in R

  • Post by Hieu Nguyen Phi, FRM
  • Apr 19, 2019

Two-class classification, or binary classification, may be the most widely applied kind of machine-learning problem. This post will show how to classify movie reviews as positive or negative, based on the text content of the reviews.

IMDB dataset

Our data is derived from the Internet Movie Database with a set of \(50000\) highly polarized reviews. They’re split into \(25000\) reviews for training and \(25000\) reviews for testing, each set consisting of \(50\%\) negative and \(50\%\) positive reviews. The IMDB dataset comes packaged with Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

The following code will load the dataset:

library(keras)

imdb <- dataset_imdb(num_words = 10000)
# %<-% is the multi-assignment operator from zeallot, re-exported by keras
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% imdb
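As a sanity check, you can map the encoded integers back to words. The sketch below uses dataset_imdb_word_index(), and relies on the fact that the indices are offset by \(3\) because \(0\), \(1\), and \(2\) are reserved for "padding," "start of sequence," and "unknown":

word_index <- dataset_imdb_word_index()  # named list mapping words to indices
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index

# Decode the first training review; subtract the offset of 3
decoded_review <- sapply(train_data[[1]], function(index) {
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})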

You can’t feed lists of integers into a neural network. You have to turn your lists into tensors. There are two ways to do that:

  • Pad your lists so that they all have the same length, turn them into an integer tensor of shape (samples, word_indices), and then use as the first layer in your network a layer capable of handling such integer tensors.
  • One-hot-encode your lists to turn them into vectors of \(0\)s and \(1\)s. This would mean, for instance, turning the sequence \([3, 5]\) into a \(10000\)-dimensional vector that would be all zeros except for indices \(3\) and \(5\), which would be ones. Then you could use as the first layer in your network a dense layer, capable of handling floating-point vector data.

Let’s go with the latter solution and vectorize the data, which you’ll do manually for maximum clarity.

vectorize_sequences <- function(sequences, dimension = 10000) {
  # Create an all-zero matrix of shape (length(sequences), dimension)
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in 1:length(sequences))
    # Set the entries at the review's word indices to 1
    results[i, sequences[[i]]] <- 1
  results
}

x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)
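A quick check that the encoding worked: each review should now be a \(10000\)-dimensional vector of \(0\)s and \(1\)s, and there are \(25000\) of them.

dim(x_train)  # 25000 10000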

You should also convert your labels from integer to numeric, which is straightforward:

y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)

Building the network

The input data is vectors, and the labels are scalars (1s and 0s): this is the easiest setup you’ll ever encounter. A type of network that performs well on such a problem is a simple stack of fully connected (dense) layers with relu activations: layer_dense(units = 16, activation = "relu").

The argument being passed to each dense layer (16) is the number of hidden units of the layer. A hidden unit is a dimension in the representation space of the layer. You can intuitively understand the dimensionality of your representation space as “how much freedom you’re allowing the network to have when learning internal representations.” Having more hidden units (a higher-dimensional representation space) allows your network to learn more complex representations, but it makes the network more computationally expensive and may lead to learning unwanted patterns (patterns that will improve performance on the training data but not on the test data).

There are two key architecture decisions to be made about such a stack of dense layers:

  • How many layers to use
  • How many hidden units to choose for each layer

For this problem, we will go with two intermediate layers with 16 hidden units each, and a third layer that will output the scalar prediction regarding the sentiment of the current review.

The intermediate layers will use relu as their activation function, and the final layer will use a sigmoid activation so as to output a probability (a score between \(0\) and \(1\), indicating how likely the sample is to have the target “\(1\)”: that is, how likely the review is to be positive). A relu (rectified linear unit) is a function meant to zero-out negative values, whereas a sigmoid “squashes” arbitrary values into the [0, 1] interval, outputting something that can be interpreted as a probability.
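To make these two functions concrete, here is a minimal base-R sketch, for intuition only (Keras uses its own optimized implementations):

relu <- function(x) pmax(0, x)             # zeroes out negative values
sigmoid <- function(x) 1 / (1 + exp(-x))   # squashes into (0, 1)

relu(c(-2, -0.5, 0, 3))   # 0.0 0.0 0.0 3.0
sigmoid(c(-2, 0, 2))      # 0.119 0.500 0.881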

model <- keras_model_sequential() %>%
  # The input is a 10,000-dimensional one-hot vector
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")  # outputs a probability
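You can inspect the architecture with summary(model). The parameter counts are easy to verify by hand: the first layer has \(10000 \times 16 + 16 = 160016\) parameters (weights plus biases), the second \(16 \times 16 + 16 = 272\), and the output layer \(16 + 1 = 17\).

summary(model)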

Finally, you need to choose a loss function and an optimizer. Because you’re facing a binary classification problem and the output of your network is a probability (you end your network with a single-unit layer with a sigmoid activation), it’s best to use the binary_crossentropy loss. It isn’t the only viable choice: you could use, for instance, mean_squared_error. But crossentropy is usually the best choice when you’re dealing with models that output probabilities. Crossentropy is a quantity from the field of Information Theory that measures the distance between probability distributions or, in this case, between the ground-truth distribution and your predictions.
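Concretely, for a single sample with true label \(y \in \{0, 1\}\) and predicted probability \(p\), binary crossentropy is \(-[y \log p + (1 - y) \log(1 - p)]\). A hand-rolled sketch in plain R, for intuition only (Keras computes this internally, averaged over the batch):

binary_crossentropy <- function(y_true, y_pred) {
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}

binary_crossentropy(y_true = 1, y_pred = 0.9)  # 0.105: confident and right
binary_crossentropy(y_true = 1, y_pred = 0.1)  # 2.303: confident and wrong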

Here’s the step where you configure the model with the rmsprop optimizer and the binary_crossentropy loss function. Note that you’ll also monitor accuracy during training.

You’re passing your optimizer, loss function, and metrics as strings, which is possible because rmsprop, binary_crossentropy, and accuracy are packaged as part of Keras. Sometimes you may want to configure the parameters of your optimizer or pass a custom loss function or metric function. The former can be done by passing an optimizer instance as the optimizer argument; the latter can be done by passing function objects as the loss and/or metrics arguments.

model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.001),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
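As an illustration of the second case, here is a sketch of passing a custom metric as a function object, using the custom_metric() helper from the keras package (the metric itself, the mean predicted probability, is purely illustrative):

# Illustrative custom metric: the mean predicted probability per batch
mean_pred <- custom_metric("mean_pred", function(y_true, y_pred) {
  k_mean(y_pred)
})

model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.001),
  loss = "binary_crossentropy",
  metrics = list("accuracy", mean_pred)
)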

Validating the approach

To monitor the accuracy of the model during training on data it has never seen before, you'll create a validation set by setting apart 10,000 samples from the original training data.

val_indices <- 1:10000

# Set apart the first 10,000 samples for validation
x_val <- x_train[val_indices, ]
partial_x_train <- x_train[-val_indices, ]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]

You’ll now train the model for \(20\) epochs (\(20\) iterations over all samples in the x_train and y_train tensors), in mini-batches of \(512\) samples. At the same time, you’ll monitor loss and accuracy on the \(10000\) samples that you set apart. You do so by passing the validation data as the validation_data argument.

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)

The history object has a plot() method that enables us to visualize the training and validation metrics by epoch:

plot(history)
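The same per-epoch values are also available programmatically in history$metrics, so you can locate the best epoch without reading it off the plot. A small sketch, assuming the metric names used by this version of Keras (val_loss, val_acc):

# Epoch with the lowest validation loss
which.min(history$metrics$val_loss)
# Epoch with the highest validation accuracy
which.max(history$metrics$val_acc)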

As you can see, the training loss decreases with every epoch, and the training accuracy increases with every epoch. That's what you would expect when running a gradient-descent optimization: the quantity you're trying to minimize should decrease with every iteration. But that isn't the case for the validation loss and accuracy: they seem to peak at the fourth epoch. This is an example of what we warned against: a model that performs better on the training data isn't necessarily a model that will do better on data it has never seen before. In precise terms, what you're seeing is overfitting: after the fourth epoch, you're over-optimizing on the training data, and you end up learning representations that are specific to the training data and don't generalize to data outside of the training set.

In this case, to prevent overfitting, you could stop training after the fourth epoch. Let's train a new network from scratch for four epochs and then evaluate it on the test data.

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
model %>% fit(x_train, y_train, epochs = 4, batch_size = 512)
results <- model %>% evaluate(x_test, y_test)

The final results are as follows:

results
## $loss
## [1] 0.2960388
## 
## $acc
## [1] 0.88164

Using a trained network to generate predictions on new data

After training a network, you'll want to use it in a practical setting. You can generate the likelihood of reviews being positive by using the predict method:

model %>% predict(x_test[1:10,])
##             [,1]
##  [1,] 0.25151318
##  [2,] 0.99993598
##  [3,] 0.92726833
##  [4,] 0.89425069
##  [5,] 0.97707933
##  [6,] 0.85299319
##  [7,] 0.99980563
##  [8,] 0.02103745
##  [9,] 0.97155803
## [10,] 0.99550676
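To turn these probabilities into hard class labels, threshold them at \(0.5\). The cutoff is a modeling choice, not something Keras fixes for you:

preds <- model %>% predict(x_test[1:10, ])
# Reviews with predicted probability above 0.5 are called positive
ifelse(preds > 0.5, "positive", "negative")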

Further experiments

The following experiments will help convince you that the architecture choices you've made are all fairly reasonable, although they can still be improved (a starting-point sketch for the first one follows the list):

  • You used two hidden layers. Try using one or three hidden layers, and see how doing so affects validation and test accuracy.
  • Try using layers with more hidden units or fewer hidden units: 32 units, 64 units, and so on.
  • Try using the mse loss function instead of binary_crossentropy.
  • Try using the tanh activation (an activation that was popular in the early days of neural networks) instead of relu.
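For instance, here is a minimal sketch of the one-hidden-layer variant from the first bullet, keeping everything else the same (one of the suggested experiments, not a recommendation):

# Variant: a single 16-unit hidden layer
model_1layer <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model_1layer %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history_1layer <- model_1layer %>% fit(
  partial_x_train, partial_y_train,
  epochs = 20, batch_size = 512,
  validation_data = list(x_val, y_val)
)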