Sentiment Analysis Using Keras Embedding Layer in TensorFlow 2.0
Learn How to Solve a Sentiment Analysis Problem With the Keras Embedding Layer and TensorFlow
Introduction
Text classification, one of the fundamental tasks in Natural Language Processing, is the process of assigning predefined categories to textual documents such as reviews, articles, tweets, and blog posts.
One of the special cases of text classification is sentiment analysis.
Sentiment analysis is the process of understanding people's or consumers' emotions or opinions about a product, service, person, or idea. By understanding consumers' opinions, producers can improve the quality of their products or services to better meet their customers' needs.
Sentiment analysis can be framed as binary classification (positive or negative) or multi-class classification (three or more classes, e.g., negative, neutral, and positive).
In this tutorial, we are going to learn how to perform simple sentiment analysis with TensorFlow by leveraging the Keras Embedding layer. As our case study, we'll use Amazon reviews.
Here is what we are going to do in this tutorial:
- Load the Amazon reviews data, then randomly take 2% of it as our dataset. From this 2%, we'll randomly split the data again into training data (70%) and validation data (30%).
- Perform preprocessing: remove punctuation, numbers, and single characters, and convert uppercase to lowercase, so that the model can learn from the text more easily.
- Convert all text in the corpus into sequences of integers using the Keras Tokenizer API.
- Create and train a deep learning model that classifies sentiment using the Keras Embedding layer.
- Validate the model.
Here is a straightforward guide to implementing it. Let's get started!
Data preparation
Step 1: Download the Amazon reviews data from Kaggle
For the purpose of this tutorial, we're going to use the Amazon reviews dataset from Kaggle, which can be downloaded from this link. If you want to work in Google Colab, you can upload this dataset to your Google Drive.
First of all, verify that TensorFlow 2.x is available in your Colab notebook. If it is, select it; otherwise, upgrade TensorFlow.
try:
    %tensorflow_version 2.x
except:
    !pip install --upgrade tensorflow
Then, mount your Google drive with the following code:
from google.colab import drive

drive.mount('/content/drive')
Run the code; the output will include an authorization link. Click the link and authorize the connection, and you will be given a code. Copy and paste it into the "Enter your authorization code:" box, then press Enter. Your Google Drive is now mounted.
As an example, I uploaded the file amazonreviews.zip to the NLP folder in my Google Drive.
Navigate to the path where your amazonreviews.zip file is located. Mine is as follows:
%cd drive/My\ Drive/NLP
Unzip the amazonreviews.zip file:
!unzip amazonreviews.zip
The amazonreviews.zip file contains two compressed files, train.ft.txt.bz2 and test.ft.txt.bz2.
In this tutorial, we’re going to use only the train.ft.txt.bz2
file. So just decompress this file using the following command, then you will have a .txt
file, that istrain.ft.txt
.
!bzip2 -d train.ft.txt.bz2
Now, we’re going to open the train.ft.txt
file. To do so, use the following code:
with open('train.ft.txt', 'r') as file:
    lines = file.readlines()
First, let’s take a look at the contents of the train.ft.txt
file.
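You can inspect the format by printing the first couple of lines (a quick check, not part of the pipeline itself):

# Each line starts with a label token followed by the review text
for line in lines[:2]:
    print(line)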
As you can observe, each line begins with a label followed by the review. The file contains only two labels, __label__2 and __label__1, for positive and negative reviews, respectively.
So far, we’re doing good. Let’s go ahead.
Step 2: Data Preprocessing
Since our data source is a .txt file, I prefer to convert it to a Pandas data frame. So, the first step of this data preparation is to convert the .txt data to the Pandas data frame format.
Converting Data to Pandas Data Frame
To do so, I will start by importing Pandas and creating a Pandas data frame DF_text_data as follows:
# create a dataframe
import pandas as pd

DF_text_data = pd.DataFrame()
Now, we’re going to loop over the lines
using the variable line
. Then, we’ll separate the labels and the reviews from the line
and store them to the Pandas’ data frame DF_text_data
with different columns.
As we loop over the lines, we convert the text labels to numerical labels. Since this is a binary classification problem, i.e., negative and positive reviews, we can easily convert the labels by replacing every __label__2 with 1 and every __label__1 with 0.
Here is the code for doing this:
texts = []
labels = []
for line in lines:
    line = line.split()
    labels.append(1 if line[0] == "__label__2" else 0)
    texts.append(" ".join(line[1:]))

DF_text_data['reviews'] = texts
DF_text_data['labels'] = labels
If we print DF_text_data, we will see that the data consists of three columns: the index, reviews, and labels.
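Printing the data frame is an easy way to check this; pandas shows the first and last rows plus the total row count:

# pandas prints the head, the tail, and the total number of rows
print(DF_text_data)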
As you can see, the index runs from 0 to 3,599,999, meaning the dataset contains 3.6 million reviews and labels. That is a big dataset. If you have good computing resources, you could use all of it; otherwise, we'll be using only a small part, say 2 percent. To do so, check this code:
from sklearn import model_selection

_, X_data, _, y_data = model_selection.train_test_split(
    DF_text_data['reviews'], DF_text_data['labels'], test_size=0.02)
X_data and y_data now contain only about 72K reviews and labels.
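You can verify the sample size with a quick check (72K is roughly 2% of the 3.6M rows):

# Both should print roughly 72000
print(len(X_data), len(y_data))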
Data cleaning
Before we go deeper into the analysis, we need to clean the data: remove punctuation, numbers, and single characters, and convert uppercase to lowercase, so that the model can learn from the data more easily.
The following is the function for this purpose:
import re

def preprocess(in_text):
    # If the text contains HTML tags, remove them first, e.g.:
    # out_text = remove_tags(in_text)
    # Remove punctuation and numbers
    out_text = re.sub('[^a-zA-Z]', ' ', in_text)
    # Convert upper case to lower case
    out_text = out_text.lower()
    # Remove single characters
    out_text = re.sub(r"\s+[a-zA-Z]\s+", ' ', out_text)
    return out_text
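To see its effect, you can run it on a made-up sample review (illustrative only). Note that the regex replaces punctuation and digits with spaces, so some extra whitespace is left behind; the Tokenizer used later splits on whitespace, so this is harmless.

# A made-up review, just for illustration
sample = "Great product!!! Worked for 2 weeks, then it died."
print(preprocess(sample))
# Prints the cleaned text: "great product worked for weeks then it died" (with extra spaces)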
Now, perform the preprocessing by calling the preprocess function:
# Perform preprocessing
text_data = []
for review in list(X_data):
    text_data.append(preprocess(review))
Create a new data frame to store the preprocessed data:
DF_text = pd.DataFrame()
DF_text['reviews'] = text_data
DF_text['labels'] = list(y_data)
Now, we plot the class distribution. From the plot, we can see that the negative and positive sentiments are almost evenly distributed.
# Plot the class distribution
import seaborn as sns

sns.countplot(x='labels', data=DF_text)

# If you use Anaconda with PyCharm, uncomment these lines to show the figure:
# import matplotlib.pyplot as plt
# plt.show()
Now we’re going to divide our dataset into 70% as training and 30% as testing data.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    DF_text['reviews'], DF_text['labels'], test_size=0.30)
Convert them to NumPy arrays:
import numpy as np

X_train = np.array(X_train.values.tolist())
X_test = np.array(X_test.values.tolist())
y_train = np.array(y_train.values.tolist())
y_test = np.array(y_test.values.tolist())
Word Embeddings
A deep learning model requires numerical data as its input. Since we're working on text classification, we need to translate our text data into numerical vectors. To do so, we're going to use a method called word embeddings. This method encodes every word as an n-dimensional dense vector, in which similar words have similar encodings.
For this purpose, we’re going to use a Keras Embedding layer. Embedding layer can be used to learn both custom word embeddings and predefined word embeddings like GloVe and Word2Vec.
In this NLP tutorial, we’re going to use a Keras embedding layer to train our own custom word embedding model. The layer is initialized with random weights and is defined as the first hidden layer of a network.
The Embedding layer has three important arguments (a minimal example follows the list):
- input_dim: Size of the vocabulary in the text data.
- output_dim: Size of the vector space in which words will be embedded. This is a hyperparameter you can experiment with to improve performance (e.g., 32, 100, …).
- input_length: Length of the input sequences.
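As a minimal sketch, here is how an Embedding layer with illustrative (not tuned) values would be defined, assuming a 10,000-word vocabulary, 32-dimensional vectors, and 100-token inputs:

from tensorflow.keras.layers import Embedding

# Illustrative values: 10,000-word vocabulary, 32-dim vectors, 100-token inputs
embedding_layer = Embedding(input_dim=10000, output_dim=32, input_length=100)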
Tokenizer
Before the text data can be fed to the Keras Embedding layer, it must first be encoded so that each word is represented by a unique integer, as required by the Embedding layer. For this, Keras provides a Tokenizer API that allows us to vectorize a text corpus into sequences of integers.
The following is the code for tokenization. First, we create a Keras Tokenizer object. Then, with this object, we call the fit_on_texts function to fit the tokenizer to the dataset.
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
print(vocab_size)
After fitting the tokenizer to the dataset, we're ready to convert our text to sequences by passing it to the texts_to_sequences function.
This function converts each text into a sequence of integers, where each integer corresponds to a word's unique index in the tokenizer's vocabulary.
We do it for both training and testing data.
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
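To see what a converted review looks like, you can print one sequence (the exact integers depend on your fitted vocabulary):

# A list of integers, one per word in the first training review
print(X_train[0])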
Finally, we pad the sequences so that all vectors have the same length, maxlen.
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 100
X_train_pad = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test_pad = pad_sequences(X_test, padding='post', maxlen=maxlen)
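As a quick sanity check, you can print the shapes (the row counts depend on your random split):

# Each review is now a fixed-length vector of maxlen=100 integers
print(X_train_pad.shape)
print(X_test_pad.shape)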
Now, the data is ready to be fed to the model.
Create a Model
We are now ready to create the neural network model. For this tutorial, we use a simple network; you can try a deeper network or a different configuration, such as an LSTM layer, and compare the results.
We create a sequential model in which the Embedding layer is the first layer, followed by a GRU layer with dropout=0.2 and recurrent_dropout=0.2, and finally a dense layer with the sigmoid activation function. We use sigmoid because we have a single output.
from tensorflow.keras.layers import GRU, Dense, Embedding
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(vocab_size, 20, input_length=maxlen))
model.add(GRU(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
Compile the Model
To compile the model, we use the Adam optimizer with binary_crossentropy loss.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())
Train the Model
Now, it’s time to train the model.
The code below trains the model, validating it on the test set during training.
model.fit(X_train_pad, y_train,
          batch_size=128,
          epochs=10,
          verbose=1,
          validation_data=(X_test_pad, y_test))
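After training, you can also evaluate the model on the test set explicitly, as a check beyond the per-epoch validation metrics:

# Evaluate on the held-out test data
loss, acc = model.evaluate(X_test_pad, y_test, verbose=0)
print(f"Test loss: {loss:.4f}, test accuracy: {acc:.4f}")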
After 10 epochs, the model achieves 86.66% validation accuracy. Not bad.
Conclusion
In this article, we’ve built a simple model of sentiment analysis using custom word embeddings by leveraging the Keras API in TensorFlow 2.0.
Here are some remarks:
- Text classification requires some data preprocessing, including removing punctuation, numbers, and single characters, and converting uppercase to lowercase, so that the model can learn from the text more easily and achieve better accuracy.
- A deep learning model requires numerical data as its input. Therefore, we need to convert our text data into numerical vectors. To do so, we use the word embeddings method.
- Word embeddings represent words as dense numerical vectors in which similar words have similar encodings.
- They are considered one of the most effective representations of words in NLP.
Next…
To explore further, in the next tutorial we're going to use two popular pre-trained word embeddings: GloVe and Word2Vec. See you in the next tutorial!