Email spam, or junk email, is unsolicited, unavoidable, and repetitive messaging sent by email. Email spam has grown steadily since the early 1990s, and by 2014, it was estimated to make up around 90% of all email messages sent.
Since we all have the problem of spam emails filling our inboxes, in this tutorial, we will build a model in Keras that can distinguish between spam and legitimate emails.
Table of contents:
We first need to install some dependencies:
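Something along these lines should do (the exact package list is an assumption; scikit-learn is only needed for the train/test split later):

```bash
pip install tensorflow scikit-learn tqdm numpy
```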
Now open up an interactive shell or a Jupyter notebook and import:
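Here is a plausible set of imports covering everything used in the rest of the tutorial (a sketch; your exact list may differ slightly):

```python
import tqdm
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.metrics import Precision, Recall
```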
Let's define some hyper-parameters:
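For example (SEQUENCE_LENGTH and EMBEDDING_SIZE of 100 are referenced later in the article; the remaining values are assumptions you can tune):

```python
SEQUENCE_LENGTH = 100  # length of all padded sequences (number of words per sample)
EMBEDDING_SIZE = 100   # size of each word vector (we use 100-dimensional GloVe vectors)
LSTM_UNITS = 128       # number of LSTM units (assumed)
TEST_SIZE = 0.25       # ratio of the testing set (assumed)
BATCH_SIZE = 64        # assumed
EPOCHS = 10            # assumed
```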
Don't worry if you are not sure what these parameters mean; we'll talk about them later when we construct our model.
The dataset we are going to use is the SMS Spam Collection Dataset; download it, extract it, and put it in a folder called "data". Let's define the function that loads it:
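A minimal sketch of the loader, assuming the extracted file lives at data/SMSSpamCollection and follows the format described next (label first, then the message):

```python
def load_data():
    """Loads the SMS Spam Collection dataset,
    returns a list of texts and a list of labels."""
    texts, labels = [], []
    with open("data/SMSSpamCollection", encoding="utf8") as f:
        for line in f:
            split = line.split()
            labels.append(split[0].strip())             # "ham" or "spam"
            texts.append(" ".join(split[1:]).strip())   # the message itself
    return texts, labels
```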
The dataset is in a single file; each line corresponds to a data sample, where the first word is the label and the rest is the actual message content. That's why we grab the label as split[0] and the content as split[1:].
Calling the function:
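For instance:

```python
X, y = load_data()
```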
Now, we need a way to vectorize the text corpus by turning each text into a sequence of integers. You may be wondering why we need to turn the text into a sequence of integers. Well, remember that we are going to feed the text into a neural network, and a neural network only understands numbers; more precisely, a fixed-length sequence of integers.
But before we do all of that, we need to clean this corpus: removing punctuation, lowercasing all characters, and so on. Luckily for us, Keras has a built-in Tokenizer class in the tensorflow.keras.preprocessing.text module that does all of that in a few lines of code:
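A minimal sketch using the Tokenizer API (fit_on_texts() builds the vocabulary; texts_to_sequences() does the integer conversion):

```python
# Text tokenization: build the vocabulary from the corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
# convert each text to a sequence of integers
X = tokenizer.texts_to_sequences(X)
```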
Let's try to print the first sample:
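```python
print(X[0])
```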
A bunch of numbers; each integer corresponds to a word in the vocabulary, which is what the neural network needs anyway. However, the samples don't have the same length, and we need a way to get fixed-length sequences.
As a result, we're using the pad_sequences() function from the tensorflow.keras.preprocessing.sequence module, which pads each sequence with zeros at the beginning:
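Something like this (pad_sequences() pads at the beginning by default):

```python
# pad sequences with zeros at the beginning so all have length SEQUENCE_LENGTH
X = pad_sequences(X, maxlen=SEQUENCE_LENGTH)
```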
As you may remember, we set SEQUENCE_LENGTH to 100, so all sequences now have a length of 100. Let's print what each sentence is converted to:
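A quick check (the exact integers depend on the tokenizer's vocabulary):

```python
print(X[0])
```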
Our labels are text as well, but we'll take a different approach here: since the labels are only "spam" and "ham", we need to one-hot encode them:
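A minimal sketch (the label2int/int2label mappings are my own helper names; int2label is reused later when making predictions):

```python
# map labels to integers and one-hot encode them
label2int = {"ham": 0, "spam": 1}
int2label = {0: "ham", 1: "spam"}
y = [label2int[label] for label in y]
y = to_categorical(y)
```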
We used keras.utils.to_categorical() here, which does what its name suggests. Let's try to print the first sample of the labels:
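```python
print(y[0])
```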
That means the first sample is ham.
Next, let's shuffle and split training and testing data:
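A sketch using scikit-learn's train_test_split(), which shuffles by default (the random_state value is an arbitrary assumption):

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=7)
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
```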
Cell output:
As you can see, we have a total of 4180 training samples and 1394 validation samples.
Now we are ready to build our model; the general architecture is shown in the following image:
The first layer is a pre-trained embedding layer that maps each word to an N-dimensional vector of real numbers (the EMBEDDING_SIZE corresponds to the size of this vector, in this case, 100). Two words that have similar meanings tend to have very close vectors.
The second layer is a recurrent neural network with LSTM units. Finally, the output layer consists of 2 neurons, each corresponding to "spam" or "ham", with a softmax activation function.
Let's start by writing a function to load the pre-trained embedding vectors:
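A sketch of the loader, assuming the extracted GloVe file lives at data/glove.6B.100d.txt:

```python
def get_embedding_vectors(tokenizer, dim=100):
    """Builds an embedding matrix from pre-trained GloVe vectors
    for every word in the tokenizer's vocabulary."""
    embedding_index = {}
    with open(f"data/glove.6B.{dim}d.txt", encoding="utf8") as f:
        for line in tqdm.tqdm(f, "Reading GloVe"):
            # each line is: word v1 v2 ... v_dim
            values = line.split()
            embedding_index[values[0]] = np.asarray(values[1:], dtype="float32")
    word_index = tokenizer.word_index
    # index 0 is reserved for padding, hence the +1
    embedding_matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vector = embedding_index.get(word)
        if vector is not None:
            # words not found in GloVe keep all-zero vectors
            embedding_matrix[i] = vector
    return embedding_matrix
```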
Note: To run this function properly, you need to download GloVe, extract it, and put it in the "data" folder; we will use the 100-dimensional vectors here.
Let's define the function that builds the model:
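A sketch matching the architecture described below (the optimizer choice is an assumption):

```python
def get_model(tokenizer, lstm_units):
    """Constructs the model:
    pre-trained embeddings -> LSTM -> dropout -> softmax output."""
    embedding_matrix = get_embedding_vectors(tokenizer)
    model = Sequential()
    model.add(Embedding(len(tokenizer.word_index) + 1,
                        EMBEDDING_SIZE,
                        weights=[embedding_matrix],
                        trainable=False,            # freeze the embedding weights
                        input_length=SEQUENCE_LENGTH))
    model.add(LSTM(lstm_units))
    model.add(Dropout(0.3))
    model.add(Dense(2, activation="softmax"))
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
                  metrics=["accuracy", Precision(), Recall()])
    model.summary()
    return model
```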
The above function constructs the whole model. We loaded the pre-trained embedding vectors into the Embedding layer and set trainable=False, which freezes the embedding weights during the training process.
After the RNN layer, we added a dropout layer with a 30% rate; it randomly drops 30% of the previous layer's outputs in each iteration, which helps reduce overfitting.
Note that accuracy alone isn't enough to determine whether the model is doing well, because this dataset is unbalanced: only a small fraction of the samples are spam. As a result, we will also use the precision and recall metrics.
Let's call the function:
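For example:

```python
model = get_model(tokenizer=tokenizer, lstm_units=LSTM_UNITS)
```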
We are almost there; we need to train this model on the data we just loaded:
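A sketch of the training call, with a TensorBoard callback writing to a "logs" folder (an assumption that matters for the TensorBoard section at the end):

```python
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          callbacks=[TensorBoard(log_dir="logs")],
          verbose=1)
```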
The training has started:
The training is finished:
Let's evaluate our model:
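Something like this (evaluate() returns the loss followed by the metrics in the order they were passed to compile()):

```python
loss, accuracy, precision, recall = model.evaluate(X_test, y_test, verbose=0)
print(f"[+] Accuracy:  {accuracy*100:.2f}%")
print(f"[+] Precision: {precision*100:.2f}%")
print(f"[+] Recall:    {recall*100:.2f}%")
```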
Output:
Here is what each metric means:
Great! Let's test this out:
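A hypothetical helper (the name get_predictions is my own) that runs a raw message through the same preprocessing pipeline and returns the predicted label:

```python
def get_predictions(text):
    sequence = tokenizer.texts_to_sequences([text])
    # pad to the same fixed length used during training
    sequence = pad_sequences(sequence, maxlen=SEQUENCE_LENGTH)
    # pick the class with the highest probability
    prediction = model.predict(sequence)[0]
    return int2label[np.argmax(prediction)]
```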
Let's fake a spam email:
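For instance (the message below is made up for illustration):

```python
text = "Congratulations! You have won a $1,000 prize. Call now to claim it!"
print(get_predictions(text))
```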
Output:
Okay, let's try to be legitimate:
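Again, an illustrative message:

```python
text = "Hi man, are we still meeting for lunch today?"
print(get_predictions(text))
```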
Output:
Awesome! This approach performs quite well; try tuning the training and model parameters and see if you can improve the results.
To see the various metrics during training, we need to launch TensorBoard by typing in cmd or terminal:
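Assuming the TensorBoard callback above wrote to the "logs" folder:

```bash
tensorboard --logdir="logs"
```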
Go to the browser, type "localhost:6006", and navigate among the various metrics; here is my result:
Here are some further readings:
I encourage you to check the full code.
Read Also: How to Perform Text Classification in Python using Tensorflow 2 and Keras.
Happy Coding ♥