Transformer models have been showing incredible results in most natural language processing tasks. The power of transfer learning combined with large-scale transformer language models has become the standard in state-of-the-art NLP.
One of the most significant milestones in the evolution of NLP is the release of Google's BERT model in late 2018, which is widely seen as the beginning of a new era in NLP.
In this tutorial, we will take you through an example of fine-tuning BERT (and other transformer models) for text classification using the Huggingface Transformers library on the dataset of your choice.
Please note that this tutorial is about fine-tuning the BERT model on a downstream task (such as text classification). If you want to train BERT from scratch, that's called pre-training; check this tutorial if that's what you're after.
We'll be using the 20 newsgroups dataset as a demo for this tutorial; it contains about 18,000 news posts on 20 different topics. If you have a custom classification dataset, you can follow along as well, since only a few changes are needed. For example, I've applied this tutorial to fake news detection, and it works great.
You can also fine-tune BERT for any downstream task, such as semantic textual similarity; check this tutorial if you want to do that.
To get started, let's install the Huggingface transformers library along with others:
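The original install command isn't reproduced here; a typical setup in a Colab notebook or terminal might look like the following (the exact package list is an assumption, not the original post's):

```python
!pip install transformers torch scikit-learn numpy
```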
Open up a new notebook/Python file and import the necessary modules:
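Here is a minimal set of imports that covers everything used in the sketches below (the original post's exact imports may differ slightly):

```python
import random

import numpy as np
import torch
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)
```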
Next, let's make a function to set seed so we'll have the same results in different runs:
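A minimal sketch of such a function (the Transformers library also ships its own transformers.set_seed helper you could use instead):

```python
def set_seed(seed: int):
    """Seed Python, NumPy and PyTorch RNGs so results are reproducible across runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(1)
```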
As mentioned earlier, we'll be using the BERT model. More specifically, we'll be using the bert-base-uncased pre-trained weights from the library. Again, if you wish to pre-train BERT on your own large dataset, this tutorial should help you do that. We'll also be using a max_length of 512:
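A sketch of these two settings (the variable names are mine, not necessarily the original post's):

```python
# the pre-trained weights to fine-tune and the maximum sequence length we'll use
model_name = "bert-base-uncased"
max_length = 512
```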
max_length is the maximum length of our sequence. In other words, we'll only be picking the first 512 tokens from each document or post, and you can always change it to whatever you want. However, if you increase it, make sure it fits in your memory during training, even when using a smaller batch size.
Learn also: Conversational AI Chatbot with Transformers in Python.
Next, let's download and load the tokenizer responsible for converting our text to sequences of tokens:
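For example, using the fast tokenizer class that matches our pre-trained weights:

```python
# download & load the tokenizer that matches bert-base-uncased
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
```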
We also set do_lower_case to True to make sure we lowercase all the text (remember, we're using the uncased model).
The below code downloads and loads the dataset:
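The original snippet isn't reproduced here; a sketch using scikit-learn's built-in loader (the helper name and the 80/20 split are assumptions) could look like this:

```python
def read_20newsgroups(test_size=0.2):
    # download & load the 20 newsgroups dataset, stripping headers, footers and quotes
    dataset = fetch_20newsgroups(subset="all", shuffle=True,
                                 remove=("headers", "footers", "quotes"))
    documents = dataset.data
    labels = dataset.target
    # split into training & validation sets, and also return the label names
    return train_test_split(documents, labels, test_size=test_size), dataset.target_names

(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()
```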
Each of train_texts and valid_texts is a list of documents (i.e., a list of strings) for the training and validation sets, respectively. The same goes for train_labels and valid_labels: each is a list of integer labels ranging from 0 to 19. target_names is a list of the names of our 20 labels.
Now let's use our tokenizer to encode our corpus:
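A sketch of the encoding step, assuming the tokenizer and variables defined above:

```python
# tokenize both splits, truncating to max_length and padding shorter documents
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
```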
We set truncation to True so that tokens beyond max_length are cut off, and we set padding to True so that documents shorter than max_length are padded with padding tokens.
The below code wraps our tokenized text data into a torch Dataset:
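A sketch of such a wrapper class (the class name is mine):

```python
class NewsGroupsDataset(torch.utils.data.Dataset):
    """Wraps the tokenizer output and the labels so the Trainer can consume them."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # build a dict of tensors for one sample and attach its label
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
```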
Since we're going to use Trainer from the Transformers library, which expects our dataset as a torch.utils.data.Dataset, we made a simple class that implements the __len__() method to return the number of samples and the __getitem__() method to return a data sample at a specific index.
Now that we have our data prepared, let's download and load our BERT model and its pre-trained weights:
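A sketch of the model loading step, assuming a CUDA-capable GPU:

```python
# load BERT with a fresh classification head sized for our 20 classes,
# then move it to the GPU (drop .to("cuda") if you're on CPU)
model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=len(target_names)
).to("cuda")
```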
We're using the BertForSequenceClassification class from the Transformers library, and we set num_labels to the number of available labels, in this case 20. We also move our model to our CUDA GPU. If you're on a CPU (not suggested), simply remove the to() call.
Before we start fine-tuning our model, let's make a simple function to compute the metrics we want. In this case, accuracy:
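A sketch of that callback, using scikit-learn's accuracy_score:

```python
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # accuracy only for now; precision, recall, F1, etc. could be added here
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}
```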
You're free to include any metric you want; I've included accuracy, but you can add precision, recall, etc.
The below code uses the TrainingArguments class to specify our training arguments, such as the number of epochs, batch size, and some other parameters:
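Here's a sketch of those arguments. The values not discussed in the text (number of epochs, warmup steps, weight decay, evaluation batch size) are illustrative, and depending on your Transformers version the evaluation_strategy argument may be spelled eval_strategy:

```python
training_args = TrainingArguments(
    output_dir="./results",          # directory for checkpoints
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir="./logs",            # directory for storing logs
    load_best_model_at_end=True,     # load the best checkpoint when training is done
    logging_steps=400,               # log & evaluate every 400 steps
    save_steps=400,                  # save a checkpoint every 400 steps
    evaluation_strategy="steps",     # evaluate during training, every logging_steps
)
```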
Each argument is explained in the code comments. I've specified 8 as the training batch size because that's the maximum I can fit in a Google Colab environment's memory. If you get a CUDA out-of-memory error, make sure to decrease it further. If you have a more powerful GPU in your environment, increasing the batch size will make training significantly faster.
You can also tweak other parameters, such as increasing the number of epochs for better training.
I've set logging_steps and save_steps to 400, which means the model will be evaluated and a checkpoint saved after every 400 steps. Make sure to increase this value if you decrease the batch size below 8; otherwise, checkpoints will be saved after every few steps, which may fill up your entire environment's disk space.
We then pass our training arguments, datasets, and compute_metrics callback to our Trainer:
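A sketch of wiring everything together:

```python
trainer = Trainer(
    model=model,                     # the pre-trained BERT model to fine-tune
    args=training_args,              # training arguments defined above
    train_dataset=train_dataset,     # training set
    eval_dataset=valid_dataset,      # validation set
    compute_metrics=compute_metrics, # callback that computes our accuracy metric
)
```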
Training the model:
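With the Trainer set up, this is a single call:

```python
# start fine-tuning
trainer.train()
```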
This will take several minutes or hours depending on your environment; here's my output on Google Colab:
As you can see, the validation loss gradually decreased, and the accuracy increased to over 77.8%.
Remember that we set load_best_model_at_end to True, so the best-performing model is automatically loaded when training finishes. Let's confirm that with the evaluate() method:
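A sketch of that call, which evaluates the currently loaded model on the validation set:

```python
# evaluate the current (best) model on the validation set
trainer.evaluate()
```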
This will take several seconds to output something like this:
Now that we trained our model, let's save it for inference later:
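A minimal sketch, assuming a hypothetical output directory name:

```python
# save the fine-tuned model & tokenizer so we can reload them for inference
model_path = "20newsgroups-bert-base-uncased"   # hypothetical directory name
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
```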
Now we have a trained model on our dataset, let's try to have some fun with it!
The below function takes a text as a string, tokenizes it with our tokenizer, calculates the output probabilities using the softmax function, and returns the actual label:
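A sketch of such a function, assuming the model is still on the GPU:

```python
def get_prediction(text):
    # tokenize the text with the same truncation/padding settings used for training
    inputs = tokenizer(text, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt").to("cuda")
    # forward pass, then turn the logits into probabilities
    outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=1)
    # return the human-readable name of the most probable class
    return target_names[probs.argmax().item()]
```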
Here's an example:
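The original example text isn't reproduced here; a hypothetical Mac-related snippet would look like this:

```python
text = """
The new MacBook Pro ships with Apple silicon, and the battery life is
noticeably better than the previous Intel-based models.
"""
print(get_prediction(text))
# expected to print a Mac hardware label such as comp.sys.mac.hardware
```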
Output:
As expected, we're talking about Macbooks. Here's a second example:
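Again, the original text isn't shown; a hypothetical space-related snippet:

```python
text = """
NASA is planning another crewed mission to the Moon, while private companies
keep launching reusable rockets into orbit.
"""
print(get_prediction(text))
# expected to print the space label, i.e. sci.space
```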
Output:
This is a label of science -> space, as expected!
Yet another example:
Output:
In this tutorial, you've learned how to fine-tune BERT using the Huggingface Transformers library on your own dataset.
Note that you can also use other transformer models, such as GPT-2 with GPT2ForSequenceClassification, RoBERTa with RobertaForSequenceClassification, DistilBERT with DistilBertForSequenceClassification, and many more. Please head to the official documentation for a list of available models.
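As a sketch, the generic Auto classes make swapping architectures straightforward (the checkpoint name below is just an example):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# e.g. DistilBERT instead of BERT; any sequence-classification checkpoint works the same way
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names))
```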
Also, if your dataset is in a language other than English, make sure you pick pre-trained weights for your language; this will help a lot during training. Check this link and use the filter to get the model weights you need.
Learn also: How to Perform Text Summarization using Transformers in Python.