Fake news is the intentional broadcasting of false or misleading claims as news, where the statements are purposely deceitful.
Newspapers, tabloids, and magazines have been supplanted by digital news platforms, blogs, social media feeds, and a plethora of mobile news applications. News organizations benefitted from the increased use of social media and mobile platforms by providing subscribers with up-to-the-minute information.
Consumers now have instant access to the latest news. These digital media platforms have increased in prominence due to their easy connectedness to the rest of the world and allow users to discuss and share ideas and debate topics such as democracy, education, health, research, and history. Fake news items on digital platforms are getting more popular and are used for profit, such as political and financial gain.
Because the Internet, social media, and digital platforms are widely used, anybody may propagate inaccurate and biased information. It is almost impossible to prevent the spread of fake news. There is a tremendous surge in the distribution of false news, which is not restricted to one sector such as politics but includes sports, health, history, entertainment, and science and research.
It is vital to recognize and differentiate between false and accurate news. One method is to have experts review and fact-check every piece of information, but this takes time and requires expertise that cannot be shared. Secondly, we can use machine learning and artificial intelligence tools to automate the identification of fake news.
Online news information includes various unstructured format data (such as documents, videos, and audio), but we will concentrate on text format news here. With the progress of machine learning and Natural language processing, we can now recognize the misleading and false character of an article or statement.
Several studies and experiments are being conducted to detect fake news across all mediums.
Our main goal for this tutorial is to explore and analyze the fake news dataset and to build a classifier that can distinguish fake news from reliable news using a fine-tuned BERT model.
In this work, we utilized the fake news dataset from Kaggle to classify untrustworthy news articles as fake news. We have a complete training dataset containing the following characteristics:
- id: unique id for a news article
- title: title of a news article
- author: author of the news article
- text: text of the article; could be incomplete
- label: a label that marks the article as potentially unreliable, denoted by 1 (unreliable or fake) or 0 (reliable)

It is a binary classification problem in which we must predict if a particular news story is reliable or not.
If you have a Kaggle account, you can simply download the dataset from the website there and extract the ZIP file.
I also uploaded the dataset to Google Drive, and you can get it here or use the gdown library to download it automatically in Google Colab or Jupyter notebooks:
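A minimal sketch of the download step with gdown; FILE_ID and the archive name fake-news.zip are placeholders, not the real values:

```python
import gdown

# FILE_ID is a placeholder for the Google Drive file id of the dataset archive
gdown.download("https://drive.google.com/uc?id=FILE_ID", "fake-news.zip", quiet=False)
```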
Unzipping the files:
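For example, using Python's zipfile module (the archive name is an assumption):

```python
from zipfile import ZipFile

# extract the downloaded archive into the current working directory
with ZipFile("fake-news.zip") as zf:
    zf.extractall()
```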
Three files will appear in the current working directory: train.csv, test.csv, and submit.csv. We will be using train.csv for most of the tutorial.
Installing the required dependencies:
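The exact dependency list is not pinned down here; something along these lines covers the libraries used later in the tutorial:

```
pip install transformers nltk pandas numpy matplotlib seaborn wordcloud scikit-learn
```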
Note: If you're in a local environment, make sure you install PyTorch for GPU, head to this page for a proper installation.
Let's import the essential libraries for analysis:
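A sketch of the imports used for the analysis part (the exact set is an assumption):

```python
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
```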
The NLTK corpora and modules must be installed using the standard NLTK downloader:
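For instance (the exact set of corpora is an assumption based on the preprocessing steps described below):

```python
import nltk

# corpora used for stop word removal, tokenization, and lemmatization
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")
```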
The fake news dataset comprises various authors' original and fictitious article titles and text. Let's import our dataset:
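Assuming the training dataframe is named news_d (a name used throughout the sketches below):

```python
# load the training set
news_d = pd.read_csv("train.csv")
print("Shape of news data:", news_d.shape)
print("Columns:", list(news_d.columns))
```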
Output:
Here's how the dataset looks:
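For example, using the news_d dataframe from above:

```python
news_d.head()
```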
Output:
We have 20,800 rows and five columns. Let's see some statistics of the text column:
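One way to compute word-count statistics (a sketch, assuming the news_d dataframe):

```python
# word-count statistics for the article text
txt_length = news_d["text"].str.split().str.len()
txt_length.describe()
```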
Output:
Stats for the title column:
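Same idea for the title column:

```python
# word-count statistics for the titles
title_length = news_d["title"].str.split().str.len()
title_length.describe()
```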
Output:
The statistics for the training and testing sets are as follows:
- The text attribute has a higher word count, with an average of 760 words, and 75% of articles having more than 1,000 words.
- The title attribute is a short statement, with an average of 12 words, and 75% of them are around 15 words.

Our experiment will use both the text and title together.
Count plots for both labels:
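A sketch with seaborn (the exact styling is an assumption):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# share of each label as a percentage
print(round(news_d["label"].value_counts(normalize=True), 2) * 100)

# count plot of reliable (0) vs. fake (1) articles
sns.countplot(x="label", data=news_d)
plt.title("Count of news labels")
plt.show()
```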
Output:
Output:
The number of untrustworthy articles (fake, or 1) is 10,413, while the number of trustworthy articles (reliable, or 0) is 10,387. Almost 50% of the articles are fake, so the dataset is balanced; therefore, accuracy is a suitable metric for measuring how well our model is doing when we build the classifier.
In this section, we will clean our dataset to do some analysis:
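Below is a minimal sketch of such a cleaning pipeline, organized around the helper functions described right after it; the dropped columns and the regex patterns are assumptions:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = stopwords.words('english')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def remove_unused_c(df, columns=("id", "author")):
    # drop columns we will not use for this analysis (assumed to be id and author)
    return df.drop(columns=list(columns), errors="ignore")

def null_process(df):
    # replace missing (NaN) values with None
    return df.where(df.notnull(), None)

def clean_dataset(df):
    # dataframe-level cleaning: drop unused columns and normalize missing values
    df = remove_unused_c(df)
    df = null_process(df)
    return df

def clean_text(text):
    # lowercase the text and strip links, HTML tags, punctuation, and extra whitespace
    text = str(text).lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"<.*?>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def nltk_preprocess(text):
    # tokenize, remove stop words, then stem and lemmatize each token
    tokens = word_tokenize(clean_text(text))
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return " ".join(tokens)
```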
In the code block above:
- re for regex.
- nltk.corpus for stop words. When working with words, particularly when considering semantics, we sometimes need to eliminate common words that do not add any significant meaning to a statement, such as "but", "can", "we", etc.
- PorterStemmer is used to perform stemming with NLTK. Stemmers strip words of their morphological affixes, leaving only the word stem.
- WordNetLemmatizer() from the NLTK library is used for lemmatization. Lemmatization is much more effective than stemming. It goes beyond word reduction and evaluates a language's whole lexicon to apply morphological analysis to words, with the goal of removing only inflectional endings and returning the base or dictionary form of a word, known as the lemma.
- stopwords.words('english') lets us look at the list of all the English stop words supported by NLTK.
- The remove_unused_c() function is used to remove the unused columns.
- Missing (NaN) values are replaced with None using the null_process() function.
- In clean_dataset(), we call the remove_unused_c() and null_process() functions. This function is responsible for data cleaning.
- The text itself is lowercased and stripped of links, punctuation, and other noise in the clean_text() function.
- Tokenization, stop word removal, stemming, and lemmatization are applied in the nltk_preprocess() function.

Preprocessing the text and title:
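Applying the preprocessing to both columns (assuming the news_d dataframe from earlier):

```python
# preprocess the text and title columns in place
news_d["text"] = news_d["text"].apply(nltk_preprocess)
news_d["title"] = news_d["title"].apply(nltk_preprocess)
news_d.head()
```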
Output:
In this section, we will perform some exploratory analysis of the dataset, including word clouds of the most frequent words and plots of the most common N-grams.
The most frequent words appear in a bold and bigger font in a word cloud. This section will perform a word cloud for all words in the dataset.
We will use the WordCloud class from the wordcloud library, and its generate() method is utilized for generating the word cloud image:
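A sketch of that step (the figure size and colors are assumptions):

```python
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# join every article into one long string
all_text = " ".join(news_d["text"].astype(str))

wordcloud = WordCloud(width=800, height=500, stopwords=STOPWORDS,
                      background_color="white").generate(all_text)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```

For the label-specific clouds below, the same code is run on the text of news_d[news_d["label"] == 0] (reliable) and news_d[news_d["label"] == 1] (fake).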
Output:
Word cloud for reliable news only:
Output:
Word cloud for fake news only:
Output:
An N-gram is a sequence of letters or words. A character unigram is made up of a single character, while a bigram comprises a series of two characters. Similarly, word N-grams are made up of a series of n words. The word "united" is a 1-gram (unigram), the combination of the words "united state" is a 2-gram (bigram), and "new york city" is a 3-gram (trigram).
Let's plot the most common bigram on the reliable news:
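One way to do this is with scikit-learn's CountVectorizer; the helper below (plot_top_ngrams is a name introduced here for illustration) is reused for the fake-news and trigram plots that follow:

```python
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

def plot_top_ngrams(corpus, title, ngram_range=(2, 2), top_n=20):
    # count all n-grams in the corpus and plot the most frequent ones
    vec = CountVectorizer(ngram_range=ngram_range).fit(corpus)
    bag_of_words = vec.transform(corpus)
    counts = bag_of_words.sum(axis=0).A1
    freqs = sorted(zip(vec.get_feature_names_out(), counts),
                   key=lambda x: x[1], reverse=True)[:top_n]
    words, values = zip(*freqs)
    plt.figure(figsize=(10, 6))
    plt.barh(words[::-1], values[::-1])
    plt.title(title)
    plt.show()

# most common bigrams in the reliable news (label == 0)
reliable_text = news_d[news_d["label"] == 0]["text"].astype(str)
plot_top_ngrams(reliable_text, "Top bigrams in reliable news", ngram_range=(2, 2))
```

For the fake-news plots, filter with label == 1 instead, and pass ngram_range=(3, 3) for the trigrams.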
The most common bigram on the fake news:
The most common trigram on reliable news:
For fake news now:
The above plots give us some ideas on how both classes look. In the next section, we'll use the transformers library to build a fake news detector.
This section borrows code extensively from the fine-tuning BERT tutorial to build a fake news classifier using the transformers library. So, for more detailed information, you can head to the original tutorial.
If you didn't install transformers, you have to:
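In a notebook or terminal:

```
pip install transformers
```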
Let's import the necessary libraries:
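A sketch of the imports needed for this part (the exact list is an assumption):

```python
import random

import numpy as np
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)
```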
We want to make our results reproducible even if we restart our environment:
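A minimal seeding helper (the seed value is arbitrary):

```python
def set_seed(seed: int = 1):
    # fix the random seeds so runs are reproducible
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(1)
```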
The model we're going to use is bert-base-uncased:
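We also fix a maximum sequence length; 512 tokens is BERT's upper limit (the variable names here are assumptions):

```python
# model checkpoint and maximum sequence length
model_name = "bert-base-uncased"
max_length = 512
```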
Loading the tokenizer:
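For example:

```python
# load the fast tokenizer that matches the checkpoint
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
```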
Let's now clean NaN values from the text, author, and title columns:
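A sketch, assuming we reload the raw training data into a dataframe named news_df:

```python
# load the raw training data and drop rows with missing text, author, or title
news_df = pd.read_csv("train.csv")
news_df = news_df.dropna(subset=["text", "author", "title"])
```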
Next, let's make a function that takes the dataset as a Pandas dataframe and returns the train/validation splits of texts and labels as lists:
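A sketch of such a function (the name prepare_data and the separators used to join author, title, and text are assumptions):

```python
def prepare_data(df, test_size=0.2, include_title=True, include_author=True):
    # build the input string for each article, optionally prepending title and author,
    # then split everything into training and validation lists
    texts, labels = [], []
    for i in range(len(df)):
        text = df["text"].iloc[i]
        label = df["label"].iloc[i]
        if include_title:
            text = df["title"].iloc[i] + " - " + text
        if include_author:
            text = df["author"].iloc[i] + " : " + text
        if text and label in (0, 1):
            texts.append(text)
            labels.append(int(label))
    return train_test_split(texts, labels, test_size=test_size)

train_texts, valid_texts, train_labels, valid_labels = prepare_data(news_df)
```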
The above function takes the dataset as a dataframe and returns it as lists split into training and validation sets. Setting include_title to True means that we add the title column to the text we're going to use for training, and setting include_author to True means we add the author to the text as well.
Let's make sure the labels and texts have the same length:
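For example:

```python
print(len(train_texts), len(train_labels))
print(len(valid_texts), len(valid_labels))
```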
Output:
Learn also: Fine-tuning BERT for Semantic Textual Similarity with Transformers in Python.
Let's use the BERT tokenizer to tokenize our dataset:
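A sketch of the tokenization step:

```python
# tokenize, truncating long articles to max_length and padding shorter ones
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
```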
Converting the encodings into a PyTorch dataset:
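A minimal wrapper dataset (the class name is an assumption):

```python
class NewsGroupsDataset(torch.utils.data.Dataset):
    # wraps the tokenizer output and labels so the Trainer can batch them
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)
```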
We'll be using BertForSequenceClassification to load our BERT transformer model:
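For example:

```python
# load the pre-trained BERT model with a 2-class classification head
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to("cuda" if torch.cuda.is_available() else "cpu")
```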
We set num_labels to 2 since it's a binary classification task. The function below is a callback to calculate the accuracy on each validation step:
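A sketch of such a callback:

```python
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    # compare the predicted class (argmax of the logits) with the true labels
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    return {"accuracy": accuracy_score(labels, preds)}
```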
Let's initialize the training parameters:
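A sketch of the training arguments; the batch size of 10 and the 200-step logging/saving interval match the values discussed below, while the other values (output directory, number of epochs, evaluation batch size) are assumptions:

```python
training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    num_train_epochs=1,              # total number of training epochs (assumption)
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation (assumption)
    logging_steps=200,               # log and evaluate every 200 steps
    save_steps=200,                  # save a checkpoint every 200 steps
    evaluation_strategy="steps",     # evaluate during training
    load_best_model_at_end=True,     # reload the best checkpoint when training ends
)
```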
I've set the per_device_train_batch_size to 10, but you should set it as high as your GPU can possibly fit. We set the logging_steps and save_steps to 200, meaning we're going to perform evaluation and save the model weights every 200 training steps.
You can check this page for more detailed information about the available training parameters.
Let's instantiate the trainer:
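For example:

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
)
```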
Training the model:
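With the Trainer set up, this is a single call:

```python
trainer.train()
```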
The training takes a few hours to finish, depending on your GPU. If you're on the free version of Colab, it should take an hour with NVIDIA Tesla K80. Here is the output:
Since load_best_model_at_end is set to True, the best weights will be loaded when the training is completed. Let's evaluate it with our validation set:
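For example:

```python
trainer.evaluate()
```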
Output:
Saving the model and the tokenizer:
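A sketch (the folder name is an assumption):

```python
# save the fine-tuned model and the tokenizer to a local folder
model_path = "fake-news-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
```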
A new folder containing the model configuration and weights will appear after running the above cell. If you want to perform prediction later, you can simply use the from_pretrained() method we used when we loaded the model, and you're good to go.
Next, let's make a function that accepts the article text as an argument and returns whether it's fake or not:
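A sketch of such a helper; the convert_to_label flag is introduced here for convenience, and the label mapping follows the dataset description (1 = fake, 0 = reliable):

```python
def get_prediction(text, convert_to_label=False):
    # tokenize the article, run it through the fine-tuned model, and return the predicted class
    inputs = tokenizer(text, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=1)
    d = {0: "reliable", 1: "fake"}
    pred = int(probs.argmax(dim=1).item())
    return d[pred] if convert_to_label else pred
```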
I've taken an example from test.csv that the model never saw to perform inference on. I've checked it, and it's an actual article from The New York Times:
The original text is in the Colab environment if you want to copy it, as it's a complete article. Let's pass it to the model and see the results:
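Assuming the article text has been stored in a variable named real_news (the text itself is not reproduced here):

```python
print(get_prediction(real_news, convert_to_label=True))
```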
Output:
In this section, we will predict all the articles in test.csv to create a submission file and see our accuracy on the test set in the Kaggle competition:
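A sketch of the submission step; the new_text column, the separators, and the output file name are assumptions, while the id and label columns come from the competition format:

```python
# read the test set, fill missing fields, and build the same input format used in training
test_df = pd.read_csv("test.csv")
test_df = test_df.fillna("")
test_df["new_text"] = test_df["author"] + " : " + test_df["title"] + " - " + test_df["text"]

# predict a label for every article and write the Kaggle submission file
test_df["label"] = test_df["new_text"].apply(get_prediction)
test_df[["id", "label"]].to_csv("submit_final.csv", index=False)
```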
After we concatenate the author, title, and article text together, we apply the get_prediction() function to the new column to fill the label column, and we then use the to_csv() method to create the submission file for Kaggle. Here's my submission score:
We got 99.78% and 100% accuracy on the private and public leaderboards, respectively. That's awesome!
Alright, we're done with the tutorial. You can check this page to see various training parameters you can tweak.
If you have a custom fake news dataset for fine-tuning, you simply have to pass a list of samples to the tokenizer as we did; you won't need to change any other code after that.
Check the complete code here, or the Colab environment here.
Learn also: How to Perform Text Summarization using Transformers in Python
Happy learning ♥