Highlighting or annotating text in a PDF file is a great strategy for reading and retaining key information. It brings important information immediately to the reader's attention: text highlighted in yellow will probably catch your eye first.
Redacting a PDF file allows you to hide sensitive information while keeping your document's formatting. This protects private and confidential information before sharing. Moreover, it boosts the organization's integrity and credibility in handling sensitive information.
In this tutorial, you will learn how to redact, frame, or highlight text in PDF files using Python.
In this guide, we'll be using the PyMuPDF library, which is a highly versatile, customizable PDF, XPS, and EBook interpreter solution that can be used across a wide range of applications as a PDF renderer, viewer, or toolkit.
The goal of this tutorial is to develop a lightweight command-line-based utility to redact, frame, or highlight a text included in one PDF file or within a folder containing a collection of PDF files. Moreover, it will enable you to remove the highlights from a PDF file or a collection of PDF files.
Let's install the requirements (PyMuPDF is published on PyPI under the package name PyMuPDF):
Open up a new Python file, and let's get started:
The extract_info() function collects the metadata of a PDF file. The attributes that can be extracted are format, title, author, subject, keywords, creator, producer, creation date, modification date, trapped, encryption, and the number of pages. It is worth noting that these attributes cannot be extracted when you target an encrypted PDF file.
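As a rough illustration of the metadata step, the sketch below reads the metadata dictionary of an already opened PyMuPDF document. The function names summarize_metadata and print_metadata and the "pages" key are assumptions made for this example; this is not the tutorial's own extract_info() code.

```python
def summarize_metadata(doc):
    """Return the metadata of an opened PyMuPDF document plus its page count.

    `doc` is expected to be the object returned by fitz.open(path).
    """
    metadata = doc.metadata
    # The metadata attribute can be empty or None when nothing is readable,
    # for example while the document is still encrypted.
    if not metadata:
        return {}
    info = dict(metadata)
    info["pages"] = len(doc)  # len() of a document is its page count
    return info


def print_metadata(info):
    """Print one 'key: value' line per metadata entry."""
    for key, value in info.items():
        print(f"{key}: {value}")
```

With PyMuPDF installed, this could be called as summarize_metadata(fitz.open("sample.pdf")), where sample.pdf is a placeholder path.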
This function searches for a string within the document lines using the re.findall() function; the re.IGNORECASE flag makes the search case-insensitive.
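A minimal sketch of such a search, assuming the page text has already been split into a list of lines. The name search_for_text and its signature are chosen for this example only.

```python
import re


def search_for_text(lines, search_str):
    """Yield each case-insensitive match of search_str found in lines."""
    for line in lines:
        # findall returns every non-overlapping match within the line
        for match in re.findall(search_str, line, re.IGNORECASE):
            yield match
```

Because the search string is handed to re.findall() unchanged, it is treated as a regular expression. If it contains capturing groups, findall returns the groups rather than the whole match, so plain patterns are the safer choice here.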
This function performs the following:
You can change the color of the redaction using the fill argument of the page.addRedactAnnot() method. Setting it to (0, 0, 0) results in a black redaction. These are RGB values ranging from 0 to 1; for example, (1, 0, 0) results in a red redaction, and so on.
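The redaction step might look like the sketch below. It assumes `page` is a PyMuPDF page and uses the snake_case method names of recent PyMuPDF releases (search_for, add_redact_annot, apply_redactions) in place of the older camelCase spelling used in the text. The function name redact_matches and the color check are additions made for this example.

```python
def redact_matches(page, matches, fill=(0, 0, 0)):
    """Redact every occurrence of the given strings on a page.

    `fill` is an RGB tuple whose components range from 0 to 1;
    the default (0, 0, 0) gives a black redaction.
    """
    if len(fill) != 3 or not all(0 <= c <= 1 for c in fill):
        raise ValueError("fill must be three RGB components between 0 and 1")
    count = 0
    for text in matches:
        # search_for returns the rectangles enclosing each occurrence
        for area in page.search_for(text):
            page.add_redact_annot(area, fill=fill)
            count += 1
    if count:
        # Applying the redactions removes the covered content from the page
        page.apply_redactions()
    return count
```

Passing fill=(1, 0, 0) would produce red boxes instead of black ones.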
The frame_matching_data() function draws a red rectangle (frame) around the matching values.
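A sketch of the framing idea, assuming `page` is a PyMuPDF page and using the snake_case API of recent releases (add_rect_annot, set_colors, update). The name frame_matches and the RED constant are chosen for this example and are not the tutorial's own code.

```python
RED = (1, 0, 0)  # RGB components range from 0 to 1


def frame_matches(page, matches, color=RED):
    """Draw a rectangle around every occurrence of the given strings."""
    count = 0
    for text in matches:
        for area in page.search_for(text):
            annot = page.add_rect_annot(area)
            annot.set_colors(stroke=color)  # border color of the rectangle
            annot.update()                  # make the color change take effect
            count += 1
    return count
```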
Next, let's define a function to highlight text:
The above function applies the appropriate highlighting mode to the matching values, depending on the highlight type passed as a parameter.
You can always change the color of the highlight using the highlight.setColors() method as shown in the comments.
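One way to pick the highlighting mode is a lookup table from the highlight type to the matching page method, as sketched below. It assumes a PyMuPDF page and the snake_case method names of recent releases. The table HIGHLIGHT_METHODS and the function highlight_matches are names invented for this example.

```python
# Maps a highlight type to the PyMuPDF page method that creates it.
HIGHLIGHT_METHODS = {
    "Highlight": "add_highlight_annot",
    "Squiggly": "add_squiggly_annot",
    "Underline": "add_underline_annot",
    "Strikeout": "add_strikeout_annot",
}


def highlight_matches(page, matches, kind="Highlight", color=None):
    """Mark every occurrence of the given strings with the chosen mode."""
    if kind not in HIGHLIGHT_METHODS:
        raise ValueError(f"unknown highlight type: {kind!r}")
    add_annot = getattr(page, HIGHLIGHT_METHODS[kind])
    count = 0
    for text in matches:
        for area in page.search_for(text):
            annot = add_annot(area)
            if color is not None:
                # Optional custom color, given as RGB components from 0 to 1
                annot.set_colors(stroke=color)
                annot.update()
            count += 1
    return count
```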
The main purpose of the process_data() function is to search the selected pages of the PDF file for the given string and apply the requested action ("Redact", "Frame", "Highlight", etc.) to the matches.

It accepts several parameters:
- input_file: The path of the PDF file to process.
- output_file: The path of the PDF file to generate after processing.
- search_str: The string to search for.
- pages: The pages to consider while processing the PDF file.
- action: The action to perform on the PDF file.
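The core loop of such a function could be sketched as follows. This is an illustration under several assumptions, not the tutorial's process_data(): `doc` is an opened PyMuPDF document, `pages` holds 1-based page numbers (None meaning every page), and `action` is any callable that takes a page and the set of matched strings and returns how many matches it handled.

```python
import re


def process_pages(doc, search_str, action, pages=None):
    """Run action(page, matches) on the selected pages; return total matches."""
    total = 0
    for index in range(len(doc)):
        # Skip pages the caller did not ask for (page numbers are 1-based here)
        if pages and (index + 1) not in pages:
            continue
        page = doc[index]
        lines = page.get_text("text").split("\n")
        matches = set()
        for line in lines:
            matches.update(re.findall(search_str, line, re.IGNORECASE))
        if matches:
            total += action(page, matches)
    return total
```

The changes only reach the output file once the document is saved, for instance with doc.save(output_file).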
Next, let's write a function to remove the highlight in case we want to:
The purpose of the remove_highlight() function is to remove the highlights (not the redactions) from a PDF file. It performs the following:
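A removal pass over a single page could be sketched like this. It assumes a PyMuPDF page and relies on first_annot, Annot.next, Annot.type and delete_annot from recent releases. The function name remove_highlights is chosen for this example.

```python
def remove_highlights(page):
    """Delete every highlight annotation on a page; return how many were removed."""
    removed = 0
    annot = page.first_annot
    while annot:
        # annot.type is a pair such as (8, 'Highlight')
        if annot.type[1] == "Highlight":
            # delete_annot returns the annotation that followed the deleted one
            annot = page.delete_annot(annot)
            removed += 1
        else:
            annot = annot.next
    return removed
```

Applied redactions are not annotations any more, which is why a pass like this leaves them untouched.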
Now let's make a wrapper function that uses previous functions to call the appropriate function depending on the action:
The action can be "Redact", "Frame", "Highlight", "Squiggly", "Underline", "Strikeout", and "Remove".
Let's define the same function but with folders that contain multiple PDF files:
This function is intended to process the PDF files included within a specific folder.
It loops through the files of the specified folder, recursively or not depending on the value of the recursive parameter, and processes them one by one.
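The folder traversal itself needs only the standard library. The sketch below is one possible shape for it; the generator name iter_pdf_files is an assumption of this example.

```python
import os


def iter_pdf_files(input_folder, recursive=False):
    """Yield the paths of the PDF files found in input_folder."""
    for dirpath, _dirnames, filenames in os.walk(input_folder):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
        if not recursive:
            break  # only the top-level folder was requested
```

Each yielded path can then be handed to the single-file function, one file at a time.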
It accepts the following parameters:
- input_folder: The path of the folder containing the PDF files to process.
- search_str: The text to search for and manipulate.
- recursive: Whether to run this process recursively by looping across the subfolders or not.
- action: The action to perform among the list previously mentioned.
- pages: The pages to consider.

Before we make our main code, let's make a function for parsing command-line arguments:
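A possible argument parser is sketched below with argparse. Only the -i option is named in this tutorial's text; every other flag, the long option names, and the helper parse_pages are hypothetical choices made for this example.

```python
import argparse

ACTIONS = ["Redact", "Frame", "Highlight", "Squiggly",
           "Underline", "Strikeout", "Remove"]


def parse_pages(value):
    """Turn a string such as '1,3,5' into [1, 3, 5] (hypothetical format)."""
    return [int(part) for part in value.split(",")]


def parse_args(argv=None):
    """Parse the command line; argv=None means sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Redact, frame or highlight text in PDF files")
    parser.add_argument("-i", "--input_path", required=True,
                        help="PDF file or folder to process")
    # The flags below are assumptions, not taken from the tutorial
    parser.add_argument("-a", "--action", choices=ACTIONS, default="Highlight",
                        help="action to perform")
    parser.add_argument("-s", "--search_str",
                        help="text (regular expression) to search for")
    parser.add_argument("-p", "--pages", type=parse_pages,
                        help="comma-separated page numbers")
    parser.add_argument("-r", "--recursive", action="store_true",
                        help="also process subfolders")
    return parser.parse_args(argv)
```

Accepting an explicit argv list makes the parser easy to try out from an interactive session without touching sys.argv.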
Finally, let's write the main code:
Now let's test our program:
Output:
Before exploring our test scenarios, let me clarify a few points:
- To avoid a PermissionError, please close the input PDF file before running this utility.
- The search string is interpreted as a regular expression; for example, "organi[sz]e" matches both "organise" and "organize".

As a demonstration example, let's highlight the word "BERT" in the BERT paper:
Output:
As you can see, 121 matches were highlighted. You can use other highlighting options, such as underline, frame, and others. Here is the resulting PDF:
Let's remove it now:
The resulting PDF no longer contains the highlighting.
I invite you to play around with other actions, as I find it quite interesting to do it automatically with Python.
If you want to highlight text in multiple PDF files, you can either pass the folder to the -i parameter or merge the PDF files and run the code to produce a single PDF in which all the text you want is highlighted.
I hope you enjoyed this article and found it interesting. Check the full code here.
Happy coding ♥