Struggling with multiple programming languages? No worries. Our Code Converter has got you covered. Give it a go!
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1993 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.
Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it.
The familiarity of PDF led to its fast and widespread adoption as a solution in the field of digital archiving. Because PDFs are more versatile than other file formats, the information they display is easily viewable from almost any operating system or device.
In this tutorial, you will learn how to watermark a PDF file or a folder containing a collection of PDF files using PyPDF4 and reportlab in Python.
Download: Practical Python PDF Processing EBook.
First, let's install the necessary libraries:
$ pip install PyPDF4==1.27.0 reportlab==3.5.59
Starting with the code, let's import the libraries and define some configurations we'll need:
from PyPDF4 import PdfFileReader, PdfFileWriter
from PyPDF4.pdf import ContentStream
from PyPDF4.generic import TextStringObject, NameObject
from PyPDF4.utils import b_
import os
import argparse
from io import BytesIO
from typing import Tuple
# Import the reportlab library
from reportlab.pdfgen import canvas
# The size of the page supposedly A4
from reportlab.lib.pagesizes import A4
# The color of the watermark
from reportlab.lib import colors
PAGESIZE = A4
FONTNAME = 'Helvetica-Bold'
FONTSIZE = 40
# using colors module
# COLOR = colors.lightgrey
# or simply RGB
# COLOR = (190, 190, 190)
COLOR = colors.red
# The position attributes of the watermark
X = 250
Y = 10
# The rotation angle in order to display the watermark diagonally if needed
ROTATION_ANGLE = 45
Next, defining our first utility function:
def get_info(input_file: str):
"""
Extracting the file info
"""
# If PDF is encrypted the file metadata cannot be extracted
with open(input_file, 'rb') as pdf_file:
pdf_reader = PdfFileReader(pdf_file, strict=False)
output = {
"File": input_file, "Encrypted": ("True" if pdf_reader.isEncrypted else "False")
}
if not pdf_reader.isEncrypted:
info = pdf_reader.getDocumentInfo()
num_pages = pdf_reader.getNumPages()
output["Author"] = info.author
output["Creator"] = info.creator
output["Producer"] = info.producer
output["Subject"] = info.subject
output["Title"] = info.title
output["Number of pages"] = num_pages
# To Display collected metadata
print("## File Information ##################################################")
print("\n".join("{}:{}".format(i, j) for i, j in output.items()))
print("######################################################################")
return True, output
The get_info()
function collects the metadata of an input PDF file, the following attributes can be extracted: author, creator, producer, subject, title, and the number of pages.
It is worth noting that you cannot extract these attributes for an encrypted PDF file.
def get_output_file(input_file: str, output_file: str):
"""
Check whether a temporary output file is needed or not
"""
input_path = os.path.dirname(input_file)
input_filename = os.path.basename(input_file)
# If output file is empty -> generate a temporary output file
# If output file is equal to input_file -> generate a temporary output file
if not output_file or input_file == output_file:
tmp_file = os.path.join(input_path, 'tmp_' + input_filename)
return True, tmp_file
return False, output_file
The above function returns a path for a temporary output file when no output file is specified, or in case the paths of the input and output files are equal.
def create_watermark(wm_text: str):
"""
Creates a watermark template.
"""
if wm_text:
# Generate the output to a memory buffer
output_buffer = BytesIO()
# Default Page Size = A4
c = canvas.Canvas(output_buffer, pagesize=PAGESIZE)
# you can also add image instead of text
# c.drawImage("logo.png", X, Y, 160, 160)
# Set the size and type of the font
c.setFont(FONTNAME, FONTSIZE)
# Set the color
if isinstance(COLOR, tuple):
color = (c/255 for c in COLOR)
c.setFillColorRGB(*color)
else:
c.setFillColor(COLOR)
# Rotate according to the configured parameter
c.rotate(ROTATION_ANGLE)
# Position according to the configured parameter
c.drawString(X, Y, wm_text)
c.save()
return True, output_buffer
return False, None
This function performs the following:
Master PDF Manipulation with Python by building PDF tools from scratch. Get your copy now!
Download EBookNote that you can instead of using the drawString()
method to write text, you can use drawImage()
to draw an image, as written as a comment in the above function.
def save_watermark(wm_buffer, output_file):
"""
Saves the generated watermark template to disk
"""
with open(output_file, mode='wb') as f:
f.write(wm_buffer.getbuffer())
f.close()
return True
save_watermark()
saves the generated watermark template to a physical file in case you need to visualize it.
Now let's write the function that's responsible for adding a watermark to a given PDF file:
def watermark_pdf(input_file: str, wm_text: str, pages: Tuple = None):
"""
Adds watermark to a pdf file.
"""
result, wm_buffer = create_watermark(wm_text)
if result:
wm_reader = PdfFileReader(wm_buffer)
pdf_reader = PdfFileReader(open(input_file, 'rb'), strict=False)
pdf_writer = PdfFileWriter()
try:
for page in range(pdf_reader.getNumPages()):
# If required to watermark specific pages not all the document pages
if pages:
if str(page) not in pages:
continue
page = pdf_reader.getPage(page)
page.mergePage(wm_reader.getPage(0))
pdf_writer.addPage(page)
except Exception as e:
print("Exception = ", e)
return False, None, None
return True, pdf_reader, pdf_writer
This function aims to merge the inputted PDF file with the generated watermark. It accepts the following parameters:
input_file
: The path of the PDF file to watermark.wm_text
: The text to set as a watermark.pages
: The pages to watermark.It performs the following:
pdf_writer
object.def unwatermark_pdf(input_file: str, wm_text: str, pages: Tuple = None):
"""
Removes watermark from the pdf file.
"""
pdf_reader = PdfFileReader(open(input_file, 'rb'), strict=False)
pdf_writer = PdfFileWriter()
for page in range(pdf_reader.getNumPages()):
# If required for specific pages
if pages:
if str(page) not in pages:
continue
page = pdf_reader.getPage(page)
# Get the page content
content_object = page["/Contents"].getObject()
content = ContentStream(content_object, pdf_reader)
# Loop through all the elements page elements
for operands, operator in content.operations:
# Checks the TJ operator and replaces the corresponding string operand (Watermark text) with ''
if operator == b_("Tj"):
text = operands[0]
if isinstance(text, str) and text.startswith(wm_text):
operands[0] = TextStringObject('')
page.__setitem__(NameObject('/Contents'), content)
pdf_writer.addPage(page)
return True, pdf_reader, pdf_writer
The purpose of this function is to remove the watermark text from the PDF file. It accepts the following parameters:
input_file
: The path of the PDF file to watermark.wm_text
: The text to set as a watermark.pages
: The pages to watermark.It performs the following:
TJ
and replaces the string (watermark text) following this operator.pdf_writer
object.def watermark_unwatermark_file(**kwargs):
input_file = kwargs.get('input_file')
wm_text = kwargs.get('wm_text')
# watermark -> Watermark
# unwatermark -> Unwatermark
action = kwargs.get('action')
# HDD -> Temporary files are saved on the Hard Disk Drive and then deleted
# RAM -> Temporary files are saved in memory and then deleted.
mode = kwargs.get('mode')
pages = kwargs.get('pages')
temporary, output_file = get_output_file(
input_file, kwargs.get('output_file'))
if action == "watermark":
result, pdf_reader, pdf_writer = watermark_pdf(
input_file=input_file, wm_text=wm_text, pages=pages)
elif action == "unwatermark":
result, pdf_reader, pdf_writer = unwatermark_pdf(
input_file=input_file, wm_text=wm_text, pages=pages)
# Completed successfully
if result:
# Generate to memory
if mode == "RAM":
output_buffer = BytesIO()
pdf_writer.write(output_buffer)
pdf_reader.stream.close()
# No need to create a temporary file in RAM Mode
if temporary:
output_file = input_file
with open(output_file, mode='wb') as f:
f.write(output_buffer.getbuffer())
f.close()
elif mode == "HDD":
# Generate to a new file on the hard disk
with open(output_file, 'wb') as pdf_output_file:
pdf_writer.write(pdf_output_file)
pdf_output_file.close()
pdf_reader.stream.close()
if temporary:
if os.path.isfile(input_file):
os.replace(output_file, input_file)
output_file = input_file
The above function accepts several parameters:
input_file
: The path of the PDF file to watermark.wm_text
: The text to set as a watermark.action
: The action to perform whether to watermark or to un-watermark file.mode
: The location of the temporary file whether to memory or hard disk.pages
: The pages to watermark.watermark_unwatermark_file()
function calls the previously defined functions watermark_pdf()
or unwatermark_pdf()
depending on the chosen action
.
Based on the selected mode
, and if the output file has a similar path as the input file or no output file is specified, then a temporary file will be created in case the selected mode
is HDD
(hard disk drive).
Next, let's add the ability to add or remove watermark from a folder containing multiple PDF files:
def watermark_unwatermark_folder(**kwargs):
"""
Watermarks all PDF Files within a specified path
Unwatermarks all PDF Files within a specified path
"""
input_folder = kwargs.get('input_folder')
wm_text = kwargs.get('wm_text')
# Run in recursive mode
recursive = kwargs.get('recursive')
# watermark -> Watermark
# unwatermark -> Unwatermark
action = kwargs.get('action')
# HDD -> Temporary files are saved on the Hard Disk Drive and then deleted
# RAM -> Temporary files are saved in memory and then deleted.
mode = kwargs.get('mode')
pages = kwargs.get('pages')
# Loop though the files within the input folder.
for foldername, dirs, filenames in os.walk(input_folder):
for filename in filenames:
# Check if pdf file
if not filename.endswith('.pdf'):
continue
# PDF File found
inp_pdf_file = os.path.join(foldername, filename)
print("Processing file:", inp_pdf_file)
watermark_unwatermark_file(input_file=inp_pdf_file, output_file=None,
wm_text=wm_text, action=action, mode=mode, pages=pages)
if not recursive:
break
This function loops throughout the files of the specified folder either recursively or not depending on the value of the recursive
parameter, and processes these files one by one.
Master PDF Manipulation with Python by building PDF tools from scratch. Get your copy now!
Download EBookNext, let's make a utility function to check whether a path is a file path or a directory path:
def is_valid_path(path):
"""
Validates the path inputted and checks whether it is a file path or a folder path
"""
if not path:
raise ValueError(f"Invalid Path")
if os.path.isfile(path):
return path
elif os.path.isdir(path):
return path
else:
raise ValueError(f"Invalid Path {path}")
Now that we have all the functions we need for this tutorial, let's make a final one for parsing command-line arguments:
def parse_args():
"""
Get user command line parameters
"""
parser = argparse.ArgumentParser(description="Available Options")
parser.add_argument('-i', '--input_path', dest='input_path', type=is_valid_path,
required=True, help="Enter the path of the file or the folder to process")
parser.add_argument('-a', '--action', dest='action', choices=[
'watermark', 'unwatermark'], type=str, default='watermark',
help="Choose whether to watermark or to unwatermark")
parser.add_argument('-m', '--mode', dest='mode', choices=['RAM', 'HDD'], type=str,
default='RAM', help="Choose whether to process on the hard disk drive or in memory")
parser.add_argument('-w', '--watermark_text', dest='watermark_text',
type=str, required=True, help="Enter a valid watermark text")
parser.add_argument('-p', '--pages', dest='pages', type=tuple,
help="Enter the pages to consider e.g.: [2,4]")
path = parser.parse_known_args()[0].input_path
if os.path.isfile(path):
parser.add_argument('-o', '--output_file', dest='output_file',
type=str, help="Enter a valid output file")
if os.path.isdir(path):
parser.add_argument('-r', '--recursive', dest='recursive', default=False, type=lambda x: (
str(x).lower() in ['true', '1', 'yes']), help="Process Recursively or Non-Recursively")
# To Porse The Command Line Arguments
args = vars(parser.parse_args())
# To Display The Command Line Arguments
print("## Command Arguments #################################################")
print("\n".join("{}:{}".format(i, j) for i, j in args.items()))
print("######################################################################")
return args
Below are the arguments defined:
input_path
: A required parameter to input the path of the file or the folder to process, this parameter is associated with the is_valid_path()
function previously defined.action
: The action to perform, which is either watermark
or unwatermark
the PDF file, default is to watermark.mode
: To specify the destination of the generated temporary file, whether to memory or hard disk.watermark_text
: The string to set as a watermark.pages
: The pages to watermark (e.g first page is [0]
, the second page and the fourth page is [1, 3]
, etc.). If not specified, then all pages.output_file
: The path of the output file.recursive
: Whether to process a folder recursively.Now that we have everything, let's write the main code to execute based on parameters passed:
if __name__ == '__main__':
# Parsing command line arguments entered by user
args = parse_args()
# If File Path
if os.path.isfile(args['input_path']):
# Extracting File Info
get_info(input_file=args['input_path'])
# Encrypting or Decrypting a File
watermark_unwatermark_file(
input_file=args['input_path'], wm_text=args['watermark_text'], action=args[
'action'], mode=args['mode'], output_file=args['output_file'], pages=args['pages']
)
# If Folder Path
elif os.path.isdir(args['input_path']):
# Encrypting or Decrypting a Folder
watermark_unwatermark_folder(
input_folder=args['input_path'], wm_text=args['watermark_text'],
action=args['action'], mode=args['mode'], recursive=args['recursive'], pages=args['pages']
)
Related: How to Extract Images from PDF in Python.
Now let's test our program, if you open up a terminal window and type:
$ python pdf_watermarker.py --help
It'll show the parameters defined and their respective description:
usage: pdf_watermarker.py [-h] -i INPUT_PATH [-a {watermark,unwatermark}] [-m {RAM,HDD}] -w WATERMARK_TEXT [-p PAGES]
Available Options
optional arguments:
-h, --help show this help message and exit
-i INPUT_PATH, --input_path INPUT_PATH
Enter the path of the file or the folder to process
-a {watermark,unwatermark}, --action {watermark,unwatermark}
Choose whether to watermark or to unwatermark
-m {RAM,HDD}, --mode {RAM,HDD}
Choose whether to process on the hard disk drive or in memory
-w WATERMARK_TEXT, --watermark_text WATERMARK_TEXT
Enter a valid watermark text
-p PAGES, --pages PAGES
Enter the pages to consider e.g.: [2,4]
Now let's watermark my CV as an example:
$ python pdf_watermarker.py -a watermark -i "CV Bassem Marji_Eng.pdf" -w "CONFIDENTIAL" -o CV_watermark.pdf
The following output will be produced:
## Command Arguments #################################################
input_path:CV Bassem Marji_Eng.pdf
action:watermark
mode:RAM
watermark_text:CONFIDENTIAL
pages:None
output_file:CV_watermark.pdf
######################################################################
## File Information ##################################################
File:CV Bassem Marji_Eng.pdf
Encrypted:False
Author:TMS User
Creator:Microsoft® Word 2013
Producer:Microsoft® Word 2013
Subject:None
Title:None
Number of pages:1
######################################################################
And that's how the output CV_watermark.pdf
file looks like:
Now let's remove the watermark added:
$ python pdf_watermarker.py -a unwatermark -i "CV_watermark.pdf" -w "CONFIDENTIAL" -o CV.pdf
This time the watermark will be removed and saved into a new PDF file CV.pdf
.
You can also set the -m
and -p
for mode and pages respectively. You can also watermark a list of PDF files located under a specific path:
$ python pdf_watermarker.py -i "C:\Scripts\Test" -a "watermark" -w "CONFIDENTIAL" -r False
Or removing the watermark from a bunch of PDF files:
$ python pdf_watermarker.py -i "C:\Scripts\Test" -a "unwatermark" -w "CONFIDENTIAL" -m HDD -p[0] -r False
In this tutorial, you've learned how to add and remove watermarks from PDF files using reportlab and PyPDF4 libraries in Python. I hope this article helped you assimilate this cool feature.
Here are some other PDF-related tutorials:
For a complete list of PDF tutorials, check this page. Finally, unlock the secrets of Python PDF manipulation! Our compelling Practical Python PDF Processing eBook offers exclusive, in-depth guidance you won't find anywhere else. If you're passionate about enriching your skill set and mastering the intricacies of PDF handling with Python, your journey begins with a single click right here. Let's explore together!
Learn also: How to Extract All PDF Links in Python.
Happy coding ♥
Why juggle between languages when you can convert? Check out our Code Converter. Try it out today!
View Full Code Fix My Code
Got a coding query or need some guidance before you comment? Check out this Python Code Assistant for expert advice and handy tips. It's like having a coding tutor right in your fingertips!