How to Extract PDF Metadata in Python

Learn how to use the pikepdf library to extract useful information from PDF files in Python.
4 min read · Updated Jun 2023 · PDF File Handling

PDF metadata is descriptive information about the document: its title, author, creation date, last modification date, subject, and more. Some PDF files carry more of this information than others. In this tutorial, you will learn how to extract PDF metadata in Python.

There are many libraries and utilities in Python that can do this, but I like using pikepdf because it is actively maintained. Let's install it:

$ pip install pikepdf

Pikepdf is a Pythonic wrapper around the C++ QPDF library. Let's import it in our script:

import pikepdf
import sys

We'll also use the sys module to get the filename from the command-line arguments:

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
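
Reading sys.argv[1] directly raises an IndexError when the script is run without an argument. A minimal sketch of a friendlier guard is shown below; the helper name get_pdf_filename and the usage message are illustrative assumptions, not part of the original script:

```python
import sys


def get_pdf_filename(argv):
    """Return the first command-line argument, or exit with a usage message.

    Hypothetical helper: the name and the message text are illustrative.
    """
    if len(argv) < 2:
        script = argv[0] if argv else "script.py"
        # SystemExit with a string prints it to stderr and exits with status 1
        raise SystemExit(f"usage: python {script} <file.pdf>")
    return argv[1]


# In the script, this would replace the direct sys.argv[1] lookup:
# pdf_filename = get_pdf_filename(sys.argv)
```

With this in place, running the script without a filename prints a short usage line instead of a traceback.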

Let's load the PDF file using the library, and get the metadata:

# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    print(key, ":", value)

The docinfo attribute contains a dictionary of the document's metadata. Here is an example execution:

$ python extract_pdf_metadata_simple.py bert-paper.pdf

Output:

/Author : 
/CreationDate : D:20190528000751Z
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : D:20190528000751Z
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Related: How to Split PDF Files in Python.

Here is another PDF file:

$ python extract_pdf_metadata_simple.py python_cheat_sheet.pdf

Output:

/CreationDate : D:20201002181301Z
/Creator : wkhtmltopdf 0.12.5
/Producer : Qt 4.8.7
/Title : Markdown To PDF

As you can see, not all documents have the same fields, and some contain much less information than others.
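
Because fields can be missing or empty, it can help to look up a fixed set of keys and fall back to a placeholder. The sketch below assumes docinfo supports a dict-style get(key, default) call; a plain dict with the values from the second example stands in for pdf.docinfo here, and the names EXPECTED_FIELDS and summarize_metadata are made up for illustration:

```python
# Keys we would like to report on (an illustrative selection, not an exhaustive list)
EXPECTED_FIELDS = ("/Title", "/Author", "/Subject", "/Keywords",
                   "/CreationDate", "/ModDate")


def summarize_metadata(docinfo, fields=EXPECTED_FIELDS, placeholder="(not set)"):
    """Return a plain dict of the requested fields, marking missing or empty ones."""
    summary = {}
    for key in fields:
        # str() so that non-string metadata objects are handled uniformly
        value = str(docinfo.get(key, "")).strip()
        # drop the leading slash from the PDF name for readability
        summary[key.lstrip("/")] = value or placeholder
    return summary


# Stand-in for pdf.docinfo, using the values from the second example above
sample = {
    "/CreationDate": "D:20201002181301Z",
    "/Creator": "wkhtmltopdf 0.12.5",
    "/Producer": "Qt 4.8.7",
    "/Title": "Markdown To PDF",
}

for name, value in summarize_metadata(sample).items():
    print(name, ":", value)
```

Fields outside the requested set, such as /Creator here, are simply not reported, while absent ones show up with the placeholder.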

Notice that /ModDate and /CreationDate are the last modification date and creation date, respectively, in the PDF datetime format. To convert this format into Python datetime objects, I copied this code from StackOverflow and edited it a little to run on Python 3:

import pikepdf
import datetime
import re
from dateutil.tz import tzutc, tzoffset
import sys

pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[+\-zZ])?",  # escape "-" so it is a literal, not the range "+" to "z"
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"]))

def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object
    """
    match = pdf_date_pattern.match(date_str)
    if match:
        date_info = match.groupdict()

        for k, v in date_info.items():  # transform values
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)

        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
            # tz_hour / tz_minute are None when the string omits them, so fall back to 0
            date_info['tzinfo'] = tzoffset(None, multiplier*(3600 * (date_info['tz_hour'] or 0) + 60 * (date_info['tz_minute'] or 0)))

        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]

        return datetime.datetime(**date_info)

# get the target pdf file from the command-line arguments
pdf_filename = sys.argv[1]
# read the pdf file
pdf = pikepdf.Pdf.open(pdf_filename)
docinfo = pdf.docinfo
for key, value in docinfo.items():
    if str(value).startswith("D:"):
        # pdf datetime format, convert to python datetime
        value = transform_date(str(value))
    print(key, ":", value)

Here is the same output as before, but with the dates converted to Python datetime objects:

/Author : 
/CreationDate : 2019-05-28 00:07:51+00:00
/Creator : LaTeX with hyperref package
/Keywords :
/ModDate : 2019-05-28 00:07:51+00:00
/PTEX.Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.17 (TeX Live 2016) kpathsea version 6.2.2
/Producer : pdfTeX-1.40.17
/Subject :
/Title :
/Trapped : /False

Much better. I hope this quick tutorial helped you to get the metadata of PDF documents with Python.
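
As a side note, the date conversion can also be sketched with the standard library alone, swapping dateutil for datetime.timezone. This is an alternative under stated assumptions rather than the code used above: the function name parse_pdf_date is made up, only full D:YYYYMMDDHHmmSS timestamps are handled, and a missing timezone is treated as UTC, as in the version above.

```python
import re
from datetime import datetime, timedelta, timezone

# "D:" prefix (optional), 14 digits of date and time, then an optional
# timezone: Z, or a sign followed by HH and optionally 'mm'
PDF_DATE = re.compile(
    r"(?:D:)?(?P<stamp>\d{14})"
    r"(?:(?P<sign>[+\-Zz])(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?"
)


def parse_pdf_date(date_str):
    """Convert a PDF date string to an aware datetime, or return None if it does not match."""
    match = PDF_DATE.match(date_str)
    if match is None:
        return None
    naive = datetime.strptime(match["stamp"], "%Y%m%d%H%M%S")
    sign = match["sign"]
    if sign in (None, "Z", "z"):
        # assumption: no timezone information means UTC
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(match["tz_hour"] or 0),
                           minutes=int(match["tz_minute"] or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return naive.replace(tzinfo=tz)


print(parse_pdf_date("D:20190528000751Z"))  # 2019-05-28 00:07:51+00:00
```

Returning None for strings that do not match leaves it to the caller to decide whether to print the raw value instead.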

Check the complete code here.

Learn also: How to Extract Image Metadata in Python

Happy coding ♥
