How to Extract All PDF Links in Python

Learn how you can extract links and URLs from PDF files with Python using pikepdf and PyMuPDF libraries.
  · 3 min read · Updated jun 2023 · PDF File Handling

Kickstart your coding journey with our Python Code Assistant. An AI-powered assistant that's always ready to help. Don't miss out!

Do you want to extract the URLs that are in a specific PDF file ? If so, you're in the right place. In this tutorial, we will use pikepdf and PyMuPDF libraries in Python to extract all links from PDF files.

We will be using two methods to get links from a particular PDF file, the first is extracting annotations, which are markups, notes and comments, that you can actually click on your regular PDF reader and redirects to your browser, whereas the second is extracting all raw text and using regular expressions to parse URLs.

To get started, let's install these libraries:

pip3 install pikepdf PyMuPDF

Get: Practical Python PDF Processing EBook.

Method 1: Extracting URLs using Annotations

In this technique, we will use the pikepdf library to open a PDF file, iterate over all annotations of each page and see if there is a URL there:

import pikepdf # pip3 install pikepdf

file = "1810.04805.pdf"
# file = "1710.05006.pdf"
pdf_file = pikepdf.Pdf.open(file)
urls = []
# iterate over PDF pages
for page in pdf_file.pages:
    for annots in page.get("/Annots"):
        uri = annots.get("/A").get("/URI")
        if uri is not None:
            print("[+] URL Found:", uri)
            urls.append(uri)

print("[*] Total URLs extracted:", len(urls))

I'm testing on this PDF file, but feel free to use any PDF file of your choice, just make sure it has some clickable links.

After running that code, I get this output:

[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://gluebenchmark.com/faq
[+] URL Found: https://gluebenchmark.com/leaderboard
...<SNIPPED>...
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 30

Awesome, we have successfully extracted 30 URLs from that PDF paper.

Get Our Practical Python PDF Processing EBook

Master PDF Manipulation with Python by building PDF tools from scratch. Get your copy now!

Download EBook

Method 2: Extracting URLs using Regular Expressions

In this section, we will extract all raw text from our PDF file and then we use regular expressions to parse URLs. First, let's get the text version of the PDF:

import fitz # pip install PyMuPDF
import re

# a regular expression of URLs
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=\n]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
# extract raw text from pdf
file = "1710.05006.pdf"
# file = "1810.04805.pdf"
# open the PDF file
with fitz.open(file) as pdf:
    text = ""
    for page in pdf:
        # extract text of each PDF page
        text += page.getText()

Now text is the target string we want to parse URLs, let's use re module to parse them:

urls = []
# extract all urls using the regular expression
for match in re.finditer(url_regex, text):
    url = match.group()
    print("[+] URL Found:", url)
    urls.append(url)
print("[*] Total URLs extracted:", len(urls))

Output:

[+] URL Found: https://github.com/
[+] URL Found: https://github.com/tensor
[+] URL Found: http://nlp.seas.harvard.edu/2018/04/03/attention.html
[+] URL Found: https://gluebenchmark.com/faq.
[+] URL Found: https://gluebenchmark.com/leaderboard).
[+] URL Found: https://gluebenchmark.com/leaderboard
[+] URL Found: https://cloudplatform.googleblog.com/2018/06/Cloud-
[+] URL Found: https://gluebenchmark.com/
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 9

Conclusion

This time we only extract 9 URLs from that same PDF file, now this doesn't mean the second method is not accurate. This method parses only URLs that are in text form (not clickable).

However, there is a problem with this method, as URLs may contain newlines (\n), so you may want to allow that in url_regex expression.

So to conclude, if you want to get URLs that are clickable, you may want to use the first method, which is preferable. But if you want to get URLs that are in text form, the second may help you do that!

If you want to extract tables or images from PDF, there are tutorials for that:

Finally, for more PDF handling guides on Python, you can check our Practical Python PDF Processing EBook, where we dive deeper into PDF document manipulation with Python, make sure to check it out here if you're interested!

Learn alsoHow to Make an Email Extractor in Python.

Happy Coding ♥

Loved the article? You'll love our Code Converter even more! It's your secret weapon for effortless coding. Give it a whirl!

View Full Code Convert My Code
Sharing is caring!



Read Also



Comment panel

    Got a coding query or need some guidance before you comment? Check out this Python Code Assistant for expert advice and handy tips. It's like having a coding tutor right in your fingertips!