Struggling with multiple programming languages? No worries. Our Code Converter has got you covered. Give it a go!
Do you want to extract the URLs that are in a specific PDF file ? If so, you're in the right place. In this tutorial, we will use pikepdf and PyMuPDF libraries in Python to extract all links from PDF files.
We will be using two methods to get links from a particular PDF file, the first is extracting annotations, which are markups, notes and comments, that you can actually click on your regular PDF reader and redirects to your browser, whereas the second is extracting all raw text and using regular expressions to parse URLs.
To get started, let's install these libraries:
pip3 install pikepdf PyMuPDF
Get: Practical Python PDF Processing EBook.
In this technique, we will use the pikepdf library to open a PDF file, iterate over all annotations of each page and see if there is a URL there:
import pikepdf # pip3 install pikepdf
file = "1810.04805.pdf"
# file = "1710.05006.pdf"
pdf_file = pikepdf.Pdf.open(file)
urls = []
# iterate over PDF pages
for page in pdf_file.pages:
for annots in page.get("/Annots"):
uri = annots.get("/A").get("/URI")
if uri is not None:
print("[+] URL Found:", uri)
urls.append(uri)
print("[*] Total URLs extracted:", len(urls))
I'm testing on this PDF file, but feel free to use any PDF file of your choice, just make sure it has some clickable links.
After running that code, I get this output:
[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://gluebenchmark.com/faq
[+] URL Found: https://gluebenchmark.com/leaderboard
...<SNIPPED>...
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 30
Awesome, we have successfully extracted 30 URLs from that PDF paper.
Master PDF Manipulation with Python by building PDF tools from scratch. Get your copy now!
Download EBookIn this section, we will extract all raw text from our PDF file and then we use regular expressions to parse URLs. First, let's get the text version of the PDF:
import fitz # pip install PyMuPDF
import re
# a regular expression of URLs
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=\n]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
# extract raw text from pdf
file = "1710.05006.pdf"
# file = "1810.04805.pdf"
# open the PDF file
with fitz.open(file) as pdf:
text = ""
for page in pdf:
# extract text of each PDF page
text += page.getText()
Now text
is the target string we want to parse URLs, let's use re module to parse them:
urls = []
# extract all urls using the regular expression
for match in re.finditer(url_regex, text):
url = match.group()
print("[+] URL Found:", url)
urls.append(url)
print("[*] Total URLs extracted:", len(urls))
Output:
[+] URL Found: https://github.com/
[+] URL Found: https://github.com/tensor
[+] URL Found: http://nlp.seas.harvard.edu/2018/04/03/attention.html
[+] URL Found: https://gluebenchmark.com/faq.
[+] URL Found: https://gluebenchmark.com/leaderboard).
[+] URL Found: https://gluebenchmark.com/leaderboard
[+] URL Found: https://cloudplatform.googleblog.com/2018/06/Cloud-
[+] URL Found: https://gluebenchmark.com/
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 9
This time we only extract 9 URLs from that same PDF file, now this doesn't mean the second method is not accurate. This method parses only URLs that are in text form (not clickable).
However, there is a problem with this method, as URLs may contain newlines (\n
), so you may want to allow that in url_regex
expression.
So to conclude, if you want to get URLs that are clickable, you may want to use the first method, which is preferable. But if you want to get URLs that are in text form, the second may help you do that!
If you want to extract tables or images from PDF, there are tutorials for that:
Finally, for more PDF handling guides on Python, you can check our Practical Python PDF Processing EBook, where we dive deeper into PDF document manipulation with Python, make sure to check it out here if you're interested!
Learn also: How to Make an Email Extractor in Python.
Happy Coding ♥
Why juggle between languages when you can convert? Check out our Code Converter. Try it out today!
View Full Code Transform My Code
Got a coding query or need some guidance before you comment? Check out this Python Code Assistant for expert advice and handy tips. It's like having a coding tutor right in your fingertips!