Struggling with multiple programming languages? No worries. Our Code Converter has got you covered. Give it a go!
In this tutorial, we will demonstrate how to extract images from PDF files and save them on the local disk using Python, along with the PyMuPDF and Pillow libraries.
PyMuPDF is a versatile library that allows you to access PDF, XPS, OpenXPS, epub, and various other file extensions, while Pillow is an open-source Python imaging library that adds image processing capabilities to your Python interpreter.
To follow along with this tutorial, you will need:
Download: Practical Python PDF Processing EBook.
First, we need to install the PyMuPDF and Pillow libraries. Open your terminal or command prompt and run the following command:
pip3 install PyMuPDF Pillow
Create a new Python file named pdf_image_extractor.py and import the necessary libraries. Also, define the output directory, output image format, and minimum dimensions for the extracted images:
import os
import fitz # PyMuPDF
import io
from PIL import Image
# Output directory for the extracted images
output_dir = "extracted_images"
# Desired output image format
output_format = "png"
# Minimum width and height for extracted images
min_width = 100
min_height = 100
# Create the output directory if it does not exist
if not os.path.exists(output_dir):
os.makedirs(output_dir)
Master PDF Manipulation with Python by building PDF tools from scratch. Get your copy now!
Download EBookI'm gonna test this with this PDF file, but you're free to bring and PDF file and put it in your current working directory, let's load it to the library:
# file path you want to extract images from
file = "1710.05006.pdf"
# open the file
pdf_file = fitz.open(file)
Since we want to extract images from all pages, we need to iterate over all the pages available and get all image objects on each page, the following code does that:
# Iterate over PDF pages
for page_index in range(len(pdf_file)):
# Get the page itself
page = pdf_file[page_index]
# Get image list
image_list = page.get_images(full=True)
# Print the number of images found on this page
if image_list:
print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
else:
print(f"[!] No images found on page {page_index}")
# Iterate over the images on the page
for image_index, img in enumerate(image_list, start=1):
# Get the XREF of the image
xref = img[0]
# Extract the image bytes
base_image = pdf_file.extract_image(xref)
image_bytes = base_image["image"]
# Get the image extension
image_ext = base_image["ext"]
# Load it to PIL
image = Image.open(io.BytesIO(image_bytes))
# Check if the image meets the minimum dimensions and save it
if image.width >= min_width and image.height >= min_height:
image.save(
open(os.path.join(output_dir, f"image{page_index + 1}_{image_index}.{output_format}"), "wb"),
format=output_format.upper())
else:
print(f"[-] Skipping image {image_index} on page {page_index} due to its small size.")
Related: How to Convert PDF to Images in Python.
In this code snippet, we use the get_images(full=True)
method to list all available image objects on a particular page. Then, we loop through the images, check if they meet the minimum dimensions, and save them using the specified output format in the output directory.
We use the extract_image()
method that returns the image in bytes and additional information, such as the image extension.
So, we convert the image bytes to a PIL image instance and save it to the local disk using the save()
method which accepts a file pointer as an argument; we're simply naming the images with their corresponding page and image indices.
Now, save the script and run it using the following command:
$ python pdf_image_extractor.py
I got the following output:
[!] No images found on page 0
[+] Found a total of 3 images in page 1
[+] Found a total of 3 images in page 2
[!] No images found on page 3
[!] No images found on page 4
The extracted images that meet the minimum dimensions will be saved in the specified output directory with their corresponding page and image indices, using the desired output format.
The images are saved in the extracted_images
folder, as specified:
In this tutorial, we have successfully demonstrated how to extract images from PDF files using Python, PyMuPDF, and Pillow libraries. This technique can be extended to work with various file formats and customized to fit your specific requirements.
For more information on how the libraries work, refer to the official documentation:
I have used the argparse module to create a quick CLI script, you can get both versions of the code here.
Here are some PDF Related tutorials:
Finally, for more PDF handling guides on Python, you can check our Practical Python PDF Processing EBook, where we dive deeper into PDF document manipulation with Python, make sure to check it out here if you're interested!
Happy coding ♥
Let our Code Converter simplify your multi-language projects. It's like having a coding translator at your fingertips. Don't miss out!
View Full Code Build My Python Code
Got a coding query or need some guidance before you comment? Check out this Python Code Assistant for expert advice and handy tips. It's like having a coding tutor right in your fingertips!