In this tutorial, we will demonstrate how to extract images from PDF files and save them on the local disk using Python, along with the PyMuPDF and Pillow libraries.
PyMuPDF is a versatile library for working with PDF, XPS, OpenXPS, EPUB, and several other document formats, while Pillow is an open-source Python imaging library that adds image processing capabilities to your Python interpreter.
To follow along with this tutorial, you will need:
First, we need to install the PyMuPDF and Pillow libraries. Open your terminal or command prompt and run the following command:
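Both packages are published on PyPI under the names PyMuPDF and Pillow, so the usual install command is pip install PyMuPDF Pillow. As a quick sanity check after installing, here is a small sketch that reports whether the two modules can be found. The helper name missing_packages is made up for this illustration.

```python
import importlib.util

# Maps the pip package name to the module name it installs.
REQUIRED = {"PyMuPDF": "fitz", "Pillow": "PIL"}


def missing_packages(required):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("PyMuPDF and Pillow are both importable.")
```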
Create a new Python file named pdf_image_extractor.py and import the necessary libraries. Also, define the output directory, output image format, and minimum dimensions for the extracted images:
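A minimal sketch of the settings described above. The folder name extracted_images matches the one used later in this tutorial; the output format and the minimum dimensions are assumed values, so adjust them to your needs. The PyMuPDF and Pillow imports are left to the steps that actually use them, which keeps this snippet runnable on its own.

```python
import os

# Folder where the extracted images will be written.
output_dir = "extracted_images"
# Assumed output format; any format Pillow can write should work here.
output_format = "png"
# Assumed thresholds: images smaller than this in either dimension are skipped.
min_width = 100
min_height = 100

# Create the output folder if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)
```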
I'm gonna test this with this PDF file, but you're free to bring any PDF file and put it in your current working directory. Let's load it with PyMuPDF:
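Loading the document could look like the sketch below. PyMuPDF is imported under the name fitz, and fitz.open() returns a document object. The open_pdf wrapper and the file name in the usage comment are hypothetical; the wrapper only adds a friendlier error when the path is wrong.

```python
from pathlib import Path


def open_pdf(path):
    """Open a PDF with PyMuPDF after checking that the file exists."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")
    import fitz  # PyMuPDF; imported here so the path check runs first
    return fitz.open(str(pdf_path))


# Usage (hypothetical file name):
# pdf_file = open_pdf("example.pdf")
# print(len(pdf_file), "pages")
```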
Since we want to extract images from all pages, we need to iterate over every page and get all the image objects on each one. The following code does that:
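As a rough sketch of that iteration, the function below walks over an already opened PyMuPDF document and yields one record per image. It relies on each entry from get_images(full=True) being a tuple whose first item is the image xref and whose third and fourth items are the width and height. The function name list_images is made up for this example.

```python
def list_images(pdf_file):
    """Yield (page_number, xref, width, height) for every image in the document."""
    for page_index in range(len(pdf_file)):
        page = pdf_file[page_index]
        for entry in page.get_images(full=True):
            # entry[0] is the xref; entry[2] and entry[3] are width and height.
            xref, width, height = entry[0], entry[2], entry[3]
            yield page_index + 1, xref, width, height


# Usage, assuming pdf_file was opened with fitz.open(...):
# for page_number, xref, width, height in list_images(pdf_file):
#     print(f"page {page_number}: xref={xref}, {width}x{height}")
```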
Here, we use the get_images(full=True) method to list all available image objects on a particular page. Then, we loop through the images, check if they meet the minimum dimensions, and save them using the specified output format in the output directory.
We use the extract_image() method, which returns the image bytes along with additional information, such as the image extension.
So, we convert the image bytes to a PIL image instance and save it to the local disk using the save() method, which accepts a file pointer as an argument; we're simply naming the images with their corresponding page and image indices.
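Putting the steps above together, the whole loop might look like the following sketch. It is not necessarily the exact code of the original script: the function names, default values, and the image{page}_{index} naming pattern are assumptions made for illustration. Note that save() also accepts a plain file path, which is what this sketch passes instead of a file pointer.

```python
import io
import os


def is_large_enough(width, height, min_width=100, min_height=100):
    """True when both dimensions reach the (assumed) minimums."""
    return width >= min_width and height >= min_height


def image_path(output_dir, page_number, image_number, output_format):
    """Build a path such as extracted_images/image3_1.png."""
    return os.path.join(output_dir, f"image{page_number}_{image_number}.{output_format}")


def extract_images(pdf_path, output_dir="extracted_images", output_format="png",
                   min_width=100, min_height=100):
    """Save every sufficiently large image in the PDF and return the saved paths."""
    import fitz  # PyMuPDF
    from PIL import Image

    os.makedirs(output_dir, exist_ok=True)
    saved = []
    pdf_file = fitz.open(pdf_path)
    for page_index in range(len(pdf_file)):
        page = pdf_file[page_index]
        for image_index, img in enumerate(page.get_images(full=True), start=1):
            xref = img[0]  # first item of each entry is the image xref
            base_image = pdf_file.extract_image(xref)
            # "image" holds the raw bytes; wrap them so Pillow can read them.
            image = Image.open(io.BytesIO(base_image["image"]))
            if not is_large_enough(image.width, image.height, min_width, min_height):
                continue
            path = image_path(output_dir, page_index + 1, image_index, output_format)
            # Pillow picks the encoder from the file extension.
            image.save(path)
            saved.append(path)
    pdf_file.close()
    return saved


# Usage (hypothetical file name):
# for path in extract_images("example.pdf"):
#     print("saved", path)
```

One caveat: some embedded images use color modes that the target format cannot store (CMYK to PNG, for example), in which case save() raises an error. Calling image.convert("RGB") before saving is a common workaround.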
Now, save the script and run it using the following command:
I got the following output:
The extracted images that meet the minimum dimensions will be saved in the specified output directory with their corresponding page and image indices, using the desired output format.
The images are saved in the extracted_images folder, as specified:
In this tutorial, we have successfully demonstrated how to extract images from PDF files using Python, PyMuPDF, and Pillow libraries. This technique can be extended to work with various file formats and customized to fit your specific requirements.
For more information on how the libraries work, refer to the official documentation:
I have also used the argparse module to create a quick CLI version of the script; you can get both versions of the code here.
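The linked CLI version is not reproduced here, but as a rough idea of how argparse could wrap the script, consider the sketch below. All flag names and defaults are hypothetical and may differ from the linked code.

```python
import argparse


def build_parser():
    """Build a command-line parser with hypothetical option names."""
    parser = argparse.ArgumentParser(description="Extract images from a PDF file.")
    parser.add_argument("file", help="path to the PDF file")
    parser.add_argument("-o", "--output-dir", default="extracted_images",
                        help="folder to save the images in")
    parser.add_argument("-f", "--format", default="png",
                        help="output image format")
    parser.add_argument("--min-width", type=int, default=100,
                        help="skip images narrower than this many pixels")
    parser.add_argument("--min-height", type=int, default=100,
                        help="skip images shorter than this many pixels")
    return parser


# Parsing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["report.pdf", "--min-width", "200"])
print(args.file, args.output_dir, args.format, args.min_width, args.min_height)
# prints: report.pdf extracted_images png 200 100
```

In a real script you would call build_parser().parse_args() with no arguments so the values come from the command line, then pass them on to the extraction code.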
Finally, for more PDF handling guides on Python, you can check our Practical Python PDF Processing EBook, where we dive deeper into PDF document manipulation with Python. Make sure to check it out here if you're interested!
Happy coding ♥