Have you ever wanted to download all the images on a certain web page? In this tutorial, you will learn how to build a Python scraper that retrieves all images from a web page given its URL and downloads them using the requests and BeautifulSoup libraries.
To get started, let's install the dependencies:
pip3 install requests bs4 tqdm
Open up a new Python file and import necessary modules:
import requests
import os
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse
First, let's make a URL validator that makes sure the passed URL is a valid one. Some websites put encoded data in place of a URL, so we need to skip those:
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
The urlparse() function parses a URL into six components; we just need to check that the netloc (domain name) and scheme (protocol) are present.
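To see the idea in action, here is is_valid() applied to a few sample inputs (the URLs below are made up for illustration):

```python
from urllib.parse import urlparse

def is_valid(url):
    """Checks whether `url` is a valid URL."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

print(is_valid("https://example.com/img.png"))     # True: has scheme and netloc
print(is_valid("data:image/png;base64,iVBORw0K"))  # False: no netloc (encoded data)
print(is_valid("/static/logo.png"))                # False: no scheme
```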
Second, I'm going to write the core function that grabs all image URLs of a web page:
def get_all_images(url):
    """
    Returns all image URLs on a single `url`
    """
    soup = bs(requests.get(url).content, "html.parser")
The HTML content of the web page is now in the soup object. To extract all img tags from the HTML, we use the soup.find_all("img") method; let's see it in action:
    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src")
        if not img_url:
            # if img does not contain src attribute, just skip
            continue
find_all("img") retrieves all img elements as a Python list; I've wrapped it in a tqdm object just to print a progress bar. To grab the URL of an img tag, we read its src attribute. However, some tags do not contain a src attribute, so we skip those with the continue statement above.
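As a quick illustration, here is find_all("img") run on a small hand-written HTML snippet (the snippet is made up, not fetched from a real page):

```python
from bs4 import BeautifulSoup as bs

html = """
<html><body>
  <img src="/logo.png">
  <img data-src="lazy.png">
  <img src="photo.jpg?w=300">
</body></html>
"""
soup = bs(html, "html.parser")
# collect the src attribute of each img tag (None when it is missing)
srcs = [img.attrs.get("src") for img in soup.find_all("img")]
print(srcs)  # ['/logo.png', None, 'photo.jpg?w=300']
```

The second tag uses data-src (a common lazy-loading pattern) and has no src, which is exactly the case the continue statement above skips.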
Now we need to make sure that the URL is absolute:
        # make the URL absolute by joining the domain with the URL that was just extracted
        img_url = urljoin(url, img_url)
Some URLs contain HTTP GET key-value pairs that we don't want (they end with something like "/image.png?c=3.2.5"); let's remove them:
        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass
We get the position of the '?' character and remove everything after it. If there is no '?', str.index() raises a ValueError, which is why the code is wrapped in a try/except block (of course, you can implement this in a better way; if so, please share it with us in the comments below).
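One cleaner alternative (my suggestion, not part of the original tutorial's code) is to let urllib.parse split and rebuild the URL, dropping the query string and fragment explicitly:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(img_url):
    """Return `img_url` without its query string and fragment."""
    parts = urlsplit(img_url)
    # rebuild the URL from scheme, netloc, and path only
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(strip_query("https://example.com/image.png?c=3.2.5"))
# https://example.com/image.png
```

This avoids the exception handling entirely, since urlsplit() simply returns empty components when no query is present.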
Now let's make sure that every URL is valid, and return all the image URLs:
        # finally, if the url is valid
        if is_valid(img_url):
            urls.append(img_url)
    return urls
Now that we have a function that grabs all image URLs, we need a function to download files from the web with Python. I brought the following function from this tutorial:
def download(url, pathname):
    """
    Downloads a file given an URL and puts it in the folder `pathname`
    """
    # if path doesn't exist, make that path dir
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of the response by chunk, not immediately
    response = requests.get(url, stream=True)
    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))
    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])
    # progress bar, changing the unit to bytes instead of iterations (tqdm's default)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        for data in progress.iterable:
            # write the data read to the file
            f.write(data)
            # update the progress bar manually
            progress.update(len(data))
The above function takes the URL of the file to download and the pathname of the folder to save it into.
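In particular, the file name is derived from the last path segment of the URL. Here is that line in isolation, on a made-up URL:

```python
import os

url = "https://example.com/images/photo.png"
pathname = "downloads"
# take everything after the last "/" as the file name
filename = os.path.join(pathname, url.split("/")[-1])
print(filename)  # downloads/photo.png (path separator depends on the OS)
```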
Related: How to Convert HTML Tables into CSV Files in Python.
Finally, here is the main function:
def main(url, path):
    # get all images
    imgs = get_all_images(url)
    for img in imgs:
        # for each image, download it
        download(img, path)
This gets all the image URLs from the page and downloads each of them one by one. Let's test it:
main("https://yandex.com/images/", "yandex-images")
This will download all images from that URL and store them in the "yandex-images" folder, which will be created automatically.
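If you want to run the scraper from the command line instead of hard-coding the URL, one option (my addition, not part of the original tutorial) is a small argparse wrapper around main():

```python
import argparse

def build_parser():
    """Builds a hypothetical command-line interface for the scraper."""
    parser = argparse.ArgumentParser(description="Download all images on a web page.")
    parser.add_argument("url", help="the URL of the page to scrape images from")
    parser.add_argument("-p", "--path", default="images", help="folder to save the images in")
    return parser

# parse a sample command line; in the real script you'd call parse_args() with no arguments
args = build_parser().parse_args(["https://yandex.com/images/", "-p", "yandex-images"])
print(args.url, args.path)  # https://yandex.com/images/ yandex-images
```

You would then end the script with main(args.url, args.path).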
Note, though, that some websites load their data using JavaScript. In that case, you should use the requests_html library instead; I've already made another script that makes some tweaks to the original one and handles JavaScript rendering, check it here.
Alright, we're done! Here are some ideas you can implement to extend your code:
Finally, if you want to dig more into web scraping with different Python libraries, not just BeautifulSoup, the below courses will definitely be valuable for you:
Learn Also: How to Make an Email Extractor in Python.
Happy Scraping ♥