Real-time Object Tracking with OpenCV and YOLOv8 in Python

Learn how to perform real-time object tracking with the DeepSORT algorithm and YOLOv8 using the OpenCV library in Python.
10 min read · Updated May 2023 · Machine Learning · Computer Vision

In this tutorial, we will learn how to perform object detection and tracking with YOLOv8 and DeepSORT.

We will use the Ultralytics implementation of YOLOv8, which is written in PyTorch. The YOLO model handles object detection, and the DeepSORT algorithm tracks the detected objects.

A tracker can help to identify the same object and assign it a unique ID from frame to frame even when the object detector fails to detect the object in some frames (e.g. when the object is occluded).

DeepSORT is a deep learning-based algorithm for object tracking that was introduced in 2017 in the paper Simple Online and Realtime Tracking with a Deep Association Metric by Nicolai Wojke, Alex Bewley, and Dietrich Paulus.

DeepSORT is based on the SORT algorithm, which combines a Kalman filter for motion prediction with the Hungarian algorithm for data association. DeepSORT improves upon SORT by incorporating a deep appearance descriptor to improve the matching of objects over time.
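To make the data association step a bit more concrete, here is a minimal sketch that matches predicted track boxes to new detections by IoU. This is not DeepSORT's actual code: the function names iou() and match_tracks() are made up for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal assignment, but only scales to a handful of boxes).

```python
from itertools import permutations


def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_tracks(track_boxes, detection_boxes):
    """Return the (track_index, detection_index) pairs that maximize the total IoU.

    Hypothetical helper: assumes len(track_boxes) <= len(detection_boxes).
    """
    best_pairs, best_score = [], -1.0
    # try every way of giving each track a distinct detection
    for perm in permutations(range(len(detection_boxes)), len(track_boxes)):
        score = sum(iou(track_boxes[t], detection_boxes[d]) for t, d in enumerate(perm))
        if score > best_score:
            best_pairs, best_score = list(enumerate(perm)), score
    return best_pairs


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
pairs = match_tracks(tracks, detections)
print(pairs)  # [(0, 1), (1, 0)]
```

Each track is paired with the detection that overlaps it, even though the detections arrive in a different order. DeepSORT additionally mixes an appearance distance into the cost, which this sketch leaves out.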

Installing the Python packages

In order to use YOLOv8 and DeepSORT, we need to install some Python packages.

The original DeepSORT implementation needs some changes before it can be used here, and we want to get started with object tracking quickly.

So in this tutorial I use deep-sort-realtime, an adaptation of DeepSORT packaged for real-time use.

Here are the commands to install the required Python packages:

$ pip install ultralytics # to use YOLOv8
$ pip install deep-sort-realtime

I assume that you have PyTorch and OpenCV installed on your system. If not, you can install them with the following commands:

$ pip install torch torchvision torchaudio
$ pip install opencv-python

With the packages installed, we can start coding.
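Before moving on, you may want to check that everything is importable. The snippet below is a small sketch using only the standard library; the missing_modules() helper is my own name, and the list contains the import names of the packages (for example cv2 for opencv-python), not the pip names.

```python
from importlib.util import find_spec

# import names (not pip names) of the packages used in this tutorial
REQUIRED_MODULES = ["ultralytics", "deep_sort_realtime", "torch", "cv2"]


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```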

Step 1: Object Detection with YOLOv8 and OpenCV

Before we start tracking objects, we first need to detect them. So in this step, we will use YOLOv8 to detect objects in the video frames.

Create a new Python file and name it object_tracking.py. Then, copy the following code into it:

import datetime
from ultralytics import YOLO
import cv2
from helper import create_video_writer


# define some constants
CONFIDENCE_THRESHOLD = 0.8
GREEN = (0, 255, 0)

# initialize the video capture object
video_cap = cv2.VideoCapture("2.mp4")
# initialize the video writer object
writer = create_video_writer(video_cap, "output.mp4")

# load the pre-trained YOLOv8n model
model = YOLO("yolov8n.pt")

First things first, we import the required packages. The create_video_writer() function is a helper that I wrote to simplify creating the video writer object, which we then use to save the output video.

The code for this function is in the helper.py file, so make sure to download the code for this tutorial to get access to that file as well.

Then, we define some constants that we will use later. We also initialize the video capture and video writer objects.

Next, we load the pre-trained YOLOv8n model. For testing purposes, we are using YOLOv8n, the smallest model in the YOLOv8 family; it is the fastest of the family but also the least accurate.

Now, we can start looping over the video frames:

while True:
    # start time to compute the fps
    start = datetime.datetime.now()

    ret, frame = video_cap.read()

    # if there are no more frames to process, break out of the loop
    if not ret:
        break

    # run the YOLO model on the frame
    detections = model(frame)[0]

In the while loop, we start by reading the next frame from the video capture object. When no frame can be read, ret is False and we break out of the loop; this is what ends the loop cleanly once the video reaches its last frame.

Then, we run the YOLOv8 model that we loaded earlier on the frame. Calling the model returns a list with one result per input image, so [0] selects the result for our single frame.

To get the detections in the form of:

[[xmin, ymin, xmax, ymax, confidence_score, class_id], ...]
# example:
[[835, 15, 1054, 612, 0.94, 0], [549, 260, 679, 623, 0.91, 0], [308, 370, 589, 629, 0.84, 13]]

we can call .boxes.data.tolist() on the result object:

    # loop over the detections
    for data in detections.boxes.data.tolist():
        # extract the confidence (i.e., probability) associated with the detection
        confidence = data[4]

        # filter out weak detections by ensuring the 
        # confidence is greater than the minimum confidence
        if float(confidence) < CONFIDENCE_THRESHOLD:
            continue

        # if the confidence is greater than the minimum confidence,
        # draw the bounding box on the frame
        xmin, ymin, xmax, ymax = int(data[0]), int(data[1]), int(data[2]), int(data[3])
        cv2.rectangle(frame, (xmin, ymin) , (xmax, ymax), GREEN, 2)

So here data is a list of the form:

[xmin, ymin, xmax, ymax, confidence_score, class_id]

So we loop over all the detections (detections.boxes.data.tolist()) and extract the confidence of each one. If the confidence is below the confidence threshold, we skip the detection.

Otherwise, we draw the bounding box on the frame.
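The filtering logic can be tried without a video or a model. Here is a small sketch that applies the same rule to the example rows shown earlier; filter_detections() is a hypothetical helper, not part of Ultralytics.

```python
def filter_detections(rows, threshold):
    """Keep the rows whose confidence (index 4) is at least `threshold`."""
    return [row for row in rows if float(row[4]) >= threshold]


# the example rows shown earlier: [xmin, ymin, xmax, ymax, confidence, class_id]
rows = [
    [835, 15, 1054, 612, 0.94, 0],
    [549, 260, 679, 623, 0.91, 0],
    [308, 370, 589, 629, 0.84, 13],
]

kept_at_08 = filter_detections(rows, 0.8)  # all three rows pass
kept_at_09 = filter_detections(rows, 0.9)  # only the first two rows pass
print(len(kept_at_08), len(kept_at_09))
```

A higher threshold gives fewer false positives but also drops real objects that the model is less sure about, so it is worth trying a few values on your own video.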

Finally, we can draw the fps on the frame and write the frame to the output video:

    # end time to compute the fps
    end = datetime.datetime.now()
    # show the time it took to process 1 frame
    total = (end - start).total_seconds()
    print(f"Time to process 1 frame: {total * 1000:.0f} milliseconds")

    # calculate the frame per second and draw it on the frame
    fps = f"FPS: {1 / total:.2f}"
    cv2.putText(frame, fps, (50, 50),
                cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 255), 8)

    # show the frame to our screen
    cv2.imshow("Frame", frame)
    writer.write(frame)
    if cv2.waitKey(1) == ord("q"):
        break

video_cap.release()
writer.release()
cv2.destroyAllWindows()

Here we are calculating the time it took to process 1 frame and then calculating the fps. We draw the fps on the frame and write the frame to the output video.

We also show the frame on our screen. cv2.waitKey(1) waits one millisecond for a key press, and pressing the q key breaks out of the loop.

The video below shows the result of the code above:

The YOLO model is detecting the two people in the video and the bench. I am getting around 15 frames per second on my laptop using the CPU. This is not bad at all.
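The last value in each detection row is a class id rather than a name. Ultralytics models expose the id-to-name mapping as model.names; to keep this sketch runnable without loading a model, I use a hand-written dictionary with only the two COCO entries that matter for this video. The label_for() helper is hypothetical.

```python
# two entries of the COCO label map used by the pre-trained YOLOv8 models;
# with a loaded model you would use `model.names` instead of this hand-written dict
CLASS_NAMES = {0: "person", 13: "bench"}


def label_for(row, names):
    """Build a text label such as 'person 0.94' from one detection row."""
    class_id = int(row[5])
    name = names.get(class_id, f"class {class_id}")
    return f"{name} {row[4]:.2f}"


row = [835, 15, 1054, 612, 0.94, 0.0]
print(label_for(row, CLASS_NAMES))  # person 0.94
```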

If you check your terminal you will see the time it took to process 1 frame. On my laptop, it takes around 60-70 milliseconds to process 1 frame. We will see how the tracking algorithm will affect the fps and the processing time in the next step.
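The two numbers are two views of the same measurement: the frame rate is simply one divided by the processing time per frame. The sketch below shows the conversion; fps_from_seconds() is a name I made up, and time.perf_counter() is used here as an alternative to datetime because it is a monotonic clock intended for measuring short durations.

```python
import time


def fps_from_seconds(seconds_per_frame):
    """Convert a per-frame processing time in seconds to frames per second."""
    # guard against a zero duration, which a coarse clock can occasionally report
    return 1.0 / seconds_per_frame if seconds_per_frame > 0 else float("inf")


# 60 to 70 ms per frame corresponds to roughly 14 to 17 fps
print(f"{fps_from_seconds(0.070):.1f} - {fps_from_seconds(0.060):.1f} fps")

# timing a stand-in workload with a monotonic clock
start = time.perf_counter()
sum(range(100_000))  # placeholder for the per-frame work
elapsed = time.perf_counter() - start
print(f"FPS: {fps_from_seconds(elapsed):.2f}")
```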

Let's move on now to the tracking part.

Step 2: Object Tracking with DeepSORT and OpenCV

We will build on the code we wrote in the previous step to add the tracking code.

Create a new file called object_detection_tracking.py and let's see how we can add the tracking code:

import datetime
from ultralytics import YOLO
import cv2
from helper import create_video_writer
from deep_sort_realtime.deepsort_tracker import DeepSort


CONFIDENCE_THRESHOLD = 0.8
GREEN = (0, 255, 0)
WHITE = (255, 255, 255)

# initialize the video capture object
video_cap = cv2.VideoCapture("2.mp4")
# initialize the video writer object
writer = create_video_writer(video_cap, "output.mp4")

# load the pre-trained YOLOv8n model
model = YOLO("yolov8n.pt")
tracker = DeepSort(max_age=50)

This code is similar to the code we wrote in the previous step. The differences are that we create the WHITE variable, import the DeepSort class from the deepsort_tracker module, and initialize the DeepSort object with the max_age parameter set to 50.

The max_age parameter sets how many consecutive frames a track can go without a matching detection before it is deleted. This is useful when the object is occluded for a few frames.
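To see what max_age does in practice, here is a toy model of a single track. It is a simplified stand-in that I wrote for illustration, not the class used by deep-sort-realtime, and the exact comparison the library uses is an assumption on my part.

```python
class ToyTrack:
    """Simplified stand-in for a tracker's track; not the library's class."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.frames_since_update = 0
        self.deleted = False

    def step(self, matched):
        """Advance one frame; `matched` says whether a detection was associated."""
        if matched:
            self.frames_since_update = 0
        else:
            self.frames_since_update += 1
            if self.frames_since_update > self.max_age:
                self.deleted = True


# an object that is occluded for 3 frames survives with max_age=5 ...
track = ToyTrack(max_age=5)
for matched in [True, False, False, False, True]:
    track.step(matched)
survived_short_occlusion = not track.deleted

# ... but a track that stays unmatched for more than max_age frames is dropped
for _ in range(6):
    track.step(False)
print(survived_short_occlusion, track.deleted)  # True True
```

A larger max_age keeps IDs stable through longer occlusions, at the cost of keeping stale tracks around longer after an object has really left the scene.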

Let's now write the main loop:

while True:
    start = datetime.datetime.now()

    ret, frame = video_cap.read()

    if not ret:
        break

    # run the YOLO model on the frame
    detections = model(frame)[0]

    # initialize the list of bounding boxes and confidences
    results = []

    ######################################
    # DETECTION
    ######################################

    # loop over the detections
    for data in detections.boxes.data.tolist():
        # extract the confidence (i.e., probability) associated with the prediction
        confidence = data[4]

        # filter out weak detections by ensuring the 
        # confidence is greater than the minimum confidence
        if float(confidence) < CONFIDENCE_THRESHOLD:
            continue

        # if the confidence is greater than the minimum confidence,
        # get the bounding box and the class id
        xmin, ymin, xmax, ymax = int(data[0]), int(data[1]), int(data[2]), int(data[3])
        class_id = int(data[5])
        # add the bounding box (x, y, w, h), confidence and class id to the results list
        results.append([[xmin, ymin, xmax - xmin, ymax - ymin], confidence, class_id])

In the detection part above, we use the same logic as in the previous step, but this time we convert each bounding box to (x, y, width, height) and add it, together with the confidence and class id, to the results list, because the tracker needs this information.
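The conversion is easy to get wrong, because YOLO gives corner coordinates while the tracker is fed a top-left corner plus a size. Here is the same step as a standalone function; to_tracker_input() is a hypothetical name, and the sample rows are made up.

```python
def to_tracker_input(rows, threshold):
    """Convert [xmin, ymin, xmax, ymax, conf, class_id] rows to
    ([left, top, width, height], conf, class_id) tuples, dropping weak rows."""
    results = []
    for xmin, ymin, xmax, ymax, confidence, class_id in rows:
        if confidence < threshold:
            continue
        left, top = int(xmin), int(ymin)
        width, height = int(xmax) - left, int(ymax) - top
        results.append(([left, top, width, height], confidence, int(class_id)))
    return results


rows = [[835.2, 15.7, 1054.9, 612.1, 0.94, 0.0], [10.0, 10.0, 20.0, 20.0, 0.30, 13.0]]
results = to_tracker_input(rows, 0.8)
print(results)  # [([835, 15, 219, 597], 0.94, 0)]
```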

Let's now start tracking the objects:

    ######################################
    # TRACKING
    ######################################

    # update the tracker with the new detections
    tracks = tracker.update_tracks(results, frame=frame)
    # loop over the tracks
    for track in tracks:
        # if the track is not confirmed, ignore it
        if not track.is_confirmed():
            continue

        # get the track id and the bounding box
        track_id = track.track_id
        ltrb = track.to_ltrb()

        xmin, ymin, xmax, ymax = int(ltrb[0]), int(ltrb[1]), int(ltrb[2]), int(ltrb[3])
        # draw the bounding box and the track id
        cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), GREEN, 2)
        cv2.rectangle(frame, (xmin, ymin - 20), (xmin + 20, ymin), GREEN, -1)
        cv2.putText(frame, str(track_id), (xmin + 5, ymin - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, WHITE, 2)

    # end time to compute the fps
    end = datetime.datetime.now()
    # show the time it took to process 1 frame
    print(f"Time to process 1 frame: {(end - start).total_seconds() * 1000:.0f} milliseconds")
    # calculate the frame per second and draw it on the frame
    fps = f"FPS: {1 / (end - start).total_seconds():.2f}"
    cv2.putText(frame, fps, (50, 50),
                cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 255), 8)

    # show the frame to our screen
    cv2.imshow("Frame", frame)
    writer.write(frame)
    if cv2.waitKey(1) == ord("q"):
        break

video_cap.release()
writer.release()
cv2.destroyAllWindows()

In the code snippet above, we are updating the tracker with the new detections and then looping over the tracks.

If the track is not confirmed, we ignore it. Otherwise, we get the track id and the bounding box, and draw both on the frame.
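Why would a track not be confirmed? In DeepSORT-style trackers a new track starts out as tentative and only becomes confirmed after it has been matched in several consecutive frames. In deep-sort-realtime this count is, as far as I know, the n_init argument of DeepSort, which I believe defaults to 3. The toy class below is my own simplified model of that rule, not the library's code.

```python
class ToyTrackState:
    """Simplified model of the tentative/confirmed life cycle; not the library's code."""

    def __init__(self, n_init):
        self.n_init = n_init
        self.hits = 1  # the detection that created the track counts as the first hit
        self.confirmed = False

    def mark_hit(self):
        """Record one more frame in which the track was matched to a detection."""
        self.hits += 1
        if self.hits >= self.n_init:
            self.confirmed = True

    def is_confirmed(self):
        return self.confirmed


track = ToyTrackState(n_init=3)
history = [track.is_confirmed()]
for _ in range(3):
    track.mark_hit()
    history.append(track.is_confirmed())
print(history)  # [False, False, True, True]
```

This is why the box of a newly appearing object shows up a few frames late: skipping unconfirmed tracks trades a short delay for fewer boxes caused by one-off false detections.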

Let's see how the tracking algorithm performs on the video:

As you can see, each person is assigned a unique id. Notice how the DeepSORT algorithm is able to keep track of the little boy even when he is hidden behind the second person.

The frame rate dropped to ~5 fps. Also, if you check the terminal, you will see that it takes around 200 milliseconds to process 1 frame. But don't forget that this time includes the detection part as well.
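If you want to know how much of that time goes to detection and how much to tracking, you can time the two stages separately. Here is a minimal sketch using only the standard library; the timed() context manager and the stage names are my own, and time.sleep() stands in for the real calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the `with` block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


# stand-ins for the real calls, e.g. model(frame) and tracker.update_tracks(...)
with timed("detection"):
    time.sleep(0.01)
with timed("tracking"):
    time.sleep(0.02)

for stage, seconds in stage_seconds.items():
    print(f"{stage}: {seconds * 1000:.0f} ms")
```

In the main loop you would wrap the model call and the update_tracks call in their own with blocks and print the totals after the loop ends.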

Using the GPU will certainly improve the performance but my goal here was to show you how to use the DeepSORT tracker with the YOLOv8 object detector.

Summary

So there you have it! We have successfully implemented DeepSORT with YOLOv8 to perform object detection and tracking in a video.

By combining the power of YOLOv8's accurate object detection with DeepSORT's robust tracking algorithm, we are able to identify and track objects even in challenging scenarios such as occlusion or partial visibility.

I hope that you found this tutorial helpful in understanding how to implement object detection and tracking with YOLOv8 and DeepSORT. If you have any questions or feedback, please let me know in the comments below!

You can find the complete code for this tutorial here.


Happy coding ♥
