Real-time AI Person Detection on Edge Devices: Getting Started

Sergey L. Gladkiy

Oct 5, 2020

CPOL

This is the first in an article series where we’ll show you how to detect people in real time (or near-real time) on Raspberry Pi.

Real-time detection of humans in live video streams has been a solved problem for several years now, thanks to Deep Learning (DL). But it’s not as easy on edge and IoT devices that don’t always have a lot of computing power.

In this article series, we’ll show you how to detect people in real time (or near-real time) on Raspberry Pi. This series assumes...

Historically, the first methods of object detection in images, such as Haar or LBP cascades and the Histogram of Oriented Gradients (HOG) classifier, were designed to find objects of a specific class. In one way or another, all of these were based on extracting geometric features and deciding whether the detected feature set belonged to an object of the specified class.

When Deep Neural Networks (DNN) entered the scene, they were immediately recognized as very efficient feature extractors. This motivated Artificial Intelligence (AI) researchers to develop object detection-oriented DNNs and efficient algorithms to train them. Today, almost all modern computer vision systems use DL algorithms for object detection. Typically, object detection systems run on high-performance CPUs, GPUs, or even TPUs – processors specially designed for DNNs.

To carry out person detection on devices with low computing power, we must use a device-appropriate AI approach. First, let’s analyze the advantages and drawbacks of the existing DL methods and select the one that will suit us best. As person detection is a special case of object detection, we’ll look at the general-purpose DL object detectors, and then see how to adapt them to our special case.

The following are the most established modern DL methods of object detection: Faster-RCNN, Single-Shot Detector (SSD), and You Only Look Once (YOLO). All three use Convolutional Neural Networks (CNN); they differ in the details of their internal network structures and in the algorithms used to train them.

Faster-RCNN was the first DL technique successfully used for real-time object detection in video streams. Faster-RCNN is a DNN featuring two major parts. The first part works as a feature extractor. The second part is a specially designed neural network that uses the extracted features to locate object positions and classify them.

Single-Shot Detector (SSD) is very similar to Faster-RCNN. It includes a convolutional network for feature extraction and a mechanism for object location and classification. The main difference is that SSD's mechanism localizes and classifies objects in a single forward pass of the network. This makes SSD faster and simpler to train.

YOLO can also be considered a single-shot detector. It divides an image into a grid of cells and then predicts the score for each cell belonging to a certain class.
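To make the grid idea concrete, here is a minimal sketch in Python. It is not YOLO's actual implementation: it only shows how an image can be divided into cells and how the cell containing an object's center is found (in the original YOLO design, that cell is the one responsible for predicting the object). The function name `grid_cell_for_center`, the 7x7 grid size, and the example coordinates are all assumptions chosen for illustration.

```python
def grid_cell_for_center(cx, cy, image_w, image_h, grid_size=7):
    """Return the (row, col) of the grid cell that contains the point (cx, cy).

    grid_size=7 is an assumed value used only for this illustration.
    """
    if not (0 <= cx < image_w and 0 <= cy < image_h):
        raise ValueError("center must lie inside the image")
    cell_w = image_w / grid_size   # width of one cell in pixels
    cell_h = image_h / grid_size   # height of one cell in pixels
    # Integer division by the cell size gives the cell index;
    # min() guards against floating-point rounding at the far edge.
    col = min(int(cx / cell_w), grid_size - 1)
    row = min(int(cy / cell_h), grid_size - 1)
    return row, col


# Hypothetical example: a person centered at pixel (320, 120) in a 448x448 image.
row, col = grid_cell_for_center(320, 120, 448, 448)
print(row, col)
```

A real detector would then output, for each such cell, class scores and bounding box coordinates; those parts are omitted here.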

We’ll review the advantages and drawbacks of these DL technologies in the next article of the series. Meanwhile, let’s consider how we can apply the DL methods to our specific goal: person detection on edge devices.

We have two main goals:

Detect only humans
Use limited computational resources

There are two ways to meet both of the above requirements using DL methods:

1. Design and train a new CNN for detecting only humans and optimize it for the size and speed the edge devices can provide
2. Find an existing CNN model for object detection that can be used on the edge devices and apply it for detecting humans

Professional software developers would probably prefer the first way, for example when building a video surveillance system that requires human detection. It would let us create production software optimized for running on Raspberry Pi devices. At the same time, it would require a lot of time and effort to design the CNN architecture from scratch, acquire a dataset of humans, and train the network.

The second way is much simpler. We just need to find an existing DNN that can run on a device with limited computing resources. It requires neither much effort nor deep immersion in DL technologies. Since our goal is to showcase person detection on edge devices, not to create the best CNN in the world, we’ll take the easy way.
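The core of the second approach can be sketched in a few lines of Python: run a general-purpose detector, then keep only the detections labeled as a person. The sketch below assumes a hypothetical detection format (a dictionary with `label`, `confidence`, and `box` keys) and a made-up confidence threshold of 0.5; real models report classes and boxes in their own formats, so treat this as an illustration of the filtering step only.

```python
def keep_people(detections, person_label="person", min_confidence=0.5):
    """Keep only detections of the person class above a confidence threshold.

    The dictionary layout and the default threshold are assumptions
    made for this example, not the output format of any specific model.
    """
    return [
        d for d in detections
        if d["label"] == person_label and d["confidence"] >= min_confidence
    ]


# Made-up detections, as a general-purpose object detector might report them.
detections = [
    {"label": "person", "confidence": 0.91, "box": (34, 50, 120, 300)},
    {"label": "dog", "confidence": 0.88, "box": (200, 220, 90, 80)},
    {"label": "person", "confidence": 0.30, "box": (400, 60, 40, 110)},
]

people = keep_people(detections)
print(len(people), "person detection(s) kept")
```

Raising or lowering `min_confidence` trades missed people against false alarms, so it is worth tuning on your own images.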

However, even our "easy" way requires specialized computer vision and DL software, as well as some custom code. We’ll use Python with OpenCV as the computer vision library. First, we’ll write Python code that uses CNNs with OpenCV and test it on detecting humans in images.

Then, we’ll write more code – for launching person detection on a Raspberry Pi 3 Model B device. We hope you’ll follow along; if you do, you’ll need to install OpenCV for Python on your computer.
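If you are not sure whether OpenCV for Python is already installed, a quick check like the following may help. It is only a sketch: the helper name `opencv_status` is made up for this example, and the `pip install opencv-python` hint is one common installation route that may not be the best choice for every platform.

```python
def opencv_status():
    """Report whether OpenCV's Python bindings (the `cv2` module) can be imported."""
    try:
        import cv2
    except ImportError:
        # Assumption: installing from PyPI is acceptable on your machine.
        return "OpenCV not found; one common way to install it is: pip install opencv-python"
    return f"OpenCV version: {cv2.__version__}"


print(opencv_status())
```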