When working with an object detection algorithm, the basic approach is to locate each object of interest by drawing a bounding box around it. Since an image may contain several objects of interest and their number is not known beforehand, the output layer would need a variable length, which means the object detection problem cannot be solved by a standard deep neural network built from fully connected layers. One workaround is to take different regions of interest from the image and use a neural network to detect the presence of the desired object within each region. This approach fails as well: the desired objects can have different aspect ratios and locations within the image, so the number of regions to check becomes massive and the computation blows up.
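To get a feel for how quickly the number of candidate regions grows, here is a minimal sketch that counts the positions of an exhaustive sliding-window search. The helper name count_windows, the scales, the aspect ratios, and the stride are all assumptions chosen for illustration; the image size is the one from the sample annotation shown later in this article.

```python
def count_windows(img_w, img_h, window_sizes, stride):
    """Count sliding-window positions over all (width, height) window sizes."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > img_w or win_h > img_h:
            continue  # this window does not fit into the image
        cols = (img_w - win_w) // stride + 1
        rows = (img_h - win_h) // stride + 1
        total += cols * rows
    return total

# Hypothetical search grid: three scales, three aspect ratios, a stride of 8 pixels
scales = [64, 128, 256]
ratios = [0.5, 1.0, 2.0]
sizes = [(int(s * r ** 0.5), int(s / r ** 0.5)) for s in scales for r in ratios]

# 594 x 720 is the size of the sample image annotated later in this article
print(count_windows(594, 720, sizes, stride=8))
```

Even this coarse grid yields tens of thousands of windows for a single image, and each one would need its own pass through the classifier.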

To solve the problem, algorithms like R-CNN, Fast R-CNN, and YOLO have been developed. In this article, we’ll implement R-CNN to detect humans in a given image.

R-CNN

Regions with CNN (R-CNN) was introduced by Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik in 2014. The idea is that instead of running detection on a huge number of regions, we pass the image through selective search to extract just 2000 regions, called region proposals. We can then work with these 2000 proposed regions instead of trying to classify a huge number of regions. Next, we calculate the intersection over union (IOU) between each proposed region and the ground truth boxes and use it to label the region. To make sense of everything, we’re going to implement R-CNN from scratch using Keras here, but we’ll definitely get into more details of R-CNN later in the series.

Preparing the Dataset for Object Detection

We’re going to use the INRIAPerson dataset, which is available on Kaggle. The dataset has two subdirectories containing Test and Train data, and both hold images along with their associated annotations. Image annotations label the objects in an image so that AI and ML models can learn to recognize them. In general, annotated images may contain humans, vehicles, or any other type of object. The INRIAPerson dataset, however, was created specifically for detecting persons and thus only contains annotations for persons. Taking a look at the following annotation file will make the idea clear.

<?xml version="1.0" ?>
<annotation>
   	<folder>VOC2007</folder>
   	<filename>crop_000010.png</filename>
   	<source>
          	<database>PASperson Database</database>
          	<annotation>PASperson</annotation>
   	</source>
   	<size>
          	<width>594</width>
          	<height>720</height>
          	<depth>3</depth>
   	</size>
   	<segmented>0</segmented>
   	<object>
          	<name>person</name>
          	<pose>Unspecified</pose>
          	<truncated>0</truncated>
          	<difficult>0</difficult>
          	<bndbox>
                 	<xmin>194</xmin>
                 	<ymin>127</ymin>
                 	<xmax>413</xmax>
                 	<ymax>647</ymax>
          	</bndbox>
   	</object>
</annotation>

We will use these annotations to make the objects recognizable to our model. But before we move forward, we first need to parse these annotations into CSV files and extract the data we need. Python provides the ElementTree API for parsing XML files. Below is a function that loads and parses the XML annotation files. For each annotation file, it writes a CSV file with the number of objects on the first line, followed by one line of xmin ymin xmax ymax values per person.

import glob
import os
import xml.etree.ElementTree as ET

def parse_xml_to_csv(path):
    # iterate over all files to extract the bounding box of every person in the corresponding image
    for xml_annot in glob.glob(os.path.join(path, '*.xml')):
        # load and parse the file, then get the root of the document
        root = ET.parse(xml_annot).getroot()
        people = root.findall('object')
        # first line: number of objects; following lines: "xmin ymin xmax ymax"
        lines = [str(len(people))]
        for person in people:
            box = person.find('bndbox')
            lines.append(" ".join(box.find(tag).text for tag in ('xmin', 'ymin', 'xmax', 'ymax')))
        # write the CSV next to the XML file, using the same base name
        with open(os.path.splitext(xml_annot)[0] + ".csv", "w") as file:
            file.write("\n".join(lines))

Call the above function, passing the path to annotations files as the argument:

annot_path = "./Annotations"
# the function writes the CSV files to disk and returns nothing
parse_xml_to_csv(annot_path)

Once the function finishes, you can see all the converted csv files.
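As a quick check of the generated format, the sketch below reads one such file back into a list of boxes. It works on an in-memory string holding the values from the annotation shown earlier, so it runs without the dataset. The helper name read_boxes is hypothetical, and reading with header=None is just one possible choice.

```python
import io
import pandas as pd

# What the CSV for the annotation shown earlier would contain:
# the object count on the first line, then "xmin ymin xmax ymax" per person.
csv_text = "1\n194 127 413 647"

def read_boxes(source):
    """Return a list of (xmin, ymin, xmax, ymax) tuples from one annotation CSV."""
    # header=None keeps the count line as an ordinary row instead of a column name
    frame = pd.read_csv(source, header=None)
    count = int(frame.iloc[0, 0])
    boxes = [tuple(int(v) for v in str(line).split(" ")) for line in frame.iloc[1:, 0]]
    if len(boxes) != count:
        raise ValueError("count line does not match the number of boxes")
    return boxes

print(read_boxes(io.StringIO(csv_text)))
```

Passing a file path instead of the StringIO object reads a real file in the same way.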

Implementing R-CNN Using Keras

With our data in place, we can move forward with implementing R-CNN. First, let’s import all the libraries we’re going to work with.

import os                        # to interact with OS
import cv2                       # to perform selective search on images
import keras                     # to implement neural net
import numpy as np               # to work with arrays
import pandas as pd              # for data manipulation
import tensorflow as tf          # for deep learning models
import matplotlib.pyplot as plt  # for plotting

As we mentioned earlier, searching for regions of interest is computationally expensive, so we will use a more efficient solution here. Selective search computes similarity based on color, texture, size, and shape, and hierarchically groups the most similar regions. This process continues until the whole image becomes a single region. OpenCV implements selective search through the createSelectiveSearchSegmentation function, which is part of the ximgproc module in the opencv-contrib package. Add the optimization and selective search to your solution as follows:

# OpenCV optimization
cv2.setUseOptimized(True)
# selective search
selective_search = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()

Now if we apply selective search to a test image, it returns a set of candidate bounding boxes, some of which enclose the desired object.

At this point, we’re interested in how accurate our bounding boxes currently are. For that, we can simply use intersection over union (IOU), which is an evaluation metric that measures the accuracy of object detectors. It can be calculated by computing the area of overlap (area of intersection) between the predicted bounding box and the ground-truth bounding box divided by the total area bounded by both (area of union):

def compute_iou(box1, box2):
    # corners of the intersection rectangle
    x_left = max(box1['x1'], box2['x1'])
    y_top = max(box1['y1'], box2['y1'])
    x_right = min(box1['x2'], box2['x2'])
    y_bottom = min(box1['y2'], box2['y2'])

    # the boxes do not overlap at all
    if x_right < x_left or y_bottom < y_top:
        return 0.0

    intersection_area = (x_right - x_left) * (y_bottom - y_top)

    box1_area = (box1['x2'] - box1['x1']) * (box1['y2'] - box1['y1'])
    box2_area = (box2['x2'] - box2['x1']) * (box2['y2'] - box2['y1'])

    union_area = box1_area + box2_area - intersection_area

    iou = intersection_area / union_area

    return iou
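Since every image yields up to 2000 proposals, it can help to compute the IOU of one ground-truth box against all proposals at once. The following is a minimal NumPy sketch of the same formula. The function name iou_many and the plain (x1, y1, x2, y2) tuple layout are assumptions made for this example, and differ from the dictionaries used above.

```python
import numpy as np

def iou_many(gt_box, boxes):
    """IOU of one ground-truth box against an (N, 4) array of [x1, y1, x2, y2] boxes."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    gx1, gy1, gx2, gy2 = gt_box
    # width and height of the overlap, clipped at 0 for boxes that do not intersect
    inter_w = np.clip(np.minimum(boxes[:, 2], gx2) - np.maximum(boxes[:, 0], gx1), 0, None)
    inter_h = np.clip(np.minimum(boxes[:, 3], gy2) - np.maximum(boxes[:, 1], gy1), 0, None)
    inter = inter_w * inter_h
    area_gt = (gx2 - gx1) * (gy2 - gy1)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_gt + areas - inter)

gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10),    # identical box
             (5, 5, 15, 15),    # partial overlap: 25 / (100 + 100 - 25)
             (20, 20, 30, 30)]  # no overlap
print(iou_many(gt, proposals))
```

The three proposals cover the typical cases: a perfect match gives 1, a partial overlap gives a value in between, and disjoint boxes give 0.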

Now we need to pre-process the data to create a dataset that can be passed to our model. We will iterate over all the images and set them as the base for selective search. We’ll then iterate over the first 2000 proposed regions resulting from selective search and calculate the IOU so we can annotate the region of our desired object (a human). The images will be labeled based on the presence of the object and will be appended to our training_images array.

# path:  directory containing the images (must be defined before this runs)
# annot: directory containing the CSV files created above
training_images = []
training_labels = []
for i in os.listdir(annot):
    # skip anything that is not a CSV file, such as the original XML annotations
    if not i.endswith(".csv"):
        continue
    try:
        filename = i.split(".")[0] + ".png"
        img = cv2.imread(os.path.join(path, filename))
        dataframe = pd.read_csv(os.path.join(annot, i))
        ground_truth_values = []
        for row in dataframe.iterrows():
            # each row holds "xmin ymin xmax ymax" separated by spaces
            x1, y1, x2, y2 = [int(v) for v in row[1].iloc[0].split(" ")]
            ground_truth_values.append({"x1": x1, "x2": x2, "y1": y1, "y2": y2})

        # setting the image as base image for selective search
        selective_search.setBaseImage(img)

        # initializing fast selective search
        selective_search.switchToSelectiveSearchFast()

        # getting proposed regions
        ssresults = selective_search.process()
        imout = img.copy()
        counter = 0    # positive samples collected from this image
        f_counter = 0  # negative samples collected from this image

        # iterating over the first 2000 results from selective search to calculate IOU
        for result in ssresults[:2000]:
            # stop once both the positive and the negative limit are reached
            if counter >= 20 and f_counter >= 20:
                break
            x, y, w, h = result
            for val in ground_truth_values:
                iou = compute_iou(val, {"x1": x, "x2": x + w, "y1": y, "y2": y + h})

                # setting IOU > 0.70 as goodness measure for positive, i.e. person detected,
                # and limiting the maximum positive samples to 20
                if counter < 20 and iou > 0.70:
                    image = imout[y:y+h, x:x+w]
                    resized = cv2.resize(image, (224, 224), interpolation=cv2.INTER_AREA)
                    training_images.append(resized)
                    training_labels.append(1)
                    counter += 1

                # treating IOU < 0.3 as negative, i.e. background,
                # and limiting the maximum negative samples to 20
                if f_counter < 20 and iou < 0.3:
                    image = imout[y:y+h, x:x+w]
                    resized = cv2.resize(image, (224, 224), interpolation=cv2.INTER_AREA)
                    training_images.append(resized)
                    training_labels.append(0)
                    f_counter += 1
    except Exception as err:
        print(err)
        continue
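The loop above compares a proposal with each ground-truth box separately. A common variant, sketched here as an assumption rather than as the method used in this article, labels a proposal by its best overlap with any ground-truth box, so a region that covers one person is never counted as background just because it misses another. The function name label_proposal is hypothetical; the thresholds are the ones used above.

```python
def label_proposal(ious, pos_thresh=0.70, neg_thresh=0.30):
    """Return 1 (person), 0 (background) or None (ambiguous, skip) for one proposal.

    `ious` holds the proposal's IOU with every ground-truth box in the image.
    """
    best = max(ious, default=0.0)  # no ground truth at all means pure background
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return None

print(label_proposal([0.82, 0.05]))  # overlaps one person well
print(label_proposal([0.10, 0.22]))  # overlaps nobody
print(label_proposal([0.50]))        # in between, left out of training
```

Regions with an IOU between the two thresholds are skipped in both versions, because they are neither clear positives nor clear negatives.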

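Now training_images and training_labels hold the inputs and labels for our model: the cropped, resized regions and their 0/1 labels. The code further down refers to them as X_new and y_new, which are these two lists converted to NumPy arrays. Let’s start with the imports for our model.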

from keras.layers import Dense
from keras import Model
from keras import optimizers
from keras.optimizers import Adam
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, EarlyStopping

The R-CNN model can technically be trained from scratch, but that takes a lot of time and tends to give poor results on a small dataset. Here, we’ll use transfer learning to save time and achieve better performance. Keras ships ImageNet weights for VGG16, which we use here; weights trained on another dataset such as COCO would have to be loaded from a weights file.

vggmodel = VGG16(weights='imagenet', include_top=True)
vggmodel.summary()

The above code fragment prints a summary of the model’s layers and their parameter counts.

Next, we’ll freeze the first ten layers of the model by setting trainable to false.

for layers in (vggmodel.layers)[:10]:
    layers.trainable = False

We’re only interested in the presence or absence of humans, which means we have only two classes to predict, so we’ll add a two-unit dense layer with softmax activation. The reason for using softmax activation is that it ensures the outputs sum to 1 (that is, the outputs are probabilities).

X= vggmodel.layers[-2].output
predictions = Dense(2, activation="softmax")(X)
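To illustrate why the two outputs can be read as probabilities, here is a small NumPy sketch of the softmax function applied to made-up scores. It is a standalone illustration, not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# two made-up rows of raw scores for the two classes
logits = np.array([[2.0, -1.0], [0.0, 0.0]])
probs = softmax(logits)
print(probs)
print(probs.sum(axis=1))  # every row sums to 1
```

Whatever the raw scores are, each row ends up non-negative and sums to 1, and equal scores give 0.5 for both classes.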

Finally, we’ll use Adam optimizer to compile the model.

model_final = Model(vggmodel.input, predictions)
model_final.compile(loss=keras.losses.categorical_crossentropy, optimizer=Adam(learning_rate=0.001), metrics=["accuracy"])

We’ve created our model. Before moving forward, we need to encode the dataset. We can use LabelBinarizer for encoding.

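from sklearn.preprocessing import LabelBinarizer

class Label_Binarizer(LabelBinarizer):
    def transform(self, y_old):
        Y = super().transform(y_old)
        if self.y_type_ == 'binary':
            # two columns: column 0 is the label itself, column 1 its complement
            return np.hstack((Y, 1-Y))
        else:
            return Y
    def inverse(self, Y):
        if self.y_type_ == 'binary':
            return super().inverse_transform(Y[:, 0])
        else:
            return super().inverse_transform(Y)

# labels collected during pre-processing, as a NumPy array
y_new = np.array(training_labels)
encoded = Label_Binarizer()
Y = encoded.fit_transform(y_new)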
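The effect of this encoding on 0/1 labels can be reproduced with plain NumPy. The sketch below uses a handful of made-up labels and mirrors the np.hstack((Y, 1-Y)) step, which shows that column 0 ends up being the "person" column.

```python
import numpy as np

labels = np.array([1, 0, 1, 0, 0])   # made-up labels: 1 = person, 0 = background

# column vector of labels, followed by its complement
Y = labels.reshape(-1, 1)
one_hot = np.hstack((Y, 1 - Y))
print(one_hot)

# the original labels can be read back from column 0
recovered = one_hot[:, 0]
print(recovered)
```

This column order matters later: a prediction for "person" is read from index 0 of the model output.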

We also need to split our dataset into training and test sets which can be done using train_test_split from sklearn. Here we're splitting the data into an 80% training, 20% testing ratio.

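from sklearn.model_selection import train_test_split

# cropped regions collected during pre-processing, as a NumPy array
X_new = np.array(training_images)
X_train, X_test, y_train, y_test = train_test_split(X_new, Y, test_size=0.20)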

Keras provides ImageDataGenerator to pass the dataset to the model in batches. It can also apply horizontal or vertical flips and rotations to augment the dataset.

train_data_prep = ImageDataGenerator(horizontal_flip=True, vertical_flip=True, rotation_range=90)
trainingdata = train_data_prep.flow(x=X_train, y=y_train)
test_data_prep = ImageDataGenerator(horizontal_flip=True, vertical_flip=True, rotation_range=90)
testingdata = test_data_prep.flow(x=X_test, y=y_test)

Adding Keras Callbacks

Training deep neural networks takes a lot of time, and we always risk wasting computational resources. To help with this, Keras offers two callbacks: EarlyStopping and ModelCheckpoint. EarlyStopping is checked at the end of every epoch and aborts training once the monitored metric has stopped improving for a given number of epochs (the patience), so you can set a generous number of epochs without having to run all of them. ModelCheckpoint is also called after every epoch and automatically saves the best-performing model. We can use both callbacks when training our model using fit_generator as follows:

# save_freq='epoch' checks once per epoch; an integer would mean "every N batches"
checkpoint = ModelCheckpoint("rcnn.h5", monitor='val_loss', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', save_freq='epoch')
early = EarlyStopping(monitor='val_loss', min_delta=0, patience=100, verbose=1, mode='auto')
# trainingdata and testingdata are the generators created in the previous step
hist = model_final.fit_generator(generator=trainingdata, steps_per_epoch=10, epochs=500, validation_data=testingdata, validation_steps=2, callbacks=[checkpoint, early])
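The patience rule behind early stopping is easy to trace by hand. The following is a simplified pure-Python sketch of that rule, not the Keras source; the function name early_stopping_epoch and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# made-up validation losses, one per epoch
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.64, 0.66, 0.68, 0.69]
print(early_stopping_epoch(losses, patience=3))
```

With a patience of 3, the brief improvement in epoch 6 resets the counter, so training continues until epoch 9. The patience of 100 used above is far more tolerant of such plateaus.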

Testing Our Model

After training, the best model is saved as rcnn.h5 and we’re in a good position to get predictions from it. We’ll follow the same steps as before: iterate over all images and set each one as the base image for selective search. We then pass the selective search results to our model for prediction and draw a bounding box wherever the model detects a human.

count = 0
for i in os.listdir(path):
    count += 1
    image = cv2.imread(os.path.join(path, i))
    selective_search.setBaseImage(image)
    selective_search.switchToSelectiveSearchFast()
    ssresults = selective_search.process()
    imout = image.copy()
    for res in ssresults[:2000]:
        x, y, w, h = res
        # crop from the untouched image so earlier rectangles do not end up in the crop
        test_image = image[y:y+h, x:x+w]
        resized = cv2.resize(test_image, (224, 224), interpolation=cv2.INTER_AREA)
        batch = np.expand_dims(resized, axis=0)
        out = model_final.predict(batch)
        # column 0 holds the probability of the "person" class
        if out[0][0] > 0.65:
            cv2.rectangle(imout, (x, y), (x+w, y+h), (0, 255, 0), 1, cv2.LINE_AA)
    plt.figure()
    # OpenCV loads images as BGR, while matplotlib expects RGB
    plt.imshow(cv2.cvtColor(imout, cv2.COLOR_BGR2RGB))
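Calling predict once per region is slow. One option is to stack all resized crops into a single array, score them in one predict call, and filter the results afterwards. The sketch below covers only that filtering step on made-up scores. The helper name select_detections is hypothetical, and it assumes the (N, 2) output layout used above, with the person probability in column 0.

```python
import numpy as np

def select_detections(regions, probs, threshold=0.65):
    """Keep regions whose person probability exceeds the threshold.

    regions: (N, 4) array of x, y, w, h as returned by selective search
    probs:   (N, 2) array of softmax outputs, column 0 = person
    Returns an (M, 4) array of x1, y1, x2, y2 corner coordinates.
    """
    regions = np.asarray(regions).reshape(-1, 4)
    probs = np.asarray(probs).reshape(-1, 2)
    keep = probs[:, 0] > threshold
    x, y, w, h = regions[keep].T
    return np.stack([x, y, x + w, y + h], axis=1)

regions = np.array([[10, 20, 50, 100], [0, 0, 30, 30], [40, 40, 20, 60]])
probs = np.array([[0.91, 0.09], [0.20, 0.80], [0.70, 0.30]])  # made-up scores
print(select_detections(regions, probs))
```

Each returned row can be passed to cv2.rectangle as its two corner points.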

Note: Results are from an early termination

Limitations of R-CNN

R-CNN comes with some drawbacks. Although selective search replaces an exhaustive sliding window with a fixed number of proposals, the network still needs to run a full forward pass of the CNN for each of the 2000 region proposals, and it has a complex multi-stage training pipeline, so it is slow both to train and to run. Because of the high testing time, R-CNN is impractical for real-time use or for crowded scenes.

What’s next?

In this article, we learned to implement our first custom object detector using deep neural networks in Keras. We also discussed some limitations of the approach. In the next article of the series, we will try to overcome the limitations imposed by R-CNN and will also get an estimate of the number of people present in an area.