
Creating a YOLO Core ML Object Detector with Decoding Logic

25 Nov 2020 · CPOL · 4 min read
In this article, we’ll include the decoding of YOLO v2 results into the Core ML model.

Introduction

This series assumes that you are familiar with Python, Conda, and ONNX, and that you have some experience developing iOS applications in Xcode. You are welcome to download the source code for this project. We’ll run the code using macOS 10.15+, Xcode 11.7+, and iOS 13+.

Shrinking the Model

To save memory on an iOS device without negatively impacting our model’s performance, we can reduce its weights from 32-bit to 16-bit precision. Note that when the model executes on the GPU or the Neural Engine of an iOS device – and it should – it runs with 16-bit floats anyway. Only when running on the CPU can the 32-bit precision make a difference.

Let’s get to it:

Python
import coremltools as ct
import numpy as np

# Load the converted model and quantize its weights to 16-bit floats
model_converted = ct.models.MLModel('./models/yolov2-coco-9.mlmodel')
model_converted = ct.models.neural_network.quantization_utils.quantize_weights(
    model_converted, 
    nbits=16, 
    quantization_mode='linear')
model_converted.save('./models/yolov2-16.mlmodel')

Building the YOLO Decoder

We have two options: add decoder layers to the existing model or create a separate one, and then use a pipeline to connect the two. Let’s choose the latter option.

We’ll start by creating a new NeuralNetworkBuilder instance and mapping inputs and outputs of the new decoder model:

Python
from coremltools.models import datatypes

# Reuse the converted model's spec; its first output ('218') becomes the decoder's input
spec = model_converted.get_spec()

input_features = [ (spec.description.output[0].name, datatypes.Array(1, 425, 13, 13)) ]
output_features = [ ('all_scores', datatypes.Array(1, 845, 80)),
                    ('all_boxes', datatypes.Array(1, 845, 4)) ]

builder = ct.models.neural_network.NeuralNetworkBuilder(
    input_features, 
    output_features, 
    disable_rank5_shape_mapping=True
)

builder.spec.description.input[0].ParseFromString(spec.description.output[0].SerializeToString())

Next, we define constants required for calculations:

Python
GRID_SIZE = 13
CELL_SIZE = 1 / GRID_SIZE 
BOXES_PER_CELL = 5
NUM_CLASSES = 80

# Anchor box widths and heights (in grid-cell units) of YOLO v2 trained on COCO
ANCHORS_W = np.array([0.57273, 1.87446, 3.33843, 7.88282, 9.77052]).reshape(1, 1, 5)
ANCHORS_H = np.array([0.677385, 2.06253, 5.47434, 3.52778, 9.16828]).reshape(1, 1, 5)

# Column (CX) and row (CY) index of each of the 169 grid cells
CX = np.tile(np.arange(GRID_SIZE), GRID_SIZE).reshape(1, 1, GRID_SIZE**2, 1)
CY = np.tile(np.arange(GRID_SIZE), GRID_SIZE).reshape(1, GRID_SIZE, GRID_SIZE).transpose()
CY = CY.reshape(1, 1, GRID_SIZE**2, 1)

Note the CELL_SIZE value above. To use our model with the Vision framework, we need to scale the bounding box coordinates from grid-cell units to the [0-1] range.
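To see what this scaling does, here is the arithmetic outside of Core ML (a plain NumPy illustration, not part of the model): a box centered in the middle cell of the 13×13 grid lands at 0.5 in normalized coordinates.

Python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

cx, tx = 6, 0.0                  # 7th grid column, raw x offset of 0
x = (cx + sigmoid(tx)) / 13      # sigmoid(0) = 0.5, so x = 6.5 / 13
print(x)                         # 0.5 - the box center in [0-1] coordinates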

To use the defined constants for calculations, we add them to the network:

Python
builder.add_load_constant_nd('CX', output_name='CX', constant_value=CX, shape=CX.shape)
builder.add_load_constant_nd('CY', output_name='CY', constant_value=CY, shape=CY.shape)
builder.add_load_constant_nd('ANCHORS_W', output_name='ANCHORS_W', constant_value=ANCHORS_W, shape=ANCHORS_W.shape)
builder.add_load_constant_nd('ANCHORS_H', output_name='ANCHORS_H', constant_value=ANCHORS_H, shape=ANCHORS_H.shape)

Now we are ready to add layers to our Core ML model. In most cases, this will be a direct conversion of the code from the previous article, keeping the same variable/node names whenever possible, although Core ML quirks will occasionally force small changes. To keep the article readable, some obvious code sequences are not included here; see the code download for the complete solution.

We start with the layers corresponding to the first two transformations from the previous (vectorized) implementation:

Python
builder.add_transpose(
    'yolo_trans_node', 
    axes=(0,2,3,1), 
    input_name='218', 
    output_name='yolo_transp')

builder.add_reshape_static(
    'yolo_reshap', 
    input_name='yolo_transp',
    output_name='yolo_reshap',
    output_shape=(1, GRID_SIZE**2, BOXES_PER_CELL, NUM_CLASSES + 5)
)
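For reference, the NumPy equivalent of these two layers from the previous vectorized implementation would look roughly like this (a sketch, with preds holding the raw (1, 425, 13, 13) YOLO output):

Python
preds_t = preds.transpose(0, 2, 3, 1)           # (1, 13, 13, 425)
raw_preds = preds_t.reshape(1, 13 * 13, 5, 85)  # (1, 169, 5, 5 + 80)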

When we create a new layer using the NeuralNetworkBuilder instance, we need to specify a unique name for the node and its output_name ("yolo_trans_node" and "yolo_transp" in the first operation above, respectively). The input_name value must correspond to the existing output_name ("218" in this case, which is the output of our converted YOLO v2 model).

To extract the encoded boxes and confidence values, we need to split the input array:

Python
builder.add_split_nd(
    'split_boxes_node', 
    input_name='yolo_reshap',
    output_names=['tx', 'ty', 'tw', 'th', 'tc', 'classes_raw'],    
    axis=3,
    split_sizes=[1, 1, 1, 1, 1, 80])

This operation slices the raw_preds array into tx, ty, tw, th, tc, and classes_raw arrays from the previous article.
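In NumPy terms, the split corresponds to something like this (a sketch over the (1, 169, 5, 85) array):

Python
tx, ty, tw, th, tc, classes_raw = np.split(raw_preds, [1, 2, 3, 4, 5], axis=3)
# five (1, 169, 5, 1) coordinate/confidence slices and one (1, 169, 5, 80) class slice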

Unfortunately, the rest of the code will be much more verbose, because we need a separate node for each basic arithmetic operation. This leads to a situation where a simple line from our vectorized decoder:

Python
x = ((CX + sigmoid(tx)) * CELL_SIZE).reshape(-1)

becomes:

Python
builder.add_reshape_static('tx:1', input_name='tx', output_name='tx:1', output_shape=(1,169,5))
builder.add_activation('tx:1_sigm', non_linearity='SIGMOID', input_name='tx:1', output_name='tx:1_sigm')
builder.add_add_broadcastable('tx:1_add', input_names=['CX', 'tx:1_sigm'], output_name='tx:1_add')
builder.add_elementwise('x', input_names=['tx:1_add'], output_name='x', mode='MULTIPLY', alpha=CELL_SIZE)

Note that, to make the code shorter and more readable, we use the explicit value "169" instead of GRID_SIZE**2 and "5" instead of BOXES_PER_CELL in the output shape parameter. The same applies to "80" instead of NUM_CLASSES in some other places. In a proper, flexible solution we should stick to the named constants, of course.

Identical operations are required to calculate y, this time using ty and the CY constant.
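A minimal sketch of that block (the intermediate node names are assumptions, following the same convention as the x code above):

Python
builder.add_reshape_static('ty:1', input_name='ty', output_name='ty:1', output_shape=(1,169,5))
builder.add_activation('ty:1_sigm', non_linearity='SIGMOID', input_name='ty:1', output_name='ty:1_sigm')
builder.add_add_broadcastable('ty:1_add', input_names=['CY', 'ty:1_sigm'], output_name='ty:1_add')
builder.add_elementwise('y', input_names=['ty:1_add'], output_name='y', mode='MULTIPLY', alpha=CELL_SIZE)

Then we have very similar code to calculate the bounding box width (w):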

Python
builder.add_reshape_static('tw:1', input_name='tw', output_name='tw:1', output_shape=(1,169,5))
builder.add_unary('tw:1_exp', input_name='tw:1', output_name='tw:1_exp', mode='exp')
builder.add_multiply_broadcastable('tw:1_mul', input_names=['tw:1_exp', 'ANCHORS_W'], output_name='tw:1_mul')
builder.add_elementwise('w', input_names=['tw:1_mul'], output_name='w', mode='MULTIPLY', alpha=CELL_SIZE)

Subsequent calculation of h is, again, very similar, with the exception of using the ANCHORS_H constant instead of ANCHORS_W.
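A sketch of that block (intermediate node names are, again, assumptions mirroring the w code above):

Python
builder.add_reshape_static('th:1', input_name='th', output_name='th:1', output_shape=(1,169,5))
builder.add_unary('th:1_exp', input_name='th:1', output_name='th:1_exp', mode='exp')
builder.add_multiply_broadcastable('th:1_mul', input_names=['th:1_exp', 'ANCHORS_H'], output_name='th:1_mul')
builder.add_elementwise('h', input_names=['th:1_mul'], output_name='h', mode='MULTIPLY', alpha=CELL_SIZE)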

Finally, we decode the box_confidence and classes_confidence values:

Python
builder.add_reshape_static('tc:1', input_name='tc', output_name='tc:1', output_shape=(1,169*5,1))
builder.add_activation('box_confidence', non_linearity='SIGMOID', input_name='tc:1', output_name='box_confidence')
builder.add_reshape_static('classes_raw:1', input_name='classes_raw', output_name='classes_raw:1', output_shape=(1,169*5,80))
builder.add_softmax_nd('classes_confidence', input_name='classes_raw:1', output_name='classes_confidence', axis=-1)
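For reference, in the vectorized NumPy decoder these two outputs boil down to a sigmoid and a softmax (a sketch over the arrays from the split above, with sigmoid as defined in the earlier snippet):

Python
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

box_confidence = sigmoid(tc).reshape(1, 169 * 5, 1)
classes_confidence = softmax(classes_raw.reshape(1, 169 * 5, 80))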

In the YOLO v2 prediction decoding described in the previous articles, we returned a single, most probable class for each box. The Vision framework, however, expects us to return the confidence of each of the 80 classes for each box:

Python
builder.add_multiply_broadcastable(
    'combined_classes_confidence', 
    input_names=['box_confidence', 'classes_confidence'],
    output_name='combined_classes_confidence')

Now, we have all the values we need. Next, let’s format these values for the Vision framework into two arrays: one with the coordinates of all of the bounding boxes (with four columns per box), and the second one with the confidence calculated for each box/class combination (with 80 columns per box).

It is not a difficult task, but because we need to handle each transformation as a separate operation, it again leads to verbose code:

Python
builder.add_reshape_static('x:1', input_name='x', output_name='x:1', output_shape=(1,169*5,1))
builder.add_reshape_static('y:1', input_name='y', output_name='y:1', output_shape=(1,169*5,1))
builder.add_reshape_static('w:1', input_name='w', output_name='w:1', output_shape=(1,169*5,1))
builder.add_reshape_static('h:1', input_name='h', output_name='h:1', output_shape=(1,169*5,1))

builder.add_stack(
    'all_boxes:0', 
    input_names=['x:1', 'y:1', 'w:1', 'h:1'], 
    output_name='all_boxes:0', 
    axis=2)

builder.add_reshape_static(
    'all_boxes', 
    input_name='all_boxes:0', 
    output_name='all_boxes',
    output_shape=(1,169*5, 4))

builder.add_reshape_static(
    'all_scores', 
    input_name='combined_classes_confidence', 
    output_name='all_scores',
    output_shape=(1,169*5, 80))

With the all_scores and all_boxes arrays formatted, we can map these arrays to the model’s outputs and save the model itself:

Python
builder.set_output(
    output_names= ['all_scores', 'all_boxes'],
    output_dims= [(845,80), (845,4)])

model_decoder = ct.models.MLModel(builder.spec)
model_decoder.save('./models/yolov2-decoder.mlmodel')
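Before wiring the decoder into a pipeline, a quick sanity check on macOS might look like this (random input is enough here, because we only verify that the graph runs and produces the expected shapes; '218' is the input name copied from the YOLO model's output):

Python
dummy = {'218': np.random.rand(1, 425, 13, 13).astype(np.float32)}
result = model_decoder.predict(dummy)
print(result['all_boxes'].shape)   # expect 845 boxes with 4 coordinates each
print(result['all_scores'].shape)  # expect 845 rows of 80 class confidences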

Next Steps

It was a lot of code, but we got to the end. Now we have a Core ML model that can decode YOLO v2 predictions. However, we cannot use it until we connect it to YOLO’s output. In the next article, we’ll create a Core ML pipeline to serve as our end-to-end model.

This article is part of the series 'Mobile Neural Networks on iOS with Core ML - Part 2'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect
Poland
Jarek has two decades of professional experience in software architecture and development, machine learning, business and system analysis, logistics, and business process optimization.
He is passionate about creating software solutions with complex logic, especially with the application of AI.
