The beginner’s guide to implementing YOLOv3 in TensorFlow 2.0 (part-2)

Posted by Rahmad Sadli on December 27, 2019 in Deep Learning, Machine Learning, Object Detection, Python Programming

In part 1, we discussed the YOLOv3 algorithm. Now, it's time to dive into the technical details of implementing YOLOv3 in TensorFlow 2.

The code for this tutorial, designed to run on Python 3.7 and TensorFlow 2.0, can be found in my GitHub repo.

This tutorial was inspired by Ayoosh Kathuria's great article on implementing YOLOv3 in PyTorch, published on Paperspace's blog (credit link at the end of this tutorial).

YOLOv3 comes with 2 important files: yolov3.cfg and yolov3.weights. The file yolov3.cfg describes the YOLOv3 architecture and its parameters, whereas yolov3.weights contains the pre-trained weights of its convolutional neural network (CNN).

Specifically, in this part, we’ll focus only on the file yolov3.cfg, while the file yolov3.weights will be discussed in the next part.

So, what we're going to do now is parse the file yolov3.cfg, read all of its parameters, and construct the YOLOv3 network based on them.

Grab your hot drink and let's get into it…

Preparation

Creating a Project Directory and Files

The first thing we need to do is create a project directory. I personally name it PROJECTS because I keep several projects under it. Feel free to use another name if you prefer, but I suggest doing the same as I did so that you can follow this tutorial easily.

Under PROJECTS, create a directory named YOLOv3_TF2. This is the directory where we’ll be working.

Now, under the YOLOv3_TF2 directory, let's create 4 subdirectories, namely: img, cfg, data, and weights.

Still under the YOLOv3_TF2 directory, create 5 Python files:

  • yolov3.py,
  • convert_weights.py,
  • utils.py,
  • image.py, and
  • video.py.

Specifically, in this part, we'll only work on the file yolov3.py and leave all the others empty for the moment.

Downloading files yolov3.cfg, yolov3.weights, and coco.names

Here are the links to download the files yolov3.cfg, yolov3.weights, and coco.names:

Save the files yolov3.cfg, yolov3.weights, and coco.names to the subdirectories cfg, weights, and data, respectively.

yolov3.py

Importing the necessary packages

Open yolov3.py and import TensorFlow and the Keras Model. We also import the Keras layers we need: BatchNormalization, Conv2D, Input, ZeroPadding2D, LeakyReLU, and UpSampling2D. We'll use them all when we build the YOLOv3 network.

Copy the following lines to the top of the file yolov3.py.

#yolov3.py
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import BatchNormalization, Conv2D, \
    Input, ZeroPadding2D, LeakyReLU, UpSampling2D

Parsing the configuration file

The code below is a function called parse_cfg() with a parameter named cfgfile, used to parse the YOLOv3 configuration file yolov3.cfg.

def parse_cfg(cfgfile):
    with open(cfgfile, 'r') as file:
        lines = [line.rstrip('\n') for line in file if line != '\n' and line[0] != '#']
    holder = {}
    blocks = []
    for line in lines:
        if line[0] == '[':
            line = 'type=' + line[1:-1].rstrip()
            if len(holder) != 0:
                blocks.append(holder)
                holder = {}
        key, value = line.split("=")
        holder[key.rstrip()] = value.lstrip()
    blocks.append(holder)
    return blocks

Let’s explain this code.

First, we open the cfgfile and read it line by line, skipping blank lines and comment lines starting with '#', and stripping the trailing newline characters.

The variable lines now holds all the remaining lines of the file yolov3.cfg, so we loop over it to read every single line.

The for loop goes through lines and reads every attribute, storing them all in the list blocks. This is done block by block: each block's attributes and values are first stored as key-value pairs in the dictionary holder. Whenever a new block header (a line starting with '[') is found, the current holder is appended to blocks and then emptied, ready for the next block. Once all the lines have been read, the last holder is appended and the list blocks is returned.
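
To make the output format concrete, here is a minimal standalone sketch. The tiny cfg below is made up purely for illustration (it is not the real yolov3.cfg), and the sketch assumes the parse_cfg() function above is already defined in the same session.

#illustration only, not one of the tutorial files
sample_cfg = """[net]
batch=64
width=416

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky
"""

with open('sample.cfg', 'w') as f:
    f.write(sample_cfg)

print(parse_cfg('sample.cfg'))
# [{'type': 'net', 'batch': '64', 'width': '416'},
#  {'type': 'convolutional', 'batch_normalize': '1', 'filters': '32',
#   'size': '3', 'stride': '1', 'pad': '1', 'activation': 'leaky'}]

Note that every value comes back as a string, so we'll have to convert the numeric ones (filters, size, stride, and so on) when we build the network.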

All right! We just finished a small piece of code. The next step is to create the YOLOv3 network function. Let's do it.

Building the YOLOv3 Network

We're still working on the file yolov3.py. The following is the beginning of the YOLOv3 network function, called YOLOv3Net. It takes three parameters: cfgfile, model_size, and num_classes. Just copy and paste the following lines under the previous function parse_cfg().

def YOLOv3Net(cfgfile, model_size, num_classes):

    blocks = parse_cfg(cfgfile)

    outputs = {}
    output_filters = []
    filters = []
    out_pred = []
    scale = 0

    inputs = input_image = Input(shape=model_size)
    inputs = inputs / 255.0

Let’s look at it…

First, we call the function parse_cfg() and store the returned attributes in the variable blocks. Here, blocks contains all the blocks read from the file yolov3.cfg.

Then, we define the model input with the Keras Input layer and divide it by 255 to normalize the pixel values to the range 0–1.

Next…

In general, YOLOv3 uses 5 layer types: "convolutional layer", "upsample layer", "route layer", "shortcut layer", and "yolo layer".

The following code performs an iteration over the list blocks. For every iteration, we check the type of the block, which corresponds to the type of layer.

    for i, block in enumerate(blocks[1:]):

Convolutional Layer

In YOLOv3, there are 2 types of convolutional layers, i.e. with and without a batch normalization layer. A convolutional layer followed by a batch normalization layer uses a Leaky ReLU activation; otherwise, it uses a linear activation. So, we must handle both cases at every iteration.

This is the code to perform the convolutional layer.

        # If it is a convolutional layer
        if (block["type"] == "convolutional"):

            activation = block["activation"]
            filters = int(block["filters"])
            kernel_size = int(block["size"])
            strides = int(block["stride"])

            if strides > 1:
                inputs = ZeroPadding2D(((1, 0), (1, 0)))(inputs)

            inputs = Conv2D(filters,
                            kernel_size,
                            strides=strides,
                            padding='valid' if strides > 1 else 'same',
                            name='conv_' + str(i),
                            use_bias=False if ("batch_normalize" in block) else True)(inputs)

            if "batch_normalize" in block:
                inputs = BatchNormalization(name='bnorm_' + str(i))(inputs)
                inputs = LeakyReLU(alpha=0.1, name='leaky_' + str(i))(inputs)

First, we check whether the type of the block is convolutional. If it is, we read the attributes associated with it; otherwise, we go check for another type (explained after this). In a convolutional block, you'll find the following attributes: batch_normalize, activation, filters, pad, size, and stride. For more details on the attributes of convolutional blocks, you can open the file yolov3.cfg.

Next, we verify whether the stride is greater than 1. If it is, downsampling is performed, so we need to adjust the padding: we pad only the top and left edges with ZeroPadding2D and then use 'valid' padding in the convolution.
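
If you want to convince yourself that this top-left padding followed by a 'valid' convolution halves the spatial size exactly, here is a quick standalone sketch (illustration only, not part of yolov3.py):

from tensorflow.keras.layers import Conv2D, Input, ZeroPadding2D

x = Input(shape=(416, 416, 3))
x_pad = ZeroPadding2D(((1, 0), (1, 0)))(x)             # pad top and left only: 416 -> 417
y = Conv2D(32, 3, strides=2, padding='valid')(x_pad)   # (417 - 3) // 2 + 1 = 208
print(y.shape)                                         # (None, 208, 208, 32)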

Finally, if we find batch_normalize in the block, we add a BatchNormalization layer followed by a LeakyReLU layer; otherwise, we add nothing.

Upsample Layer

Now, we continue the if..else chain above and check for the upsample layer. The upsample layer upsamples the previous feature map by a factor of stride; we implement it with Keras's UpSampling2D layer, which defaults to nearest-neighbor interpolation. So, if we find an upsample block, we retrieve the stride value and add an UpSampling2D layer with that value.

The following is the code for that.

        elif (block["type"] == "upsample"):
            stride = int(block["stride"])
            inputs = UpSampling2D(stride)(inputs)

Route Layer

The route block contains an attribute layers, which holds one or two values. For more details, please look at lines 619-634 of the file yolov3.cfg. There, you will find the following lines.

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 61

I’ll explain a little bit about the above lines of yolov3.cfg.

In line 620 above, the attribute layers holds a value of -4, which means that when we reach this route block, we need to go back 4 layers and output the feature map from that layer. However, for a route block whose layers attribute has 2 values, as in lines 633-634 where layers contains -1 and 61, we need to concatenate the feature map from the previous layer (-1) with the feature map from layer 61. So, the following is the code for the route layer.

        # If it is a route layer
        elif (block["type"] == "route"):
            block["layers"] = block["layers"].split(',')
            start = int(block["layers"][0])

            if len(block["layers"]) > 1:
                end = int(block["layers"][1]) - i
                filters = output_filters[i + start] + output_filters[end]  # end is negative, so this equals output_filters[i + end]
                inputs = tf.concat([outputs[i + start], outputs[i + end]], axis=-1)
            else:
                filters = output_filters[i + start]
                inputs = outputs[i + start]
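
The index arithmetic here deserves a quick concrete example (the block index 86 below is hypothetical, just to illustrate how the offsets are resolved):

# Suppose we are at iteration i = 86 and the block reads "layers = -1, 61".
i = 86
layers = [-1, 61]
start = layers[0]        # -1
end = layers[1] - i      # 61 - 86 = -25
# outputs[i + start]  -> outputs[85]: the previous layer's feature map
# outputs[i + end]    -> outputs[61]: the feature map of absolute layer 61
# output_filters[end] == output_filters[i + end], because end is negative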

Shortcut Layer

In this layer, we perform a skip connection. If we look at the file yolov3.cfg, this block contains an attribute from, as shown below.

[shortcut]
from=-3
activation=linear

What we do in this block is go back 3 layers (-3), as indicated by the from value, take the feature map from that layer, and add it to the feature map from the previous layer. Here is the code for that.

        elif block["type"] == "shortcut":
            from_ = int(block["from"])
            inputs = outputs[i - 1] + outputs[i + from_]

Yolo Layer

Here, we perform our detection and do some refining of the bounding boxes. If you have any difficulty understanding this part, just check out my previous post (part-1 of this tutorial).

As we did for the other layers, we first check whether we're in a yolo layer.

        # Yolo detection layer
        elif block["type"] == "yolo":

If it is, we take the necessary attributes associated with it. In this case, we only need the mask and anchors attributes.

            mask = block["mask"].split(",")
            mask = [int(x) for x in mask]
            anchors = block["anchors"].split(",")
            anchors = [int(a) for a in anchors]
            anchors = [(anchors[i], anchors[i + 1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in mask]
            n_anchors = len(anchors)

Then we need to reshape the YOLOv3 output to the form [None, B * grid_size * grid_size, 5 + C], where B is the number of anchors per cell and C is the number of classes.

            out_shape = inputs.get_shape().as_list()

            inputs = tf.reshape(inputs, [-1, n_anchors * out_shape[1] * out_shape[2],
                                         5 + num_classes])
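
For example, assuming a 416×416 input (model_size = (416, 416, 3)) and the 80 COCO classes, the first detection scale sees a 13×13×255 feature map, since 255 = 3 × (5 + 80), so the reshape gives:

# Quick shape check (assumed 416x416 input, num_classes = 80):
n_anchors, grid, num_classes = 3, 13, 80
print([None, n_anchors * grid * grid, 5 + num_classes])   # [None, 507, 85]
# The 26x26 and 52x52 scales give [None, 2028, 85] and [None, 8112, 85] respectively.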

Then we access all the box attributes this way:

            box_centers = inputs[:, :, 0:2]
            box_shapes = inputs[:, :, 2:4]
            confidence = inputs[:, :, 4:5]
            classes = inputs[:, :, 5:num_classes + 5]

Refine Bounding Boxes

As I mentioned in part 1, after the YOLOv3 network outputs the bounding-box predictions, we need to refine them in order to have the right positions and shapes.

Use the sigmoid function to squash the box_centers, confidence, and classes values into the range 0–1.

            box_centers = tf.sigmoid(box_centers)
            confidence = tf.sigmoid(confidence)
            classes = tf.sigmoid(classes)

Then convert box_shapes as the following:

            anchors = tf.tile(anchors, [out_shape[1] * out_shape[2], 1])
            box_shapes = tf.exp(box_shapes) * tf.cast(anchors, dtype=tf.float32)
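
To see the shapes involved, here is a quick hypothetical check for the 13×13 scale, whose three anchors in the standard cfg are (116, 90), (156, 198), and (373, 326):

import tensorflow as tf

anchors_13 = [(116, 90), (156, 198), (373, 326)]   # large-object anchors from the standard cfg
tiled = tf.tile(anchors_13, [13 * 13, 1])
print(tiled.shape)   # (507, 2): one (width, height) pair per predicted box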

Use a meshgrid to convert the box centers from offsets relative to their grid cells into absolute positions on the input image.

            x = tf.range(out_shape[1], dtype=tf.float32)
            y = tf.range(out_shape[2], dtype=tf.float32)

            cx, cy = tf.meshgrid(x, y)
            cx = tf.reshape(cx, (-1, 1))
            cy = tf.reshape(cy, (-1, 1))
            cxy = tf.concat([cx, cy], axis=-1)
            cxy = tf.tile(cxy, [1, n_anchors])
            cxy = tf.reshape(cxy, [1, -1, 2])

            strides = (input_image.shape[1] // out_shape[1], \
                       input_image.shape[2] // out_shape[2])
            box_centers = (box_centers + cxy) * strides
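
If the meshgrid gymnastics are hard to picture, here is a minimal standalone sketch using a tiny 2×2 grid and 3 anchors (illustration only):

import tensorflow as tf

grid, n_anchors = 2, 3
x = tf.range(grid, dtype=tf.float32)
y = tf.range(grid, dtype=tf.float32)
cx, cy = tf.meshgrid(x, y)
cx = tf.reshape(cx, (-1, 1))
cy = tf.reshape(cy, (-1, 1))
cxy = tf.concat([cx, cy], axis=-1)   # [[0,0], [1,0], [0,1], [1,1]]: one (cx, cy) offset per cell
cxy = tf.tile(cxy, [1, n_anchors])   # repeat each cell offset once per anchor
cxy = tf.reshape(cxy, [1, -1, 2])    # shape (1, grid*grid*n_anchors, 2) = (1, 12, 2)
print(cxy.numpy())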

Then, concatenate them all together.

            prediction = tf.concat([box_centers, box_shapes, confidence, classes], axis=-1)

Big note: just to remind you, YOLOv3 makes predictions at 3 different scales. We do the same here.

Take the prediction result for each scale and concatenate it with the others.

            if scale:
                out_pred = tf.concat([out_pred, prediction], axis=1)
            else:
                out_pred = prediction
                scale = 1

Since the route and shortcut layers need feature maps from previous layers, at every iteration we keep track of the output feature maps and the number of output filters.

        outputs[i] = inputs
        output_filters.append(filters)

Finally, we can return our model.

    model = Model(input_image, out_pred)
    model.summary()
    return model

The Complete Code of yolov3.py

#yolov3.py
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import BatchNormalization, Conv2D, \
    Input, ZeroPadding2D, LeakyReLU, UpSampling2D




def parse_cfg(cfgfile):
    with open(cfgfile, 'r') as file:
        lines = [line.rstrip('\n') for line in file if line != '\n' and line[0] != '#']
    holder = {}
    blocks = []
    for line in lines:
        if line[0] == '[':
            line = 'type=' + line[1:-1].rstrip()
            if len(holder) != 0:
                blocks.append(holder)
                holder = {}
        key, value = line.split("=")
        holder[key.rstrip()] = value.lstrip()
    blocks.append(holder)
    return blocks


def YOLOv3Net(cfgfile, model_size, num_classes):

    blocks = parse_cfg(cfgfile)

    outputs = {}
    output_filters = []
    filters = []
    out_pred = []
    scale = 0

    inputs = input_image = Input(shape=model_size)
    inputs = inputs / 255.0

    for i, block in enumerate(blocks[1:]):
        # If it is a convolutional layer
        if (block["type"] == "convolutional"):

            activation = block["activation"]
            filters = int(block["filters"])
            kernel_size = int(block["size"])
            strides = int(block["stride"])

            if strides > 1:
                inputs = ZeroPadding2D(((1, 0), (1, 0)))(inputs)

            inputs = Conv2D(filters,
                            kernel_size,
                            strides=strides,
                            padding='valid' if strides > 1 else 'same',
                            name='conv_' + str(i),
                            use_bias=False if ("batch_normalize" in block) else True)(inputs)

            if "batch_normalize" in block:
                inputs = BatchNormalization(name='bnorm_' + str(i))(inputs)
                inputs = LeakyReLU(alpha=0.1, name='leaky_' + str(i))(inputs)

        elif (block["type"] == "upsample"):
            stride = int(block["stride"])
            inputs = UpSampling2D(stride)(inputs)

        # If it is a route layer
        elif (block["type"] == "route"):
            block["layers"] = block["layers"].split(',')
            start = int(block["layers"][0])

            if len(block["layers"]) > 1:
                end = int(block["layers"][1]) - i
                filters = output_filters[i + start] + output_filters[end]  # end is negative, so this equals output_filters[i + end]
                inputs = tf.concat([outputs[i + start], outputs[i + end]], axis=-1)
            else:
                filters = output_filters[i + start]
                inputs = outputs[i + start]

        elif block["type"] == "shortcut":
            from_ = int(block["from"])
            inputs = outputs[i - 1] + outputs[i + from_]

        # Yolo detection layer
        elif block["type"] == "yolo":

            mask = block["mask"].split(",")
            mask = [int(x) for x in mask]
            anchors = block["anchors"].split(",")
            anchors = [int(a) for a in anchors]
            anchors = [(anchors[i], anchors[i + 1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in mask]
            n_anchors = len(anchors)

            out_shape = inputs.get_shape().as_list()

            inputs = tf.reshape(inputs, [-1, n_anchors * out_shape[1] * out_shape[2],
                                         5 + num_classes])

            box_centers = inputs[:, :, 0:2]
            box_shapes = inputs[:, :, 2:4]
            confidence = inputs[:, :, 4:5]
            classes = inputs[:, :, 5:num_classes + 5]

            box_centers = tf.sigmoid(box_centers)
            confidence = tf.sigmoid(confidence)
            classes = tf.sigmoid(classes)

            anchors = tf.tile(anchors, [out_shape[1] * out_shape[2], 1])
            box_shapes = tf.exp(box_shapes) * tf.cast(anchors, dtype=tf.float32)

            x = tf.range(out_shape[1], dtype=tf.float32)
            y = tf.range(out_shape[2], dtype=tf.float32)

            cx, cy = tf.meshgrid(x, y)
            cx = tf.reshape(cx, (-1, 1))
            cy = tf.reshape(cy, (-1, 1))
            cxy = tf.concat([cx, cy], axis=-1)
            cxy = tf.tile(cxy, [1, n_anchors])
            cxy = tf.reshape(cxy, [1, -1, 2])

            strides = (input_image.shape[1] // out_shape[1], \
                       input_image.shape[2] // out_shape[2])
            box_centers = (box_centers + cxy) * strides

            prediction = tf.concat([box_centers, box_shapes, confidence, classes], axis=-1)

            if scale:
                out_pred = tf.concat([out_pred, prediction], axis=1)
            else:
                out_pred = prediction
                scale = 1

        outputs[i] = inputs
        output_filters.append(filters)

    model = Model(input_image, out_pred)
    model.summary()
    return model
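
To make sure everything is wired up correctly, you can already try building the model. Here is a minimal sanity check, assuming you run it from the YOLOv3_TF2 directory with yolov3.cfg already saved in the cfg subdirectory:

#example usage (sketch)
from yolov3 import YOLOv3Net

model = YOLOv3Net('cfg/yolov3.cfg', model_size=(416, 416, 3), num_classes=80)
# YOLOv3Net() already prints model.summary(); with a 416x416 input the final output
# tensor has shape (None, 10647, 85), i.e. 507 + 2028 + 8112 boxes across the 3 scales.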

That’s it for part 2 and see you in part 3.

I have another tutorial that I highly recommend reading. It provides detailed instructions on how to load and visualize the COCO dataset using custom code.

Credit link:
https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/
