I previously wrote a short introduction to Artificial Intelligence. I promised I would keep writing about the subject. One of the goals of this series of blog posts for me personally is to practice explaining complicated concepts simply. Here, I'll very briefly discuss computer vision and the process of training and using Convolutional Neural Networks.

The last section of this post contains a quick, step-by-step guide to implementing a very simple object recognition application using the OpenCV toolkit for Python, in case you would like to try this out for yourself.

What is Computer Vision?


Computer Vision is a subset of AI. It's a set of tools which together can be used to train computers to interpret images. Since video is just a sequence of images, we can also use this technology to extract data from video. Artificial Neural Networks can be trained to detect and classify objects in images with increasing accuracy.
This field has been growing tremendously in recent years, alongside its various applications. These techniques are commonly used in the development of self-driving cars, which need to be aware of their surroundings: a self-driving car must see the outline of the road, stop signs, pedestrians, and so on. This environmental awareness allows cars to learn to react to their environments.

Image: self-driving issues (source: https://xkcd.com/1958/)

Convolutional and Residual Neural Networks

In my previous post I briefly discussed two types of simple neural networks: the perceptron and the multi-layer perceptron, aka MLP. To briefly recap, neural networks are composed of multiple neurons with learnable "weights"; each neuron receives input and computes an output. Here, I'll explain another type of neural network, designed to interpret images more efficiently by mimicking the visual cortex, the part of the human brain that processes light passing through our eyes. These are called Convolutional Neural Networks, or CNNs (not the news network).


A convolutional neural network differs slightly from an MLP. In an MLP, each layer is one-dimensional: a linear arrangement of neurons. CNN layers are three-dimensional. The three dimensions of the input layer can represent image data: the first is width, the second height, and the third color (RGB values).
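To make that concrete, here's a tiny NumPy sketch (a synthetic 4x4 image, not a real photo) showing how an RGB input is just a three-dimensional array:

```python
import numpy as np

# A synthetic 4x4 RGB "image": height x width x color channels
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :, 0] = 255  # fill the red channel -> a solid red image

print(img.shape)  # (4, 4, 3)
print(img[0, 0])  # [255   0   0] -> one red pixel
```

This is exactly the kind of shape a CNN's input layer expects; ResNet50, which we use later, takes 224x224x3 inputs.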

Image: a typical CNN architecture (source: MathWorks)

After the input layer, tens or even hundreds of learning layers process different, small parts of the input. Each layer learns to detect different features of an image. There are a few different types of feature-learning layers, such as ReLU layers (rectifiers) and pooling layers. In the diagram above, after passing through each learning layer, the output is flattened into a single dimension. From there, a more traditional fully connected, dense network (resembling the previously mentioned MLP) performs the classification.
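As a rough sketch of what those two layer types compute (plain NumPy, not how a real framework implements them): ReLU zeroes out negative activations, and 2x2 max pooling keeps the largest value in each 2x2 region, halving the width and height.

```python
import numpy as np

def relu(x):
    # Rectifier: negative activations become zero
    return np.maximum(x, 0)

def max_pool_2x2(x):
    # Keep the maximum of each non-overlapping 2x2 block
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.array([[-1.0,  2.0,  0.5, -3.0],
              [ 4.0, -5.0,  1.0,  2.0],
              [-1.0,  0.0,  3.0, -2.0],
              [ 2.0,  1.0, -4.0,  0.5]])

print(relu(a))                # negatives replaced by zeros
print(max_pool_2x2(relu(a)))  # 2x2 result: [[4., 2.], [2., 3.]]
```

Stacking many such layers (with learned convolution filters in between) is what lets a CNN build up from edges to textures to whole objects.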

Notable Network Architectures

There are many variations of the standard CNN. A notable one is VGG16, from "Very Deep Convolutional Networks for Large-Scale Image Recognition". It was developed to explore the benefits of depth in a network, i.e. its number of neurons and layers. It turns out that the increased depth significantly improves accuracy, but there are limitations. One such limitation is the so-called vanishing/exploding gradient problem: vanishing gradients cause early layers to learn more slowly as additional layers are added to a CNN. Another type of network, the Residual Network (ResNet), was developed more recently. One of its motivations was to alleviate this problem by introducing so-called identity shortcuts, which allow gradients to skip one or more layers. The image below shows a residual block, the fundamental building block of a Residual Network. Check out the research paper for more details.

Residual Block
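A minimal way to see what the identity shortcut does, in a toy NumPy sketch where a made-up function f stands in for the block's convolutional layers: the block outputs relu(f(x) + x), so the input always flows through unchanged, no matter how small f's contribution is.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, f):
    # f(x) is the block's learned transformation (two conv layers in ResNet);
    # the identity shortcut adds the input x back before the final activation
    return relu(f(x) + x)

x = np.array([1.0, -2.0, 3.0])
f = lambda v: 0.1 * v        # stand-in for the learned layers
print(residual_block(x, f))  # roughly [1.1, 0.0, 3.3]
```

Because the shortcut is an identity, gradients can flow straight back through the addition, which is what mitigates the vanishing gradient problem in very deep networks.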

We will use a Residual Network, ResNet50, in our Python example at the end of this post. Another highly notable network is the Single Shot MultiBox Detector (SSD), which I aim to describe in detail in a later blog post. It's based on the VGG16 network, but contains additional layers used to detect and classify multiple objects at once.

Datasets for Object Detection and Recognition


ImageNet, as described by Krizhevsky et al. in their paper, is an image database organized according to the WordNet hierarchy. It has advanced the field of computer vision and object recognition significantly by providing an unprecedented number of images across a vast array of categories. The list below shows some general statistics about this dataset. Synsets are categories, including high-level categories and their subcategories. SIFT (Scale-Invariant Feature Transform) features are a kind of feature descriptor, obtained with an algorithm that can detect distinctive parts of objects within an image, for example by finding high-contrast keypoints.

  • Total number of non-empty synsets: 21,841
  • Total number of images: 14,197,122
  • Number of images with bounding box annotations: 1,034,908
  • Number of synsets with SIFT features: 1000
  • Number of images with SIFT features: 1.2 million

ImageNet Synsets

COCO - Common Objects In Context

Microsoft's COCO (Common Objects in Context) is a more recent, groundbreaking image database. It contains photos of 91 object categories with a total of 2.5 million labeled instances in 328k images. Compared to ImageNet, COCO contains fewer categories, but more labeled instances per category. Objects within images have semantic, pixel-level segmentation: each individual pixel of an object is labeled as belonging to a category. This allows the shape of an object to be extracted, as can be seen in the image below, a sample from the COCO database containing labels and silhouettes of objects. TensorFlow's Object Detection API provides a number of object detection models, all trained on the COCO dataset. See TensorFlow's Model Zoo for a complete list.

A Labeled Image from the COCO database

I highly recommend the research paper on the COCO Database.
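To illustrate what pixel-level segmentation gives you, here's a made-up 5x5 label map (not actual COCO data, which stores masks as polygons or run-length encodings): since every pixel carries a category label, an object's silhouette and bounding box fall out directly.

```python
import numpy as np

# A tiny 5x5 "segmentation map": 0 = background, 1 = some object category
seg = np.array([[0, 0, 0, 0, 0],
                [0, 1, 1, 0, 0],
                [0, 1, 1, 1, 0],
                [0, 0, 1, 1, 0],
                [0, 0, 0, 0, 0]])

mask = (seg == 1)        # boolean silhouette of the object
print(mask.sum())        # 7 pixels belong to the object
ys, xs = np.where(mask)  # the object's pixel coordinates
# Its bounding box follows from the extreme coordinates:
print(ys.min(), xs.min(), ys.max(), xs.max())  # 1 1 3 3
```

A bounding-box-only dataset, by contrast, would only give you the last line; the per-pixel mask is what makes shape extraction possible.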

Let's build a real-time object recognizer in Python

We're gonna need some tools.


OpenCV is a computer vision library. It provides interfaces for various programming languages and supports all major operating systems. It was designed with computational efficiency in mind, with a strong focus on real-time applications.


Keras will greatly simplify the process of working with neural networks. It's a high-level neural networks API, written in Python and capable of working with TensorFlow, Theano and other machine learning frameworks.

Pre-Trained Models

Instead of designing our own neural network from scratch and gathering terabytes of images to train it, we will use a pre-trained model. Several pre-trained models are readily available, and can be loaded directly from Keras. The one we'll be using is called ResNet50.

If you'd like to go all sciency and learn the details of ResNet50, here's the original research paper on arxiv.org

The nine Deep Learning papers you need to know about

A simple python implementation

Let's first classify a single object in a single image. We'll need to import the previously mentioned tools.

#keras tools
from keras.preprocessing import image
import keras.applications.resnet50 as res
import cv2
import numpy as np

Then, we can load a pre-trained ResNet50 from Keras and read our image file twice: once with a Keras helper utility (for the classifier), and once with OpenCV, so we can render a classification label on top of it.

#load the pre-trained model
model = res.ResNet50(weights="imagenet")
file = "image.jpg"  # path to the image you want to classify
orig = cv2.imread(file)

Next, convert the image into a numerical array and preprocess it so that it can be used with our ResNet50 classifier.

print("[INFO] loading and preprocessing image...")
img = image.load_img(file, target_size=(224, 224))
img = image.img_to_array(img)
#process image
img = np.expand_dims(img, axis=0)
img = res.preprocess_input(img)

Now we classify the image using our pre-trained model, and extract the ID, label, and confidence ("accuracy") of its top prediction.

#Classify the image
print("[INFO] classifying image...")
preds = model.predict(img)
(inID, label, accuracy) = res.decode_predictions(preds)[0][0]

Next, we'll display our predicted label and the accuracy on top of the image using the OpenCV tools.

# Display the prediction with OpenCV
print("ImageNet Label: %s - Accuracy: %.2f" % (label, accuracy))
cv2.putText(orig, "Label: {}".format(label), (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
cv2.imshow("Classification", orig)
cv2.waitKey(0)  # keep the window open until a key is pressed

Adjusting our code to work with real-time video

We should be a bit clever about resource usage, so we'll split our code into two threads that run simultaneously. One will be used for image classification, while the other handles rendering the video feed and the prediction overlay.

Here's the full source code, adjusted for real time video:

from keras.applications import resnet50
import cv2
import numpy as np
import threading

label = ''
score = 0.0
frame = None

class ClassifierThread(threading.Thread):
	def __init__(self):
		threading.Thread.__init__(self)
		self.daemon = True  # don't keep the process alive on exit

	def run(self):
		global label
		global score
		# Load the ResNet50 model
		print("[INFO] loading network...")
		self.model = resnet50.ResNet50(weights="imagenet")
		# Keep classifying the latest frame until the main loop clears it
		while frame is not None:
			(inID, label, score) = self.predict(frame)[0]

	def predict(self, frame):
		# OpenCV delivers BGR; the ResNet50 preprocessing expects RGB
		img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype(np.float32)
		img = img.reshape((1,) + img.shape)
		img = resnet50.preprocess_input(img)
		preds = self.model.predict(img)
		return resnet50.decode_predictions(preds)[0]

cap = cv2.VideoCapture(0)
if cap.isOpened():
	print("Camera OK")

# Grab a first frame before starting the classifier thread,
# so its loop condition (frame is not None) holds
ret, original = cap.read()
frame = cv2.resize(original, (224, 224))

keras_thread = ClassifierThread()
keras_thread.start()

while True:
	ret, original = cap.read()
	frame = cv2.resize(original, (224, 224))
	cv2.putText(original, "Label: %s - Score: %.2f" % (label, score), (10, 30), cv2.FONT_HERSHEY_PLAIN, 0.9, (0, 255, 0), 2)
	cv2.imshow("Classification", original)
	if cv2.waitKey(1) & 0xFF == ord('q'):
		break

# Setting frame to None lets the classifier thread exit its loop
frame = None
cap.release()
cv2.destroyAllWindows()

Next steps

This blog post barely scratches the surface of the field of computer vision and object recognition. I'm currently researching this subject at Leiden University. I hope to write a follow-up soon, discussing multi-object detection with bounding boxes and how to build such a detector with TensorFlow and a Single Shot MultiBox Detector.


Links, references and resources