Building a generally usable framework for computer vision competitions

Machine Learning Engineer

Hi, my name is Qishen Ha. I work at LINE Corp. as a Machine Learning Engineer and am currently the 11th-ranked Kaggle Grandmaster in the world. I mainly work on computer vision problems such as image classification, semantic segmentation, and object detection.

(My Kaggle homepage)

I am very honored to be a Z by HP & NVIDIA data science ambassador, and I am very grateful to Z by HP & NVIDIA for giving me this opportunity and providing me with the Z8G4 workstation and ZBook Studio. This has increased my competitiveness in Kaggle competitions.

Today I want to talk about the code framework I use for competitions, in particular computer vision competitions.

My Public Notebooks

Over the past year I have made a number of CNN training notebooks publicly available, either as a baseline model or as a minimal version of a top solution (because the full version is too large in terms of both code and training).

If you have read any of these notebooks, you will have seen that although they use different data and train for different tasks, they all share the same basic framework. This is what I call a generic framework. With such a framework, when we encounter a new competition we can train a new baseline model in the shortest possible time, and the model is also very easy to improve or maintain afterwards.

Next I will summarise the framework I used in these notebooks and introduce its parts in turn. This is of course the framework I am used to, and I think it is very easy to use. If you already have a framework that you are used to, there is no need to copy mine exactly; just take the ideas from it.

Introduction to my framework

My framework consists of several generic basic modules:

● Dataset
● Augmentation
● Model
● Loss Function

I will introduce them one by one next.


Dataset

Dataset defines how we read the data, how we pre-process the data, how we read the labels and how we deal with them. The picture below shows one of the most basic code structures of a dataset.


This is a simple image classification task. We use cv2 to read the image into memory, then augment and pre-process it, and finally return the processed image and label.

This is a very generic code style and requires only very minor modifications when we need to adapt it to the image segmentation task. The following picture shows the most important modifications.

Lastly, we adjust the data type and dimensions of the mask and return it in place of row.label.

In this way, we can easily modify the Dataset, read the data we want, pre-process it as we wish and so on.



Augmentation

You may have noticed that the Dataset has a parameter called transforms, which contains the augmentation methods that we will use. These methods are defined in the Augmentation section.

The picture below shows a simple definition of augmentations. In training we use horizontal flip and resize, while in validation we only use resize.



If we want to use more augmentation methods, we can simply append them to the list, as shown below.


In this way, we have added random rotation and blur to the training process.



Model

In this subsection we need to define the structure of the model. Let's again take the simplest example, the model structure for the image classification task, as a reference.

Typically, in an image classification task, we create an ImageNet-pretrained model, such as EfficientNet, as a backbone, remove its original 1000-class linear layer (ImageNet is a 1000-class dataset), and add our own linear layer with n classes, as shown in the figure below.


If we wanted to add dropout before the final FC layer, we could simply write it like the following.
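A minimal sketch of this change, written so that it works with any backbone that returns pooled features of shape (batch, in_features). The class name, the `drop_rate` value, and the constructor arguments are hypothetical.

```python
import torch
import torch.nn as nn


class ClassifierWithDropout(nn.Module):
    def __init__(self, backbone, in_features, n_classes, drop_rate=0.5):
        super().__init__()
        self.backbone = backbone              # assumed to output (batch, in_features)
        self.dropout = nn.Dropout(drop_rate)  # the only new layer
        self.fc = nn.Linear(in_features, n_classes)

    def forward(self, x):
        features = self.backbone(x)
        # Dropout is applied to the features right before the final FC layer.
        return self.fc(self.dropout(features))


# Usage with a tiny stand-in backbone (global average pooling over a 3-channel input)
toy_backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = ClassifierWithDropout(toy_backbone, in_features=3, n_classes=5)
logits = model(torch.randn(2, 3, 32, 32))
```

Dropout is only active in training mode; calling `model.eval()` disables it for validation and inference.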

Another common scenario is that the input image may not be RGB 3 channels, but 4 channels or more. In this case we can change the input of the first convolution layer of the backbone to what we want, as in the image below.

Here n_ch is the number of input channels. Written this way, we not only change the number of input channels to what we want, but also keep using the ImageNet pretrained weights for the first conv layer.
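One possible way to do this is sketched below as a standalone helper. The helper name and the weight-copying strategy (repeating the pretrained RGB filters cyclically across the new channels) are assumptions; the original notebooks may initialise the extra channels differently. It assumes a plain `nn.Conv2d` with `groups=1`.

```python
import torch
import torch.nn as nn


def expand_first_conv(conv, n_ch):
    """Return a copy of `conv` that accepts n_ch input channels,
    initialised from the existing (pretrained) weights."""
    new_conv = nn.Conv2d(
        n_ch,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        # Repeat the pretrained filters along the channel axis, then cut to n_ch.
        repeats = -(-n_ch // conv.in_channels)  # ceiling division
        new_conv.weight.copy_(conv.weight.repeat(1, repeats, 1, 1)[:, :n_ch])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


# Usage: a 3-channel stem conv turned into a 4-channel one
stem = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False)
stem_4ch = expand_first_conv(stem, n_ch=4)
```

With a timm EfficientNet backbone, the first convolution is assumed to be the `conv_stem` attribute, so the replacement would look like `backbone.conv_stem = expand_first_conv(backbone.conv_stem, n_ch)`.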


Loss Function

The easiest way to define a Loss Function is as follows. This is also the most common way.
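For a single-label classification task, the simple form would typically be a one-line definition like the sketch below. Cross entropy is an assumption here, chosen because it is the usual default for such tasks; the dummy tensors are for illustration only.

```python
import torch
import torch.nn as nn

# The loss function is defined once and reused in the training loop.
criterion = nn.CrossEntropyLoss()

# Usage: logits of shape (batch, n_classes), targets as class indices
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
loss = criterion(logits, targets)
```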





However, it is also very easy to change it into something more complex, such as the following.



In this loss function, we use cross entropy loss for the first four outputs and BCE loss for the others, and add loss weights to balance the two losses. This makes the logic more complex but does not require much code change. We used this loss to win first place in the RANZCR competition.
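A loss of this shape might be sketched as follows. This is an illustration under stated assumptions, not the exact competition code: the first four target columns are assumed to be one-hot (one mutually exclusive group), the remaining columns independent binary labels, and the weight names and default values are hypothetical.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
bce = nn.BCEWithLogitsLoss()


def criterion(logits, targets, ce_weight=1.0, bce_weight=1.0):
    # First four outputs: mutually exclusive classes, so use cross entropy
    # against the index of the active class (targets assumed one-hot here).
    ce_loss = ce(logits[:, :4], targets[:, :4].argmax(dim=1))

    # Remaining outputs: independent binary labels, so use BCE on the logits.
    bce_loss = bce(logits[:, 4:], targets[:, 4:].float())

    # Weighted average to balance the two terms.
    return (ce_weight * ce_loss + bce_weight * bce_loss) / (ce_weight + bce_weight)


# Usage: 6 outputs per sample (4 exclusive classes + 2 binary labels)
logits = torch.zeros(2, 6)
targets = torch.tensor([[1, 0, 0, 0, 1, 0],
                        [0, 1, 0, 0, 0, 1]])
loss = criterion(logits, targets)
```

The training loop still calls `criterion(logits, targets)` exactly as before, which is why the change needs so little code.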



These are the four basic modules of my framework, all of which are designed to be very easy to extend. Combined they form the framework that I use. When using this framework for experiments, I keep a notebook for each experiment, which is useful for analysing the results and reproducing them.

For more information, you can go to my Kaggle homepage (https://www.kaggle.com/haqishen) and find the notebooks I've shared. I'm sure you'll find more useful information in them.