ML Reproducibility Challenge 2021 - Does Self-Supervision Always Improve Few-Shot Learning?

Open Source Machine Learning Feb 21, 2022

This blog narrates the experience and journey of my team working on the ML Reproducibility Challenge 2021.

Links: Paper, DAGsHub codebase, DAGsHub reproducibility report.

Background and Choice of Paper

This section explains the motivation behind the questions tackled by the original paper and describes why we chose to reproduce this specific paper.

My background in machine learning is more on the applied research side, specifically in Computer Vision. In practice, that meant I was very familiar with training deep nets on large annotated visual datasets to solve specific, well-defined vision problems such as depth estimation and object detection.

In recent times, the need for dynamic machine learning systems has come up a lot. This is in contrast to how previous ML models typically worked:

  • Require a lot of data to train once.
  • Cannot learn from more data once trained.
  • Cannot easily adapt to distribution shifts.

Specifically, the need to learn from less data strikes a chord with many real-world applications where the available data is insufficient for current-day deep nets. For example, in the initial stages of the COVID-19 pandemic, if we had wanted to train a machine learning classifier to detect COVID-related anomalies in chest X-ray images, we would have needed:

  1. A huge number of positive and negative images
  2. Annotations for each of these images, corresponding to whatever we would like the model to predict

This would have been impossible: the disease was new, never seen before, and the need for our brilliant machines to help us was immediate. Machine learning fails here because neither of these requirements can be met.

Few-shot learning (FSL) is an area that aims to develop ML algorithms that can learn new concepts from very few data points. The idea behind every few-shot learning algorithm is to leverage data from other classes (base data, e.g., a large dataset such as ImageNet) and learn how to effectively transfer knowledge from those classes when learning a new concept from only a few examples.
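To make the setup concrete, here is a minimal sketch (not from the paper's codebase) of how a standard N-way, K-shot episode is sampled: a handful of classes are drawn, the learner is given K labelled "support" images per class, and it is evaluated on held-out "query" images from the same classes. The `dataset` argument here is a hypothetical mapping from class names to lists of image tensors.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way, K-shot few-shot episode.

    dataset: hypothetical dict mapping class name -> list of images.
    Returns (support, query) lists of (image, episode_label) pairs.
    """
    classes = random.sample(list(dataset.keys()), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = random.sample(dataset[cls], k_shot + n_query)
        # The first k_shot images are the labelled examples the learner sees...
        support += [(img, label) for img in images[:k_shot]]
        # ...the rest are queries used to measure how well it generalizes.
        query += [(img, label) for img in images[k_shot:]]
    return support, query
```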

Our first reason for choosing this problem is the practicality of learning with less data.

Next, a family of algorithms that aims to learn strong representations with no annotations has risen in popularity recently, under the name Self-Supervised Learning (SSL). The core idea is to encode invariances we already know about into the training objective, so that the model learns useful representations that encode images in a feature space. Examples include predicting the degree of rotation applied to an image, or maximizing the similarity between an image and an augmented version of it.
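As an illustration of the rotation pretext task mentioned above, here is a minimal PyTorch-style sketch (not taken from the authors' code): each image is rotated by a random multiple of 90 degrees, and an auxiliary head is trained to predict which rotation was applied, with no annotations needed. The `model` backbone and `rotation_head` are assumed components.

```python
import torch
import torch.nn.functional as F

def rotation_ssl_loss(model, rotation_head, images):
    """Self-supervised rotation-prediction loss (a common SSL pretext task).

    model: backbone mapping images -> feature vectors (assumed).
    rotation_head: linear layer mapping features -> 4 rotation logits (assumed).
    images: batch of shape (B, C, H, W).
    """
    # Rotate each image by a random multiple of 90 degrees.
    targets = torch.randint(0, 4, (images.size(0),), device=images.device)
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, targets)]
    )
    # Predict which rotation was applied; the "label" comes for free.
    logits = rotation_head(model(rotated))
    return F.cross_entropy(logits, targets)
```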

A good question to ask would be: since SSL learns better representations anyway, can we simply use SSL to learn stronger (and hence more transferable) representations for FSL? This is the question the paper we picked tackled.

The second reason for choosing this paper was how it tackled a very simple question and investigated why it should, or should not, solve our problem.

Starting the Challenge

This section describes the plan that we had in mind when we started the challenge.

We worked on our reproducibility starting May 27th, 2021, until September 29th, 2021, spanning four months in total.

Our plan was to first read the paper and understand its experiments in detail. However, it is always the case that once you look at the code, it’s easier to understand the paper. So, we first read the code top-to-bottom and correlated the experiments reported in the paper to various files/functions in the code, and made sure that we understood the algorithms well, both in terms of theory and implementation.

Our plan then was to set up a GPU system to run our experiments, customize the codebase, reproduce each experiment, correct errors in the code if necessary, and finally build experiments on top of the paper and write/run code for them, before starting to write up our results.

Thanks to the grants we received from DAGsHub, we had enough compute to set up our GPU system in the cloud and run all our experiments successfully.

Furthermore, we were in close communication with the first authors throughout and received a lot of help regarding implementation and experiment details. We thank the authors for their help with the reproducibility of their paper.

Coming to Datasets

As with any ML experiment, we needed benchmark datasets to run our experiments on, for easy comparison with other work.

The paper had experimented with 6 different datasets, and, ambitiously, we planned to do the same, since each of them was diverse and had its own interesting characteristics. Further, more datasets are always better when it comes to results.

We expected that all of these would be easily downloadable and directly usable by the code. However, we found that 3 of the 6 datasets had no versions that could be used directly. One example is the mini-ImageNet dataset: no generic public source was available, and all available sources had already preprocessed the data in a way that made it unsuitable for us. So we had to download the entire ImageNet dataset (~150 GB) and then filter it down to mini-ImageNet. Similarly, for two other datasets, we had to go through an excruciatingly difficult pipeline.
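For reference, the filtering itself is conceptually simple once the full ImageNet download is in place. The sketch below assumes a hypothetical CSV (`mini_imagenet_split.csv`) listing the file names, WordNet class IDs, and split of the standard mini-ImageNet images, and simply copies the matching files out of the ImageNet directory; our actual pipeline involved more than this (e.g., resizing).

```python
import csv
import shutil
from pathlib import Path

IMAGENET_DIR = Path("imagenet/train")   # full ImageNet download (assumed layout)
OUTPUT_DIR = Path("mini-imagenet")      # where the filtered subset goes
SPLIT_CSV = "mini_imagenet_split.csv"   # hypothetical: filename,wnid,split rows

with open(SPLIT_CSV) as f:
    for row in csv.DictReader(f):
        src = IMAGENET_DIR / row["wnid"] / row["filename"]
        dst = OUTPUT_DIR / row["split"] / row["wnid"] / row["filename"]
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dst)  # keep only the images belonging to mini-ImageNet
```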

So, we developed usable versions of the 3 datasets for our code, which took up a good amount of time. Furthermore, to save other practitioners from the same process, we published the datasets online for anyone to download without hassle.

Code Correction and Implementing Algorithms From Scratch

Now that we had usable versions of the datasets, we began working with the authors' published codebase. There came our next hurdle: we found plenty of code errors, and, what's more, a core contribution of the paper (an algorithm to select data from different domains to use for training) was missing from the code. We had to implement that algorithm from scratch.

Moreover, trivially correcting the errors did not give us the same results as those reported in the paper. This was because the authors used a different set of hyperparameters for each result reported in the paper, but did not provide them in either the paper or its appendix.

So, we had to run a hyperparameter sweep for each setup that we worked on. In total, this took a whopping 980 GPU hours of compute. Finally, we were able to begin matching some of the results reported in the paper with those of our own experiments.
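For a sense of what such a sweep looks like, here is a minimal sketch of a grid search; the parameter names and ranges are purely illustrative, not the actual grid we used, and `train_and_eval` stands in for whatever training script a given setup requires.

```python
import itertools

# Illustrative grid; the real sweep depended on each setup.
grid = {
    "lr": [1e-2, 1e-3, 1e-4],
    "ssl_loss_weight": [0.25, 0.5, 1.0],
    "weight_decay": [1e-4, 5e-4],
}

def run_sweep(train_and_eval):
    """train_and_eval is assumed to train one model and return validation accuracy."""
    results = {}
    for values in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        results[tuple(values)] = train_and_eval(**config)
    # Keep the configuration with the best validation accuracy.
    best = max(results, key=results.get)
    return dict(zip(grid.keys(), best)), results[best]
```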

Coming to Useful Conclusions

The conclusions of our original paper were as follows:

  1. SSL improves FSL when used in the pretraining step, across various domains.
  2. The benefits of self-supervision increase with the difficulty of the task, for example when training with a base dataset that has less labelled data, or with images of lower quality/resolution.
  3. Additional unlabelled data from dissimilar domains, when used for self-supervision, negatively impacts the performance of few-shot learners.
  4. The proposed (for us, reimplemented) domain selection algorithm can alleviate this issue by learning to pick images from a large and generic pool of images.

Finally, after lots of runs and sweeps, we found that the conclusions of the paper held true, as we verified by reproducing the experiments that the paper reported results on and drew its inferences from.

Beyond the Original Paper

However, we did not stop at that point: throughout the course of experimenting, we had developed intuitions for additional experiments that we wanted to run.

To start with, we wanted to check whether the resolution of the images made a difference, relying on the assumption that SSL needs a reasonably large image size to work well. Next, we wanted to check whether SSL really improved the performance of FSL algorithms when the testing domain differs from the training domain. This cross-domain setting, as a lot of recent papers have argued, is more practical than standard FSL, because it assumes we have no data from the target domain at all (not even a large set of base images) to pretrain our model on. Further, it tests the classifier's ability to learn new domains quickly from few examples.
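As a concrete picture of cross-domain evaluation, here is a sketch under our own assumptions (not the paper's exact protocol): freeze a backbone pretrained on the base domain (e.g., ImageNet), sample episodes from a different target-domain dataset, and classify queries by their nearest class centroid in feature space.

```python
import torch

@torch.no_grad()
def cross_domain_episode_accuracy(backbone, support, query):
    """Nearest-centroid evaluation of a frozen backbone on one target-domain episode.

    backbone: frozen feature extractor pretrained on the base domain (assumed).
    support/query: lists of (image, label) pairs, e.g. from sample_episode above.
    """
    def embed(pairs):
        images = torch.stack([img for img, _ in pairs])
        labels = torch.tensor([lbl for _, lbl in pairs])
        return backbone(images), labels

    s_feats, s_labels = embed(support)
    q_feats, q_labels = embed(query)
    # One centroid (prototype) per class, computed from the support features.
    centroids = torch.stack([s_feats[s_labels == c].mean(0) for c in s_labels.unique()])
    # Each query is assigned to the class of its closest centroid.
    preds = torch.cdist(q_feats, centroids).argmin(dim=1)
    return (preds == q_labels).float().mean().item()
```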

Repeating the process for our new experiments, we demonstrate in our work that the gains do not hold when the input image size and architecture differ from those reported in the paper. While this directly contradicts the paper's results, it is a special case that the authors either missed or chose not to report.

Next, on the more practical cross-domain few-shot learning setup, we find that self-supervision does not help ImageNet-trained few-shot learners generalize to new domains better. This is an extension to the paper that showcases the limitations of SSL for FSL, something that the authors did not experiment on.

Wrapping Up

We conclude our work by encouraging future papers on FSL to also experiment with other image resolutions and with the cross-domain setup, given their equal practical importance to the main problem.

Overall, we enjoyed the process, although it was difficult at times. We were happy to finally reproduce the paper after a lot of roadblocks, but we kept digging further until our hunches paid off and exposed a limitation of the reported approach.

We encourage everyone to pick an interesting paper of their choice and persevere until they successfully reproduce the paper (and possibly, extend its results). I hope this blog serves as a useful resource. Feel free to reach out to me for any related help! :)
