Alex Kendall is a Research Fellow in Computer Vision and Robotics at the University of Cambridge. His research has been used to control self-driving cars, power smart-city infrastructure and enable next-generation drone flight.
Today I can share my final PhD thesis, which I submitted in November 2017. It was examined by Dr. Joan Lasenby and Prof. Andrew Zisserman in February 2018 and has just been approved for publication.
This thesis presents the main narrative of my research at the University of Cambridge, under the supervision of Prof Roberto Cipolla.
It contains 206 pages, 62 figures, 24 tables and 318 citations.
You can download the complete .pdf here.
My thesis presents contributions to the field of computer vision, the science which enables machines to see.
This blog post introduces the work and tells the story behind this research.
Deep learning and convolutional neural networks have become the dominant tool for computer vision. These techniques excel at learning complicated representations from data using supervised learning. In particular, image recognition models now outperform human baselines under constrained settings. However, the science of computer vision aims to build machines which can see. This requires models which can extract richer information from images and video than recognition alone. In general, applying these deep learning models from recognition to other problems in computer vision is significantly more challenging.
This thesis presents end-to-end deep learning architectures for a number of core computer vision problems: scene understanding, camera pose estimation, stereo vision and video semantic segmentation. Our models outperform traditional approaches and advance the state of the art on a number of challenging computer vision benchmarks. However, these end-to-end models are often not interpretable and require enormous quantities of training data.
To address this, we make two observations: (i) we do not need to learn everything from scratch, we know a lot about the physical world, and (ii) we cannot know everything from data, our models should be aware of what they do not know. This thesis explores these ideas using concepts from geometry and uncertainty. Specifically, we show how to improve end-to-end deep learning models by leveraging the underlying geometry of the problem. We explicitly model concepts such as epipolar geometry to learn with unsupervised learning, which improves performance. Secondly, we introduce ideas from probabilistic modelling and Bayesian deep learning to understand uncertainty in computer vision models. We show how to quantify different types of uncertainty, improving safety for real world applications.
The story
I began my PhD in October 2014, joining the controls research group at Cambridge University Engineering Department.
Looking back at my original research proposal, I said that I wanted to work on the ‘engineering questions to control autonomous vehicles… in uncertain and challenging environments.’
I spent three months or so reading literature, and quickly developed the opinion that the field of robotics was most limited by perception.
If you could obtain a reliable state of the world, control was often simple.
However, at this time, computer vision was very fragile in the wild.
After many weeks of lobbying Prof. Roberto Cipolla (thanks!), I was able to join his research group in January 2015 and begin a PhD in computer vision.
When I began reading computer vision literature, deep learning had just become popular in image classification, following inspiring breakthroughs on the ImageNet dataset.
But it was yet to become ubiquitous in the field and be used in richer computer vision tasks such as scene understanding.
What excited me about deep learning was that it could learn representations from data that are too complicated to hand-design.
I initially focused on building end-to-end deep learning models for computer vision tasks which I thought were most interesting for robotics, such as scene understanding (SegNet) and localisation (PoseNet).
However, I quickly realised that, while it was a start, applying end-to-end deep learning wasn’t enough.
In my thesis, I argue that we can do better than naive end-to-end convolutional networks.
Especially with limited data and compute, we can form more powerful computer vision models by leveraging our knowledge of the world.
Specifically, I focus on two ideas around geometry and uncertainty.
Geometry is all about leveraging the structure of the world. This is useful for improving architectures and learning with self-supervision.
Uncertainty is about understanding what our model doesn't know. This is useful for robust learning, safety-critical systems and active learning.
Over the last three years, I have had the pleasure of working with some incredibly talented researchers, studying a number of core computer vision problems from localisation to segmentation to stereo vision.
Bayesian deep learning for modelling uncertainty in semantic segmentation.
The science
This thesis consists of six chapters. Each of the main chapters introduces an end-to-end deep learning model and discusses how to apply the ideas of geometry and uncertainty.
Chapter 1 - Introduction. Motivates this work within the wider field of computer vision.
Chapter 2 - Scene Understanding. Introduces SegNet, modelling aleatoric and epistemic uncertainty and a method for learning multi-task scene understanding models for geometry and semantics.
Chapter 3 - Localisation. Describes PoseNet for efficient localisation, with improvements using geometric reprojection error and estimating relocalisation uncertainty.
Chapter 4 - Stereo Vision. Designs an end-to-end model for stereo vision, using geometry and shows how to leverage uncertainty and self-supervised learning to improve performance.
Chapter 5 - Video Scene Understanding. Illustrates a video scene understanding model for learning semantics, motion and geometry.
Chapter 6 - Conclusions. Describes limitations of this research and future challenges.
An overview of the models considered in this thesis.
As for what’s next?
This thesis explains how to extract a robust state of the world – semantics, motion and geometry – from video.
I’m now excited about applying these ideas to robotics and learning to reason from perception to action.
I’m working with an amazing team on autonomous driving, bringing together the worlds of robotics and machine learning.
We’re using ideas from computer vision and reinforcement learning to build the most data-efficient self-driving car.
And we're hiring: come work with me! wayve.ai/careers
I’d like to give a huge thank you to everyone who motivated, distracted and inspired me while writing this thesis.
Here’s the bibtex if you’d like to cite this work.
@phdthesis{kendall2018phd,
title={Geometry and Uncertainty in Deep Learning for Computer Vision},
author={Kendall, Alex},
year={2018},
school={University of Cambridge}
}
And the source code for the latex document is here.
2017 was an exciting year as we saw deep learning become the dominant paradigm for estimating geometry in computer vision.
Learning geometry has emerged as one of the most influential topics in computer vision over the last few years.
“Geometry is … concerned with questions of shape, size, relative position of figures and the properties of space” (wikipedia).
We first saw end-to-end deep learning models for these tasks trained with supervised learning, for example depth estimation (Eigen et al. 2014), relocalisation (PoseNet 2015), stereo vision (GC-Net 2017) and visual odometry (DeepVO 2017).
Deep learning excels at these applications for a few reasons.
Firstly, it is able to learn higher order features which reason over shapes and objects with larger context than point-based classical methods.
Secondly, it is very efficient for inference to simply run a forward pass of a convolutional neural network which approximates an exact geometric function.
Over the last year, I’ve noticed epipolar geometry and reprojection losses improving these models, allowing them to learn with unsupervised learning.
This means they can train without expensive labelled data by just observing the world.
Reprojection losses have contributed to a number of significant breakthroughs which now allow deep learning to outperform many traditional approaches to estimating geometry.
Specifically, photometric reprojection loss has emerged as the dominant technique for learning geometry with unsupervised (or self-supervised) learning.
We’ve seen this across a number of computer vision problems:
Monocular Depth: Reprojection loss for deep learning was first presented for monocular depth estimation by Garg et al. in 2016.
In 2017, Godard et al. showed how to formulate left-right consistency checks to improve results.
Stereo Depth: in my PhD thesis I show how to extend our stereo architecture, GC-Net, to learn stereo depth with epipolar geometry & unsupervised learning.
Localisation: I presented a paper at CVPR 2017 showing how to train relocalisation systems by learning to project 3D geometry from structure from motion models Kendall & Cipolla 2017.
Ego-motion: learning depth and ego-motion with reprojection loss now outperforms traditional methods like ORB-SLAM over short sequences under constrained settings (Zhou et al. 2017) and (Li et al. 2017).
Multi-View Stereo: projection losses can also be used in a supervised setting to learn structure from motion, for example DeMoN and SfM-Net.
In this blog post I’d like to highlight the importance of epipolar geometry and how we can use it to learn representations of geometry with deep learning.
An example of state of the art monocular depth estimation with unsupervised learning using reprojection geometry (Godard et al. 2017)
What is reprojection loss?
The core idea behind reprojection losses is using epipolar geometry to relate corresponding points in multi-view stereo imagery.
To dissect this jargon-filled sentence: epipolar geometry relates the projection of 3D points in space to 2D images.
This can be thought of as triangulation (see the figure below).
The relation between two 2D images is defined by the Fundamental matrix.
If we choose a point on one image and know the fundamental matrix, then this geometry tells us that the same point must lie on a line in the second image, called the epipolar line (the red line in the figure below).
The exact point of the correspondence on the epipolar line is defined by the 3D point’s depth in the scene.
If these two images are from a rectified stereo camera then this is a special type of multi-view geometry, and the epipolar line is horizontal.
We then refer to the corresponding point’s position on the epipolar line as disparity.
Disparity is inversely proportional to metric depth.
The standard reference for this topic is the textbook, “Multiple View Geometry in Computer Vision” Hartley and Zisserman, 2004.
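To make the rectified-stereo case concrete, here is a small numpy sketch. The matrix form and all numeric values below are illustrative assumptions (KITTI-like numbers), not values from the thesis: the fundamental matrix of a rectified pair maps every point to a horizontal epipolar line, and the disparity along that line converts to metric depth.

```python
import numpy as np

# For a rectified stereo pair with pure horizontal translation, the
# fundamental matrix reduces to this skew-symmetric form (see Hartley &
# Zisserman), which maps every point to a horizontal epipolar line.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])

x = np.array([3.0, 5.0, 1.0])   # homogeneous point in the first image
line = F @ x                    # epipolar line l' = F x in the second image
# line = [0, -1, 5], i.e. the horizontal line y = 5: the correspondence
# stays on the same image row, and its offset along the row is the disparity.

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Disparity is inversely proportional to metric depth:
    depth = focal_length * baseline / disparity."""
    return focal_px * baseline_m / disparity_px

# Assumed rig: ~720 px focal length, 0.54 m baseline.
depth = disparity_to_depth(36.0, 720.0, 0.54)   # 10.8 m
```

Note that halving the disparity doubles the estimated depth, which is why stereo depth estimates degrade quickly for distant objects.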
Epipolar geometry relates the same point in space seen by two cameras and can be used to learn 3D geometry from multi-view stereo (Image borrowed from Wikipedia).
One way of exploiting this is learning to match correspondences between stereo images along this epipolar line.
This allows us to estimate pixel-wise metric depth.
We can do this using photometric reprojection loss (Garg et al. in 2016).
The intuition behind reprojection loss is that pixels representing the same object in two different camera views look the same.
Therefore, if we relate pixels, or determine correspondences between two views, the pixels should have identical RGB pixel intensity values.
The better the estimate of geometry, the closer the photometric (RGB) pixel values will match.
We can optimise for values which provide matching pixel intensities between each image, known as minimising the photometric error.
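The idea above can be sketched in a few lines of numpy. This is a toy illustration of photometric reprojection loss, not the authors' implementation: a real model would predict per-pixel disparity and warp with differentiable bilinear sampling, whereas here we use a constant disparity and nearest-neighbour lookup.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Warp the right image into the left view. For a rectified pair,
    left pixel (x, y) corresponds to right pixel (x - d, y)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disparity            # source x-coordinates
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)  # nearest-neighbour lookup
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    return right[ys, xs]

def photometric_loss(left, right, disparity):
    """Mean absolute photometric error between the left image and the
    right image warped by the candidate disparity."""
    reconstructed = warp_right_to_left(right, disparity)
    return np.abs(left - reconstructed).mean()

# Toy scene: intensity ramps where the true disparity is 2 pixels.
left = np.tile(np.arange(8.0), (4, 1))
right = np.tile(np.arange(8.0) + 2.0, (4, 1))
```

The correct disparity produces a lower photometric error than a wrong one, which is exactly the training signal an unsupervised model optimises.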
An important property of these losses is that they are unsupervised.
This means that we can learn these geometric quantities by observing the world, without expensive human-labelled training data.
This is also known as self-supervised learning.
The list of papers at the start of this post further extend this idea to optical flow, depth, ego-motion, localisation etc. — all containing forms of epipolar geometry.
Does this mean learning geometry with deep learning is solved?
I think there are some short-comings to reprojection losses.
Firstly, photometric reprojection loss makes a photometric consistency assumption.
This means it assumes that the same surface has the same RGB pixel value between views.
This assumption is usually valid for stereo vision, because both images are taken at the same time.
However, this is not always the case for learning optical flow or multi-view stereo, because appearance and lighting changes over time.
This is because of occlusion, shadows and the dynamic nature of scenes.
Secondly, reprojection suffers from the aperture problem.
The aperture problem is the unavoidable ambiguity in estimating structure when observing through a limited field of view.
For example, if we try to learn depth by photometric reprojection, our model cannot learn from areas with no texture, such as sky or featureless walls.
This is because the reprojection loss is equal across areas of homogeneous texture. To resolve the correct reprojection we need context!
This problem is usually resolved by a smoothing prior, which encourages the output to be smooth where there is no training signal, but this also blurs correct structure.
Thirdly, we don’t need to reconstruct everything.
Learning to reproject pixels is similar to an autoencoder — we learn to encode all parts of the world equally.
However, for many practical applications, attention based reasoning has been shown to be most effective.
For example, in autonomous driving we don’t need to learn the geometry of building facades and the sky, we only care about the immediate scene in front of us.
However, reprojection losses will treat all aspects of the scene equally.
State of the art stereo depth estimation from GC-Net. This figure shows the saliency of the model with respect to the depth prediction for the point with the white cross. This demonstrates the model uses a wider context of the surrounding car and road to make its prediction.
How can we improve performance?
It is difficult to learn geometry alone; I think we need to incorporate semantics.
There is some evidence that deep learning models learn semantic representations implicitly from patterns in the data.
Perhaps our models could more explicitly exploit this?
I think we need to reproject into a better space than RGB photometric space. We would like this latent space to solve the problems above.
It should have enough context to address the aperture problem, be invariant to small photometric changes and emphasise task-dependent importance.
Training on the projection error in this space should result in a better performing model.
After the flurry of exciting papers in 2017, I’m looking forward to further advances in 2018 in one of the hottest topics in computer vision right now.
I first presented the ideas in this blog post at the Geometry in Deep Learning Workshop at the International Conference on Computer Vision 2017.
Thank you to the organisers for a great discussion.
When I am in the pub and I tell people I am working on Artificial Intelligence (AI) research, the conversation that invariably comes up is, “Why are you building machines to take all our jobs?”
However, within AI research communities, this topic is rarely discussed.
My experience with colleagues is that it is often dismissed with off-hand arguments such as, “We’ll make more advanced jobs to replace those which are automated”.
I’d like to pose the question to all AI researchers: how long have you actually sat down and thought about ethics?
Unfortunately, the overwhelming opinion is that we just build the tools and ethics are for the end-users of our algorithms to deal with.
But, I believe we have a professional responsibility to build ethics into our systems from the ground up.
In this blog I’m going to discuss how to build ethical algorithms.
How do we implement ethics? There are a number of ethical frameworks which philosophers have designed:
virtue ethics (Aristotle), deontological ethics (Kant) and teleological ethics (Mill, Bentham).
However Toby Walsh, Professor of AI at the University of New South Wales in Australia, has a different view:
“For too long philosophers have come up with vague frameworks. Now that we have to write code to actually implement AI, it forces us to define a much more precise and specific moral code”
Therefore, AI researchers may even have an opportunity to make fundamental contributions to our understanding of ethics!
I have found it interesting to think about what ethical issues such as trust, fairness and honesty mean for AI researchers.
Concrete ethical issues for machine learners
In this section I am going to discuss concrete ways to implement trust, fairness and honesty in AI models.
I will try to translate these ethical topics into actual machine learning problems.
Trust
It is critical that users trust AI systems, otherwise their acceptance in society will be jeopardised.
For example, if a self-driving car is not trustworthy, it is unlikely anyone will want to use it.
Building trustworthy algorithms means we must make them safe by:
improving accuracy and performance of algorithms. We are more likely to trust something which is accurate.
This is something which most machine learning researchers do,
designing algorithms which are aware of their uncertainty and understand what they do not know.
This means they will not make ill-founded decisions.
See a previous blog I wrote on Bayesian deep learning for some ideas here,
designing well-founded decision-making or control policies which do not over-emphasise exploration at the expense of exploitation.
Fairness
We have a responsibility to make AI fair. This means to remove unwanted bias in algorithms.
Unfortunately, there are many examples of biased / unfair AI systems today. For example:
most self-driving cars are biased to training data in California.
There are in fact many sources of bias in algorithms.
For an excellent taxonomy, see this paper.
In some situations we even want bias - but this is something we must understand.
Concrete problems to improve fairness and improve bias in AI systems include:
improving data efficiency to better learn rare classes,
improving methodologies for collecting and labelling data to remove training data bias,
improving causal reasoning so we can remove an algorithm’s access to explanatory variables which we deem morally unusable (e.g. race).
Honesty
Honesty requires algorithms to be transparent and interpretable.
We should expect algorithms to provide reasonable explanations for the decisions they make.
This is going to become a significant commercial concern in the EU when the new GDPR data laws go into effect in 2018.
They grant users the right to a logical explanation of how an algorithm uses their personal data.
This is going to require advances in:
saliency techniques which explain causality from input to output,
interpretability of black box models. Models must be able to explain their reasoning.
This may be by forward simulation of an internal state or by analysing human interpretable intermediate outputs.
Will we be accountable for the AI we build?
Perhaps of more immediate concern to AI researchers is if we will be accountable for the algorithms we build.
If a self-driving car fails to obey the road rules and kills pedestrians, is the computer vision engineer at fault?
Another interesting example is the venture capital firm Deep Knowledge Ventures, which appointed an AI to its board of directors in 2014.
What happens if this AI fails in its responsibilities as company director?
Today, AI cannot be legally or criminally accountable as it has no legal entity and owns no assets.
This means liability lies with the manufacturer.
Unfortunately, it is unlikely we will be able to get insurance to cover liability of autonomous systems.
This is because there is a total lack of data on them.
Without these statistics, insurance firms are unable to estimate risk profiles.
Most professional bodies (law, medicine, engineering) regulate standards and values of their profession.
In my undergraduate engineering school we had compulsory courses on professional standards and ethics.
Why do most computer science courses not do the same?
Medical research trials involving humans have to pass several ethics committee processes before being admitted.
I don’t think we need the same approval before running a deep learning experiment, but I think that there needs to be more awareness from the tech world.
For example, in 2014 Facebook conducted human subject trials which were widely condemned.
They wanted to see if they could modify human emotion by changing the types of content shown in their newsfeed.
Is this ethical? Did the 689,000 people involved willingly consent?
Why are tech companies exempt from the ethical procedures we place on other fields of research?
Is AI going to steal our jobs?
Going back to the original question I was asked in the pub, “will robots steal our jobs?” I think there are some real concerns here.
The AI revolution will be different to the industrial revolution. The industrial revolution lasted 100 years, which is 4 generations.
This was sufficient time for subsequent generations to be re-skilled in their jobs of the future.
The disruptive technology due to AI is likely to occur much more rapidly (perhaps a single generation).
We will need to re-skill within our lifetime, or else become jobless.
It is worth noting that in some situations, AI is going to increase employment.
For example, AI is drastically improving the efficiency of matching kidney donors to patients with kidney disease, increasing the work for surgeons.
But these will certainly be isolated examples.
Disruptive AI technology is going to displace many of today’s jobs.
Hopefully automation will drastically reduce the cost of living.
Perhaps this places less pressure to have a job to work for money.
But it is well-known that humans need to feel self-worth to feel happy.
Perhaps entertainment and education will be enough for many people?
For those for whom it is not, we need to shift to a new distribution of jobs, quickly. Here are some positive ideas I like:
reactive retraining of those who have jobs displaced by automation. For example, Singapore has a retraining fund for people who have been replaced by automation,
proactive retraining, for example changing the way we teach accountants and radiologists today, because their jobs are being displaced,
allow automation to increase our leisure time,
redeploy labour into education and healthcare which require more human interaction than other fields,
introducing a living wage or universal basic income.
Eventually, perhaps more extreme regulation will be needed here to keep humans commercially competitive with robots.
Acknowledgements: I’d like to thank Adrian Weller for first opening my eyes to these issues.
This blog was written while attending the International Joint Conference on Artificial Intelligence (IJCAI) 2017
where I presented a paper on autonomous vehicle safety.
Thank you to the conference organisers for an excellent forum to discuss these topics.
Understanding what a model does not know is a critical part of many machine learning systems.
Unfortunately, today’s deep learning algorithms are usually unable to understand their uncertainty.
These models are often taken blindly and assumed to be accurate, which is not always the case.
For example, in two recent situations this has had disastrous consequences.
In May 2016 we tragically experienced the first fatality from an assisted driving system.
According to the manufacturer’s blog,
“Neither Autopilot nor the driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied.”
In July 2015, an image classification system erroneously identified two African American people as gorillas, raising concerns of racial discrimination.
See the news report here.
And I’m sure there are many more interesting cases too!
If both these algorithms were able to assign a high level of uncertainty to their erroneous predictions, then each system may have been able to make better decisions and likely avoid disaster.
It is clear to me that understanding uncertainty is important. So why doesn’t everyone do it?
The main issue is that traditional machine learning approaches to understanding uncertainty, such as Gaussian processes, do not scale to high dimensional inputs like images and videos.
To effectively understand this data, we need deep learning. But deep learning struggles to model uncertainty.
In this post I’m going to introduce a resurging field known as Bayesian deep learning (BDL), which provides a deep learning framework which can also model uncertainty.
BDL can achieve state-of-the-art results, while also understanding uncertainty.
I’m going to explain the different types of uncertainty and show how to model them.
Finally, I’ll discuss a recent result which shows how to use uncertainty to weight losses for multi-task deep learning.
The material for this blog post is mostly taken from my two recent papers:
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Alex Kendall and Yarin Gal, 2017. (.pdf)
Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. Alex Kendall, Yarin Gal and Roberto Cipolla, 2017. (.pdf)
And, as always, more technical details can be found there!
An example of why it is really important to understand uncertainty for depth estimation. The first image is an example input into a Bayesian neural network which estimates depth, as shown by the second image. The third image shows the estimated uncertainty. You can see the model predicts the wrong depth on difficult surfaces, such as the red car’s reflective and transparent windows. Thankfully, the Bayesian deep learning model is also aware it is wrong and exhibits increased uncertainty.
Types of uncertainty
The first question I’d like to address is what is uncertainty?
There are actually different types of uncertainty and we need to understand which types are required for different applications.
I’m going to discuss the two most important types – epistemic and aleatoric uncertainty.
Epistemic uncertainty
Epistemic uncertainty captures our ignorance about which model generated our collected data.
This uncertainty can be explained away given enough data, and is often referred to as model uncertainty.
Epistemic uncertainty is really important to model for:
Safety-critical applications, because epistemic uncertainty is required to understand examples which are different from training data,
Small datasets where the training data is sparse.
Aleatoric uncertainty
Aleatoric uncertainty captures our uncertainty with respect to information which our data cannot explain.
For example, aleatoric uncertainty in images can be attributed to occlusions (because cameras can’t see through objects) or lack of visual features or over-exposed regions of an image, etc.
It can be explained away with the ability to observe all explanatory variables with increasing precision.
Aleatoric uncertainty is very important to model for:
Large data situations, where epistemic uncertainty is mostly explained away,
Real-time applications, because we can form aleatoric models as a deterministic function of the input data, without expensive Monte Carlo sampling.
We can actually divide aleatoric into two further sub-categories:
Data-dependent or heteroscedastic uncertainty is aleatoric uncertainty which depends on the input data and is predicted as a model output.
Task-dependent or homoscedastic uncertainty is aleatoric uncertainty which does not depend on the input data.
It is not a model output; rather, it is a quantity which stays constant for all input data and varies between different tasks. It can therefore be described as task-dependent uncertainty.
Later in the post I’m going to show how this is really useful for multi-task learning.
Illustrating the difference between aleatoric and epistemic uncertainty for semantic segmentation. You can notice that aleatoric uncertainty captures object boundaries where labels are noisy. The bottom row shows a failure case of the segmentation model, when the model is unfamiliar with the footpath, and the corresponding increased epistemic uncertainty.
Next, I’m going to show how to form models to capture this uncertainty using Bayesian deep learning.
Bayesian deep learning
Bayesian deep learning is a field at the intersection between deep learning and Bayesian probability theory.
It offers principled uncertainty estimates from deep learning architectures.
These deep architectures can model complex tasks by leveraging the hierarchical representation power of deep learning, while also being able to infer complex multi-modal posterior distributions.
Bayesian deep learning models typically form uncertainty estimates by either placing distributions over model weights, or by learning a direct mapping to probabilistic outputs.
In this section I’m going to briefly discuss how we can model both epistemic and aleatoric uncertainty using Bayesian deep learning models.
Firstly, we can model heteroscedastic aleatoric uncertainty just by changing our loss functions.
Because this uncertainty is a function of the input data, we can learn to predict it using a deterministic mapping from inputs to model outputs.
For regression tasks, we typically train with a Euclidean/L2 loss: $\mathcal{L} = \|y - \hat{y}\|^2$. To learn a heteroscedastic uncertainty model, we simply replace the loss function with the following:

$\mathcal{L} = \frac{\|y - \hat{y}\|^2}{2\sigma^2} + \frac{1}{2}\log \sigma^2$

where the model predicts a mean $\hat{y}$ and variance $\sigma^2$.
As you can see from this equation, if the model predicts something very wrong, then it will be encouraged to attenuate the residual term by increasing the uncertainty $\sigma^2$.
However, the $\log \sigma^2$ term prevents the uncertainty growing infinitely large. This can be thought of as learned loss attenuation.
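The attenuation behaviour is easy to see numerically. Below is a minimal numpy sketch of the heteroscedastic regression loss (the function name and toy numbers are illustrative, not from the papers); predicting the log-variance rather than the variance itself keeps the loss numerically stable.

```python
import numpy as np

def heteroscedastic_loss(y_true, y_pred, log_var):
    """Learned loss attenuation for regression: the network predicts a
    mean y_pred and a log-variance log_var per sample. The exp(-log_var)
    factor attenuates large residuals; the +log_var term penalises
    predicting high uncertainty everywhere."""
    residual = (y_true - y_pred) ** 2
    return np.mean(0.5 * np.exp(-log_var) * residual + 0.5 * log_var)

y_true = np.array([10.0])
y_pred = np.array([0.0])          # a badly wrong prediction

# Confident (sigma^2 = 1) vs. uncertain (sigma^2 = 100) on the same error:
loss_confident = heteroscedastic_loss(y_true, y_pred, np.array([0.0]))
loss_uncertain = heteroscedastic_loss(y_true, y_pred, np.array([np.log(100.0)]))
```

On this wrong prediction the loss is far lower when the model also reports high uncertainty, which is exactly the attenuation effect described above.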
Homoscedastic aleatoric uncertainty can be modelled in a similar way, however the uncertainty parameter will no longer be a model output, but a free parameter we optimise.
On the other hand, epistemic uncertainty is much harder to model. This requires us to model distributions over models and their parameters which is much harder to achieve at scale.
A popular technique to model this is Monte Carlo dropout sampling which places a Bernoulli distribution over the network’s weights.
In practice, this means we can train a model with dropout. Then, at test time, rather than performing model averaging, we can stochastically sample from the network with different random dropout masks.
The statistics of this distribution of outputs will reflect the model’s epistemic uncertainty.
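A minimal sketch of Monte Carlo dropout, using a toy two-layer numpy network rather than a real deep model (all weights and shapes here are made up for illustration): dropout stays active at test time, and the variance across stochastic forward passes serves as the epistemic uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p=0.5, n_samples=50):
    """Run several stochastic forward passes with dropout active and
    return the predictive mean and the sample variance of the outputs.
    The variance approximates the model's epistemic uncertainty."""
    outputs = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
        mask = rng.random(h.shape) < (1.0 - p)      # Bernoulli dropout mask
        h = h * mask / (1.0 - p)                    # inverted-dropout scaling
        outputs.append(h @ W2)
    outputs = np.stack(outputs)
    return outputs.mean(axis=0), outputs.var(axis=0)

x = np.ones((1, 3))
W1 = np.ones((3, 4))
W2 = np.ones((4, 1))
mean, var = mc_dropout_predict(x, W1, W2)
```

In practice the same idea applies to a trained convolutional network: keep its dropout layers stochastic at test time and aggregate multiple forward passes.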
In the previous section, I explained the properties that define aleatoric and epistemic uncertainty.
One of the exciting results in our paper was that we could show that this formulation gives results which satisfy these properties.
Here’s a quick summary of some results of a monocular depth regression model on two datasets:
| Training Data | Testing Data | Aleatoric Variance | Epistemic Variance |
| --- | --- | --- | --- |
| Trained on dataset #1 | Tested on dataset #1 | 0.485 | 2.78 |
| Trained on 25% dataset #1 | Tested on dataset #1 | 0.506 | 7.73 |
| Trained on dataset #1 | Tested on dataset #2 | 0.461 | 4.87 |
| Trained on 25% dataset #1 | Tested on dataset #2 | 0.388 | 15.0 |
These results show that when we train on less data, or test on data which is significantly different from the training set, then our epistemic uncertainty increases drastically.
However, our aleatoric uncertainty remains relatively constant – which it should – because it is tested on the same problem with the same sensor.
Uncertainty for multi-task learning
Next I’m going to discuss an interesting application of these ideas for multi-task learning.
Multi-task learning aims to improve learning efficiency and prediction accuracy by learning multiple objectives from a shared representation.
It is prevalent in many areas of machine learning, from NLP to speech recognition to computer vision.
Multi-task learning is of crucial importance in systems where long computation run-time is prohibitive, such as the ones used in robotics.
Combining all tasks into a single model reduces computation and allows these systems to run in real-time.
Most multi-task models train on different tasks using a weighted sum of the losses.
However, the performance of these models is strongly dependent on the relative weighting between each task’s loss.
Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice.
In our recent paper, we propose to use homoscedastic uncertainty to weight the losses in multi-task learning models.
Since homoscedastic uncertainty does not vary with input data, we can interpret it as task uncertainty.
This allows us to form a principled loss to simultaneously learn various tasks.
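A rough NumPy sketch of this kind of uncertainty-weighted objective, simplified from the regression form in the paper (constant ½ factors dropped; names are mine):

```python
import numpy as np

def multitask_loss(task_losses, log_vars):
    """Combine per-task losses weighted by learned task (homoscedastic)
    uncertainty. Each s_i = log(sigma_i^2) is a free parameter optimised
    with the network weights: exp(-s_i) down-weights uncertain tasks,
    while the +s_i term stops every weight collapsing to zero."""
    task_losses = np.asarray(task_losses, dtype=float)
    log_vars = np.asarray(log_vars, dtype=float)
    return float(np.sum(np.exp(-log_vars) * task_losses + log_vars))

# With equal (unit) task variances this reduces to a plain sum of losses:
assert np.isclose(multitask_loss([1.0, 2.0], [0.0, 0.0]), 3.0)
```

Because the `log_vars` are learned rather than hand-tuned, the relative task weightings adapt automatically during training.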
We explore multi-task learning within the setting of visual scene understanding in computer vision.
Scene understanding algorithms must understand both the geometry and semantics of the scene at the same time.
This forms an interesting multi-task learning problem because scene understanding involves joint learning of various regression and classification tasks with different units and scales.
Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task.
Multi-task learning improves the smoothness and accuracy for depth perception because it learns a representation that uses cues from other tasks, such as segmentation (and vice versa).
Some challenging research questions
Why doesn’t Bayesian deep learning power all of our A.I. systems today? I think they should, but there are a few really tough research questions remaining.
To conclude this blog I’m going to mention a few of them:
Real-time epistemic uncertainty techniques. Monte Carlo sampling is an expensive operation, which is preventing epistemic uncertainty models from being deployed in real-time robotics applications.
Either increasing sample efficiency, or new methods which don’t rely on Monte Carlo inference, would be incredibly beneficial.
Benchmarks for Bayesian deep learning models. It is incredibly important to quantify improvement to rapidly develop models – look at what benchmarks like ImageNet have done for computer vision.
We need benchmark suites to measure the calibration of uncertainty in BDL models too.
Better inference techniques to capture multi-modal distributions. For example, see the demo Yarin set up here which shows some multi-modal data that MC dropout inference fails to model.
Deep learning has revolutionised computer vision.
Today, there are few computer vision problems where the best performing solution is not based on an end-to-end deep learning model.
In particular, convolutional neural networks are popular as they tend to work fairly well out of the box.
However, these models are largely big black boxes. There are a lot of things we don’t understand about them.
Despite this, we are getting some very exciting results with deep learning.
Remarkably, researchers are able to claim a lot of low-hanging fruit with some data and 20 lines of code using a basic deep learning API.
While these results are benchmark-breaking, I think they are often naive and missing a principled understanding.
In this blog post I am going to argue that people often apply deep learning models naively to computer vision problems – and that we can do better.
I think a really good example is with some of my own work from the first year of my PhD.
PoseNet was an algorithm I developed for learning camera pose with deep learning.
This problem has been studied for decades in computer vision, and has some really nice surrounding theory.
However, as a naive first-year graduate student, I applied a deep learning model to learn the problem end-to-end and obtained some nice results, while completely ignoring the theory of the problem.
At the end of the post I will describe some recent follow-on work which looks at this problem from a more theoretical, geometry-based approach and vastly improves performance.
I think we’re running out of low-hanging fruit, or problems we can solve with a simple high-level deep learning API.
Specifically, I think many of the next advances in computer vision with deep learning will come from insights to geometry.
What do I mean by geometry?
In computer vision, geometry describes the structure and shape of the world.
Specifically, it concerns measures such as depth, volume, shape, pose, disparity, motion or optical flow.
The dominant reason why I believe geometry is important in vision models is that it defines the structure of the world, and we understand this structure (e.g. from the many prominent textbooks).
Consequently, there are a lot of complex relationships, such as depth and motion, which do not need to be learned from scratch with deep learning.
By building architectures which use this knowledge, we can ground them in reality and simplify the learning problem.
Some examples at the end of this blog show how we can use geometry to improve the performance of deep learning architectures.
The alternative paradigm is using semantic representations.
Semantic representations use a language to describe relationships in the world. For example, we might describe an object as a ‘cat’ or a ‘dog’.
But, I think geometry has two attractive characteristics over semantics:
Geometry can be directly observed. We see the world’s geometry directly using vision.
At the most basic level, we can observe motion and depth directly from a video by following corresponding pixels between frames.
Some other interesting examples include observing shape from shading or depth from stereo disparity.
In contrast, semantic representations are usually tied to a particular human language, with labels corresponding to a limited set of nouns, which cannot be directly observed.
Geometry is based on continuous quantities. For example, we can measure depth in metres or disparity in pixels.
In contrast, semantic representations are largely discretised quantities or binary labels.
Why are these properties important? One reason is that they are particularly useful for unsupervised learning.
A structure from motion reconstruction of the geometry around central Cambridge, UK - produced from my phone's video camera.
Unsupervised learning
Unsupervised learning is an exciting area of artificial intelligence research which is about learning representations and structure without labelled data.
It is particularly exciting, because getting large amounts of labeled training data is difficult and expensive.
Unsupervised learning offers a far more scalable framework.
We can use the two properties which I described above to form unsupervised learning models with geometry: observability and continuous representation.
For example, one of my favourite papers last year showed how to use geometry to learn depth with unsupervised training.
I think this is a great example of how geometric theory and the properties described above can be combined to form an unsupervised learning model.
Other research papers have also demonstrated similar ideas which use geometry for unsupervised learning from motion.
One problem with relying just on semantics to design a representation of the world, is that semantics are defined by humans.
It is essential for an AI system to understand semantics to form an interface with humanity.
However, because semantics are defined by humans, it is also likely that these representations aren’t optimal.
Learning directly from the observed geometry in the world might be more natural.
It is also understood that low level geometry is what we use to learn to see as infant humans.
According to the American Optometric Association, we spend the first 9 months of our lives learning to coordinate our eyes to focus and perceive depth, colour and geometry.
It is not until around 12 months of age that we learn to recognise objects and semantics.
This illustrates that a grounding in geometry is important to learn the basics in human vision.
I think we would do well to take these insights into our computer vision models.
A machine's semantic view of the world (a.k.a. SegNet). Each colour represents a different semantic class - such as road, pedestrian, sign, etc.
Examples of geometry in my recent research
I’d like to conclude this blog post by giving two concrete examples of how we can use geometry in deep learning from my own research:
Learning to relocalise with PoseNet
In the introduction to this blog post I gave the example of PoseNet which is a monocular 6-DOF relocalisation algorithm.
It solves what is known as the kidnapped robot problem.
In the initial paper from ICCV 2015, we solved this by learning an end-to-end mapping from input image to 6-DOF camera pose.
This naively treats the problem as a black box.
At CVPR this year, we will present an update to this method which considers the geometry of the problem.
In particular, rather than learning camera position and orientation values as separate regression targets, we learn them together using the geometric reprojection error.
This accounts for the geometry of the world and gives significantly improved results.
Predicting depth with stereo vision
The second example is in stereo vision – estimating depth from binocular vision.
I had the chance to work on this problem while spending a fantastic summer with Skydio, working on the most advanced drones in the world.
Stereo algorithms typically estimate the difference in the horizontal position of an object between a rectified pair of stereo images.
This is known as disparity, which is inversely proportional to the scene depth at the corresponding pixel location.
So, essentially it can be reduced to a matching problem: find the correspondences between objects in your left and right images and you can compute depth.
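The relation itself is a one-liner, depth = focal length × baseline / disparity; the numbers below are made up for illustration:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth is inversely proportional to disparity for a rectified pair:
    depth (m) = focal length (px) * baseline (m) / disparity (px)."""
    return focal_px * baseline_m / disparity_px

# e.g. a 20 px disparity with a 700 px focal length and a 0.3 m baseline:
assert abs(depth_from_disparity(20.0, 700.0, 0.3) - 10.5) < 1e-9
```

Note the inverse relationship: distant objects have tiny disparities, which is why stereo depth estimates degrade with range.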
The top performing algorithms in stereo predominantly use deep learning, but only for building features for matching.
The matching and regularisation steps required to produce depth estimates are largely still done by hand.
We proposed the architecture GC-Net which instead looks at the problem’s fundamental geometry.
It is well known in stereo that we can estimate disparity by forming a cost volume across the 1-D disparity line.
The novelty in this paper was showing how to formulate the geometry of the cost volume in a differentiable way as a regression model.
More details can be found in the paper here.
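As a rough sketch of the idea, here is a simplified concatenation-based cost volume in the spirit of GC-Net; the shapes are toy values and this is not the actual architecture:

```python
import numpy as np

def cost_volume(left_feat, right_feat, max_disp):
    """Build a (max_disp, H, W, 2C) volume by concatenating left features
    with right features shifted by each candidate disparity. A network can
    then regularise this volume with 3-D convolutions and regress disparity,
    keeping the whole pipeline differentiable."""
    H, W, C = left_feat.shape
    volume = np.zeros((max_disp, H, W, 2 * C))
    for d in range(max_disp):
        shifted = np.zeros_like(right_feat)
        shifted[:, d:, :] = right_feat[:, :W - d, :]   # shifted[x] = right[x - d]
        volume[d] = np.concatenate([left_feat, shifted], axis=-1)
    return volume

left = np.random.rand(4, 8, 3)    # toy (H, W, C) feature maps
right = np.random.rand(4, 8, 3)
assert cost_volume(left, right, 5).shape == (5, 4, 8, 6)
```

Because the volume construction is just indexing and concatenation, gradients flow through it, which is what lets the matching step be learned end-to-end rather than hand-engineered.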
An overview of the GC-Net architecture which uses an explicit representation of geometry to predict stereo depth.
Conclusions
I think the key messages to take away from this post are:
it is worth understanding classical approaches to computer vision problems (especially if you come from a machine learning or data science background),
learning complicated representations with deep learning is easier and more effective if the architecture can be structured to leverage the geometric properties of the problem.