Today I can share my final PhD thesis, which I submitted in November 2017. It was examined by Dr. Joan Lasenby and Prof. Andrew Zisserman in February 2018 and has just been approved for publication. This thesis presents the main narrative of my research at the University of Cambridge, under the supervision of Prof. Roberto Cipolla. It contains 206 pages, 62 figures, 24 tables and 318 citations. You can download the complete .pdf here.

My thesis presents contributions to the field of computer vision, the science which enables machines to see. This blog post introduces the work and tells the story behind this research.

This thesis presents deep learning models for an array of computer vision problems: semantic segmentation, instance segmentation, depth prediction, localisation, stereo vision and video scene understanding.

The abstract

Deep learning and convolutional neural networks have become the dominant tools for computer vision. These techniques excel at learning complicated representations from data using supervised learning. In particular, image recognition models now outperform human baselines under constrained settings. However, the science of computer vision aims to build machines which can see. This requires models which can extract richer information than recognition from images and video. In general, applying these deep learning models beyond recognition to other problems in computer vision is significantly more challenging.

This thesis presents end-to-end deep learning architectures for a number of core computer vision problems: scene understanding, camera pose estimation, stereo vision and video semantic segmentation. Our models outperform traditional approaches and advance the state-of-the-art on a number of challenging computer vision benchmarks. However, these end-to-end models are often not interpretable and require enormous quantities of training data.

To address this, we make two observations: (i) we do not need to learn everything from scratch, because we know a lot about the physical world, and (ii) we cannot know everything from data, so our models should be aware of what they do not know. This thesis explores these ideas using concepts from geometry and uncertainty. Specifically, we show how to improve end-to-end deep learning models by leveraging the underlying geometry of the problem. We explicitly model concepts such as epipolar geometry to learn with unsupervised learning, which improves performance. Secondly, we introduce ideas from probabilistic modelling and Bayesian deep learning to understand uncertainty in computer vision models. We show how to quantify different types of uncertainty, improving safety for real world applications.

The story

I began my PhD in October 2014, joining the controls research group at Cambridge University Engineering Department. Looking back at my original research proposal, I said that I wanted to work on the ‘engineering questions to control autonomous vehicles… in uncertain and challenging environments.’ I spent three months or so reading literature, and quickly developed the opinion that the field of robotics was most limited by perception. If you could obtain a reliable state of the world, control was often simple. However, at this time, computer vision was very fragile in the wild. After many weeks of lobbying Prof. Roberto Cipolla (thanks!), I was able to join his research group in January 2015 and begin a PhD in computer vision.

When I began reading computer vision literature, deep learning had just become popular in image classification, following inspiring breakthroughs on the ImageNet dataset. But it was yet to become ubiquitous in the field and be used in richer computer vision tasks such as scene understanding. What excited me about deep learning was that it could learn representations from data that are too complicated to hand-design.

I initially focused on building end-to-end deep learning models for computer vision tasks which I thought were most interesting for robotics, such as scene understanding (SegNet) and localisation (PoseNet). However, I quickly realised that, while it was a start, applying end-to-end deep learning wasn’t enough. In my thesis, I argue that we can do better than naive end-to-end convolutional networks. Especially with limited data and compute, we can form more powerful computer vision models by leveraging our knowledge of the world. Specifically, I focus on two ideas around geometry and uncertainty.

  • Geometry is about leveraging the structure of the world. This is useful for improving architectures and learning with self-supervision.
  • Uncertainty is about understanding what our model does not know. This is useful for robust learning, safety-critical systems and active learning.

Over the last three years, I have had the pleasure of working with some incredibly talented researchers, studying a number of core computer vision problems from localisation to segmentation to stereo vision.

Bayesian deep learning for modelling uncertainty in semantic segmentation.

The science

This thesis consists of six chapters. Each of the main chapters introduces an end-to-end deep learning model and discusses how to apply the ideas of geometry and uncertainty.

Chapter 1 - Introduction. Motivates this work within the wider field of computer vision.

Chapter 2 - Scene Understanding. Introduces SegNet, modelling aleatoric and epistemic uncertainty and a method for learning multi-task scene understanding models for geometry and semantics.

Chapter 3 - Localisation. Describes PoseNet for efficient localisation, with improvements using geometric reprojection error and estimating relocalisation uncertainty.

Chapter 4 - Stereo Vision. Designs an end-to-end model for stereo vision, using geometry and shows how to leverage uncertainty and self-supervised learning to improve performance.

Chapter 5 - Video Scene Understanding. Illustrates a video scene understanding model for learning semantics, motion and geometry.

Chapter 6 - Conclusions. Describes limitations of this research and future challenges.

An overview of the models considered in this thesis.

As for what’s next?

This thesis explains how to extract a robust state of the world – semantics, motion and geometry – from video. I’m now excited about applying these ideas to robotics and learning to reason from perception to action. I’m working with an amazing team on autonomous driving, bringing together the worlds of robotics and machine learning. We’re using ideas from computer vision and reinforcement learning to build the most data-efficient self-driving car. And, we’re hiring, come work with me! wayve.ai/careers

I’d like to give a huge thank you to everyone who motivated, distracted and inspired me while writing this thesis.

Here’s the bibtex if you’d like to cite this work.

@phdthesis{kendall2017geometry,
  title={Geometry and Uncertainty in Deep Learning for Computer Vision},
  author={Kendall, Alex},
  year={2017},
  school={University of Cambridge}
}

And the source code for the latex document is here.


2017 was an exciting year as we saw deep learning become the dominant paradigm for estimating geometry in computer vision.

Learning geometry has emerged as one of the most influential topics in computer vision over the last few years.

“Geometry is … concerned with questions of shape, size, relative position of figures and the properties of space” (wikipedia).

We first saw end-to-end deep learning models for these tasks using supervised learning, for example depth estimation (Eigen et al. 2014), relocalisation (PoseNet 2015), stereo vision (GC-Net 2017) and visual odometry (DeepVO 2017). Deep learning excels at these applications for a few reasons. Firstly, it is able to learn higher-order features which reason over shapes and objects with larger context than point-based classical methods. Secondly, inference is very efficient: a single forward pass of a convolutional neural network approximates an exact geometric function.

Over the last year, I’ve noticed epipolar geometry and reprojection losses improving these models, allowing them to learn with unsupervised learning. This means they can train without expensive labelled data by just observing the world. Reprojection losses have contributed to a number of significant breakthroughs which now allow deep learning to outperform many traditional approaches to estimating geometry. Specifically, photometric reprojection loss has emerged as the dominant technique for learning geometry with unsupervised (or self-supervised) learning. We’ve seen this across a number of computer vision problems:

  • Monocular Depth: reprojection loss for deep learning was first presented for monocular depth estimation by Garg et al. in 2016. In 2017, Godard et al. showed how to formulate left-right consistency checks to improve results.
  • Optical Flow: this requires learning reprojection over 2D displacements and has been demonstrated by Yu et al. 2016, Ren et al. 2017 and Meister et al. 2018.
  • Stereo Depth: in my PhD thesis I show how to extend our stereo architecture, GC-Net, to learn stereo depth with epipolar geometry & unsupervised learning.
  • Localisation: I presented a paper at CVPR 2017 showing how to train relocalisation systems by learning to project 3D geometry from structure-from-motion models (Kendall & Cipolla 2017).
  • Ego-motion: learning depth and ego-motion with reprojection loss now outperforms traditional methods like ORB-SLAM over short sequences under constrained settings (Zhou et al. 2017 and Li et al. 2017).
  • Multi-View Stereo: projection losses can also be used in a supervised setting to learn structure from motion, for example DeMoN and SfM-Net.
  • 3D Shape Estimation: projection geometry also aids learning 3D shape from images in this work from Jitendra Malik’s group.

In this blog post I’d like to highlight the importance of epipolar geometry and how we can use it to learn representations of geometry with deep learning.

An example of state-of-the-art monocular depth estimation with unsupervised learning using reprojection geometry (Godard et al. 2017).

What is reprojection loss?

The core idea behind reprojection losses is using epipolar geometry to relate corresponding points in multi-view stereo imagery. To dissect this jargon-filled sentence: epipolar geometry relates the projection of 3D points in space to 2D images, and can be thought of as triangulation (see the figure below). The relation between two 2D images is defined by the fundamental matrix. If we choose a point in one image and know the fundamental matrix, this geometry tells us that the same point must lie on a line in the second image, called the epipolar line (the red line in the figure below). The exact position of the correspondence on the epipolar line is determined by the 3D point’s depth in the scene.

If these two images are from a rectified stereo camera then this is a special type of multi-view geometry, and the epipolar line is horizontal. We then refer to the corresponding point’s position on the epipolar line as disparity. Disparity is inversely proportional to metric depth.
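This inverse relationship is simply depth = f · B / d, where f is the focal length in pixels and B is the stereo baseline. A minimal NumPy sketch, using illustrative camera parameters of my own choosing (not values from the post):

```python
import numpy as np

# Illustrative rectified-stereo parameters (my own example values).
focal_px = 721.0    # focal length in pixels
baseline_m = 0.54   # distance between the two camera centres in metres

# A toy disparity map in pixels: larger disparity means a closer point.
disparity_px = np.array([[90.0, 45.0],
                         [30.0, 15.0]])

# Depth is inversely proportional to disparity: z = f * B / d.
depth_m = focal_px * baseline_m / disparity_px
print(np.round(depth_m, 2))  # halving the disparity doubles the depth
```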

The standard reference for this topic is the textbook “Multiple View Geometry in Computer Vision” (Hartley and Zisserman, 2004).

Epipolar geometry relates the same point in space seen by two cameras and can be used to learn 3D geometry from multi-view stereo (Image borrowed from Wikipedia).

One way of exploiting this is learning to match correspondences between stereo images along this epipolar line. This allows us to estimate pixel-wise metric depth. We can do this using photometric reprojection loss (Garg et al. 2016). The intuition behind reprojection loss is that pixels representing the same object in two different camera views look the same. Therefore, if we relate pixels, or determine correspondences between two views, the pixels should have identical RGB intensity values. The better the estimate of geometry, the closer the photometric (RGB) pixel values will match. We can optimise for values which provide matching pixel intensities between each image, known as minimising the photometric error.
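A toy NumPy sketch of this idea (my own illustration, not code from any of the papers cited): warp one row of the right image into the left view using a candidate disparity, and score it by the L1 photometric error.

```python
import numpy as np

def warp_right_to_left(right_row, disparity_row):
    """Sample the right image at x - d for each left pixel x (nearest neighbour)."""
    x = np.arange(right_row.shape[0])
    src = np.clip(np.round(x - disparity_row).astype(int), 0, right_row.shape[0] - 1)
    return right_row[src]

def photometric_loss(left_row, right_row, disparity_row):
    """Mean absolute intensity difference after warping: the L1 photometric error."""
    return np.mean(np.abs(left_row - warp_right_to_left(right_row, disparity_row)))

# A synthetic scene: an intensity ramp with a true disparity of 2 pixels, so a
# point at left column x appears at right column x - 2.
left = np.array([10., 11., 12., 13., 14., 15.])
right = left + 2.0  # for a linear ramp, shifting by 2 pixels equals adding 2

good = photometric_loss(left, right, np.full(6, 2.0))  # true disparity
bad = photometric_loss(left, right, np.zeros(6))       # wrong disparity
print(good, bad)  # the correct disparity yields the lower photometric error
```

The residual error at the true disparity comes only from the clipped image border, which is exactly the kind of boundary region real implementations mask out.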

An important property of these losses is that they are unsupervised. This means that we can learn these geometric quantities by observing the world, without expensive human-labelled training data. This is also known as self-supervised learning.

The list of papers at the start of this post further extend this idea to optical flow, depth, ego-motion, localisation etc. — all containing forms of epipolar geometry.

Does this mean learning geometry with deep learning is solved?

I think there are some shortcomings to reprojection losses.

Firstly, photometric reprojection loss makes a photometric consistency assumption: it assumes that the same surface has the same RGB pixel value between views. This assumption is usually valid for stereo vision, because both images are taken at the same time. However, it does not always hold for learning optical flow or multi-view stereo, because appearance and lighting change over time due to occlusion, shadows and the dynamic nature of scenes.

Secondly, reprojection suffers from the aperture problem: when matching using only local appearance, structure is ambiguous wherever texture is uniform. For example, if we try to learn depth by photometric reprojection, our model cannot learn from areas with no texture, such as sky or featureless walls, because the reprojection loss is equal across areas of homogeneous texture. To resolve the correct reprojection we need context! This problem is usually addressed with a smoothing prior, which encourages the output to be smooth where there is no training signal, but this also blurs correct structure.
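One common form of such a smoothing prior is edge-aware smoothness in the spirit of Godard et al. 2017. This NumPy sketch is my own illustration, not their code: penalise disparity gradients, but damp the penalty where the image itself has strong gradients, since those are likely genuine object boundaries.

```python
import numpy as np

def smoothness_loss(disparity, image):
    """Edge-aware smoothness: penalise disparity gradients, damped at image edges."""
    d_grad = np.abs(np.diff(disparity))   # disparity changes between neighbours
    i_grad = np.abs(np.diff(image))       # image intensity changes
    weight = np.exp(-i_grad)              # weight falls towards 0 at strong edges
    return np.mean(weight * d_grad)

disparity = np.array([1., 1., 1., 4., 4., 4.])    # a depth discontinuity
flat_image = np.zeros(6)                          # textureless region
edge_image = np.array([0., 0., 0., 5., 5., 5.])   # an edge at the discontinuity

print(smoothness_loss(disparity, flat_image))  # heavily penalised
print(smoothness_loss(disparity, edge_image))  # damped: the edge explains the jump
```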

Thirdly, we don’t need to reconstruct everything. Learning to reproject pixels is similar to an autoencoder: we learn to encode all parts of the world equally. However, for many practical applications, attention-based reasoning has been shown to be more effective. For example, in autonomous driving we don’t need to learn the geometry of building facades and the sky; we only care about the immediate scene in front of us. Reprojection losses, however, treat all aspects of the scene equally.

State-of-the-art stereo depth estimation from GC-Net. This figure shows the saliency of the model with respect to the depth prediction for the point with the white cross, demonstrating that the model uses the wider context of the surrounding car and road to make its prediction.

How can we improve performance?

It is difficult to learn geometry alone; I think we need to incorporate semantics. There is some evidence that deep learning models implicitly learn semantic representations from patterns in the data. Perhaps our models could exploit this more explicitly?

I think we need to reproject into a better space than RGB photometric space. We would like this latent space to solve the problems above: it should have enough context to address the aperture problem, be invariant to small photometric changes and emphasise task-dependent importance. Training on the projection error in this space should result in a better-performing model.

After the flurry of exciting papers in 2017, I’m looking forward to further advances in 2018 in one of the hottest topics in computer vision right now.

I first presented the ideas in this blog post at the Geometry in Deep Learning Workshop at the International Conference on Computer Vision 2017. Thank you to the organisers for a great discussion.


When I am in the pub and I tell people I am working on Artificial Intelligence (AI) research, the conversation that invariably comes up is, “Why are you building machines to take all our jobs?” However, within AI research communities, this topic is rarely discussed. My experience with colleagues is that it is often dismissed with off-hand arguments such as, “We’ll make more advanced jobs to replace those which are automated”. I’d like to pose the question to all AI researchers: how long have you actually sat down and thought about ethics? Unfortunately, the overwhelming opinion is that we just build the tools and ethics are for the end-users of our algorithms to deal with. But, I believe we have a professional responsibility to build ethics into our systems from the ground up. In this blog I’m going to discuss how to build ethical algorithms.

How do we implement ethics? There are a number of ethical frameworks which philosophers have designed: virtue ethics (Aristotle), deontological ethics (Kant) and teleological ethics (Mill, Bentham). However, Toby Walsh, Professor of AI at the University of New South Wales in Australia, has a different view:

“For too long philosophers have come up with vague frameworks. Now that we have to write code to actually implement AI, it forces us to define a much more precise and specific moral code”

Therefore, AI researchers may even have an opportunity to make fundamental contributions to our understanding of ethics! I have found it interesting to think about what ethical issues such as trust, fairness and honesty mean for AI researchers.

Concrete ethical issues for machine learners

In this section I am going to discuss concrete ways to implement trust, fairness and honesty in AI models. I will try to translate these ethical topics into actual machine learning problems.


It is critical that users trust AI systems, otherwise their acceptance in society will be jeopardised. For example, if a self-driving car is not trustworthy, it is unlikely anyone will want to use it. Building trustworthy algorithms means we must make them safe by:

  • improving accuracy and performance of algorithms. We are more likely to trust something which is accurate. This is something which most machine learning researchers do,
  • designing algorithms which are aware of their uncertainty and understand what they do not know. This means they will not make ill-founded decisions. See a previous blog I wrote on Bayesian deep learning for some ideas here,
  • designing well-founded decision-making and control policies which do not over-emphasise exploration over exploitation.
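As a toy illustration of the second bullet (my own sketch, not code from the post or the linked blog): with the Monte Carlo dropout idea from Bayesian deep learning, we keep dropout active at test time, run several stochastic forward passes, and read the spread of the predictions as a measure of model uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, weights, n_passes=100, drop_p=0.5):
    """Repeated forward passes of a one-layer toy model with dropout kept on."""
    preds = []
    for _ in range(n_passes):
        mask = rng.random(weights.shape) > drop_p      # fresh dropout mask each pass
        preds.append(np.sum(weights * mask * x) / (1 - drop_p))
    preds = np.array(preds)
    return preds.mean(), preds.std()  # predictive mean and an uncertainty estimate

x = np.ones(8)
weights = np.ones(8)  # a fixed toy model
mean, std = mc_dropout_predict(x, weights)
print(round(mean, 2), round(std, 2))  # expect a mean near 8 and a nonzero std
```

A model whose predictions scatter widely across passes is telling us it does not know, and a safe system can defer or ask for help in exactly those cases.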

We have a responsibility to make AI fair. This means removing unwanted bias from algorithms. Unfortunately, there are many examples of biased and unfair AI systems deployed today.

There are in fact many sources of bias in algorithms; for an excellent taxonomy, see this paper. In some situations we may even want bias, but this is something we must understand. Concrete problems to improve fairness and reduce bias in AI systems include:

  • improving data efficiency to better learn rare classes,
  • improving methodologies for collecting and labelling data to remove training-data bias,
  • improving causal reasoning so we can remove an algorithm’s access to explanatory variables which we deem morally unusable (e.g. race).

Honesty requires algorithms to be transparent and interpretable. We should expect algorithms to provide reasonable explanations for the decisions they make. This is going to become a significant commercial concern in the EU when the new GDPR data laws come into effect in 2018, granting users the right to a logical explanation of how an algorithm uses their personal data. This is going to require advances in:

  • saliency techniques which explain causality from input to output,
  • interpretability of black-box models. Models must be able to explain their reasoning, whether by forward simulation of an internal state or by analysing human-interpretable intermediate outputs.
Will we be accountable for the AI we build?

Perhaps of more immediate concern to AI researchers is whether we will be accountable for the algorithms we build. If a self-driving car fails to obey the road rules and kills a pedestrian, is the computer vision engineer at fault? Another interesting example is the venture capital firm Deep Knowledge Ventures, which appointed an AI to its board of directors in 2014. What happens if this AI fails in its responsibilities as a company director? Today, AI cannot be held legally or criminally accountable, as it has no legal entity and owns no assets. This means liability lies with the manufacturer.

Unfortunately, it is unlikely we will be able to get insurance to cover the liability of autonomous systems, because there is so little data on how they behave in the wild. Without these statistics, insurance firms are unable to estimate risk profiles.

Most professional bodies (law, medicine, engineering) regulate standards and values of their profession. In my undergraduate engineering school we had compulsory courses on professional standards and ethics. Why do most computer science courses not do the same?

Medical research trials involving humans have to pass several ethics committee processes before being approved. I don’t think we need the same approval before running a deep learning experiment, but I think there needs to be more awareness from the tech world. For example, in 2014 Facebook conducted human-subject trials which were widely condemned: they wanted to see if they could modify human emotion by changing the types of content shown in the newsfeed. Is this ethical? Did the 689,000 people involved willingly consent? Why are tech companies exempt from the ethical procedures we place on other fields of research?

Is AI going to steal our jobs?

Going back to the original question I was asked in the pub, “will robots steal our jobs?” I think there are some real concerns here.

The AI revolution will be different to the industrial revolution. The industrial revolution lasted 100 years, which is 4 generations. This was sufficient time for subsequent generations to be re-skilled in their jobs of the future. The disruptive technology due to AI is likely to occur much more rapidly (perhaps a single generation). We will need to re-skill within our lifetime, or else become jobless.

It is worth noting that in some situations, AI is going to increase employment. For example, AI is drastically improving the efficiency of matching kidney donors to patients with kidney disease, increasing the work for surgeons. But these will certainly be isolated examples. Disruptive AI technology is going to displace many of today’s jobs.

Hopefully automation will drastically reduce the cost of living, which may reduce the pressure to hold a job purely to earn money. But it is well known that humans need to feel self-worth to be happy. Perhaps entertainment and education will be enough for many people?

For those for whom it is not, we need to shift to a new distribution of jobs, quickly. Here are some positive ideas I like:

  • reactive retraining of those who have jobs displaced by automation. For example, Singapore has a retraining fund for people who have been replaced by automation,
  • proactive retraining, for example changing the way we teach accountants and radiologists today, because their jobs are being displaced,
  • allow automation to increase our leisure time,
  • redeploy labour into education and healthcare which require more human interaction than other fields,
  • MOOCs and other educational tools,
  • introducing a living wage or universal basic income.

Eventually, perhaps more extreme regulation will be needed here to keep humans commercially competitive with robots.

Acknowledgements: I’d like to thank Adrian Weller for first opening my eyes to these issues. This blog was written while attending the International Joint Conference on Artificial Intelligence (IJCAI) 2017 where I presented a paper on autonomous vehicle safety. Thank you to the conference organisers for an excellent forum to discuss these topics.
