The world's population is growing faster than ever. Scientists predict we will cross 9 billion people this century. And while there are arguments about when that growth will stop and how quickly, one thing is clear: we are going to run out of food soon.
Recent research proposes that in order to accommodate such a large population, we are going to have to change our diet. Specifically, we will need to move away from meat and consume mostly (if not exclusively) fruits and vegetables.
Meat consumption has dire effects on the environment. It is expensive to produce and raises moral concerns, but above all it is extremely inefficient. An average cow will consume 30 times more calories in its lifetime than it will provide. The ratio is much smaller (about 5 to 1) for poultry. And of course, the most efficient option is to consume plant-based calories.
The worldwide trend is clear: as countries get wealthier, their meat consumption increases. Together with the fact that the world’s population is larger than ever and growing, we are heading towards an environmental crisis and likely famine.
What can we do? Other than switching to a vegetarian diet, we also need to start thinking about optimizing our farming. This is where robotics, computer vision and machine learning can help. For example, fully automated farms controlled by algorithms can be extremely efficient and minimize waste and space, and therefore cost and environmental footprint.
Tracking the 6D pose of an object is an important task with many applications. For example, it’s a key component for AR, VR and many robotics systems.
There are many approaches to solving this problem. Some apply when you know something about the size of the object and can track features on it. Others apply when the tracked object is a moving camera (i.e., SLAM). These techniques all rely, directly or indirectly, on the principle of stereoscopic vision: they relate multiple views to compute depth. The main issues with such methods are weakly textured objects, low-resolution video and low-light conditions.
An alternative approach is using depth sensing data acquired by RGB-D cameras. Such solutions are quite reliable but expensive in power, cost, weight, etc. This makes them less than optimal for mobile devices.
People, however, seem to be able to determine the pose of an object effortlessly from a single image, with no depth sensor. This is likely because we have vast knowledge of object sizes and appearance. Which, of course, raises the question: can we estimate pose using machine learning?
In a recent paper, the authors propose an extension of YOLO to 3D. YOLO is a deep learning algorithm that detects objects by determining the correct 2D bounding box. Extending YOLO is therefore pretty straightforward: we want to learn the 2D corners of a projected 3D bounding box. If we can do that, reconstructing the 3D pose of the bounding box is simple.
The input to the system is a single RGB image. The output is the object's 6D pose. The proposed solution runs at an impressive 50 frames per second, which makes it suitable for real-time object tracking. The key component of this system is a new CNN architecture that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm.
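The geometry behind the final step can be sketched in a few lines. The snippet below, with made-up box dimensions, intrinsics and pose, shows the forward projection that the network's 2D predictions approximate; at inference time a PnP solver (e.g., OpenCV's solvePnP) inverts exactly this mapping to recover the rotation and translation.

```python
import numpy as np

# Hypothetical object: a box of size 10 x 6 x 4 (arbitrary units)
# centered at the origin. The control points are its 8 corners plus
# the centroid -- the 2D targets the network learns to predict.
w, h, d = 10.0, 6.0, 4.0
corners = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                    for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
points_3d = np.vstack([corners, [[0.0, 0.0, 0.0]]])  # 9 x 3

# Assumed (made-up) camera intrinsics.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# A ground-truth 6D pose: rotation about the z axis plus a translation.
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 50.0])

# Forward projection: rotate/translate into the camera frame, then
# project through the intrinsics. A PnP solver inverts this mapping:
# given points_2d and points_3d, it recovers R and t.
cam_points = points_3d @ R.T + t
proj = cam_points @ K.T
points_2d = proj[:, :2] / proj[:, 2:3]
print(points_2d[-1])  # projected centroid, in pixels
```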
The authors show some impressive results. Here’s an image showing the computed bounding box for a few objects:
Deep Neural Networks have enabled tremendous progress in computer vision. They help us solve many detection, recognition and classification tasks that seemed out of reach not too long ago. However, DNNs are known to be vulnerable to adversarial examples. That is, one can tweak the input to mislead the network. For example, it is well documented that small perturbations can dramatically increase prediction errors. Of course, the resulting image looks like it was artificially manipulated and the process can be mitigated with de-noising techniques.
In this paper the authors propose a new class of adversarial examples. The approach is quite simple. The authors convert the image from RGB into HSV (Hue, Saturation and Value). Then, they randomly shift hue and saturation while keeping the value fixed.
Here’s an example of manipulating Hue and Saturation for a single image:
The resulting image looks very much like the original to the human visual system. At the same time, it has a dramatic negative impact on the pre-trained network. In this paper, the CIFAR10 dataset was manipulated as described and then tested on a VGG16 network. The resulting accuracy dropped from ~93% to about 6%.
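For intuition, here's a minimal sketch of the attack using Python's standard colorsys module: shift hue (which wraps around the color circle) and saturation while leaving value untouched. The toy pixel list and shift ranges are illustrative, not the paper's exact setup.

```python
import colorsys
import random

# A toy "image": a few RGB pixels in [0, 1]. A real attack would be
# applied to e.g. CIFAR10 images before feeding them to the network.
image = [(0.8, 0.2, 0.2), (0.2, 0.8, 0.2), (0.2, 0.2, 0.8), (0.5, 0.5, 0.1)]

def hsv_shift(pixels, hue_shift, sat_shift):
    """Shift hue (wrapping around the color circle) and saturation
    (clipped to [0, 1]) while leaving the value channel untouched."""
    out = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        h = (h + hue_shift) % 1.0              # hue is circular
        s = min(1.0, max(0.0, s + sat_shift))  # saturation is clipped
        out.append(colorsys.hsv_to_rgb(h, s, v))
    return out

# Random hue shift, small random saturation shift, value held fixed.
random.seed(0)
adversarial = hsv_shift(image, random.uniform(0.0, 1.0),
                        random.uniform(-0.1, 0.1))
```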
Here are some classification results. You can clearly see how the change in hue and saturation throws off the network and leads to pretty much random classification:
There are so many cool areas where computer vision can make a difference. We typically think about autonomous cars, robotic manipulators, or face recognition. But there are so many other areas. One of them is agriculture. Computer vision can make certain tasks easier, more efficient and more accurate. This could lead to cheaper and more available food supply.
Here’s a short review of a paper by Oppenheim et al. from Ben-Gurion University in Israel. You can find the publication here.
Oppenheim et al. write about using computer vision techniques to detect and count yellow tomato flowers in a greenhouse. This work is more practical than academic. They develop algorithms that can handle real-world conditions such as uneven illumination, complex growth conditions and different flower sizes.
Interestingly, the proposed solution is designed to run on a drone. This has significant implications on SWaP-C: size, weight, power and cost. It also greatly affects what is computationally feasible.
The proposed algorithm begins with computing the lighting conditions and transforming the RGB image into the HSV color space. Then, the image is segmented into background and foreground using color cues, and finally a simple classification is performed on the foreground patches to determine whether they are flowers or not.
Converting the image to HSV is straightforward. Computing the lighting condition is about figuring out whether the image is dark or bright. The authors use two indicators: the median value of saturation (S channel) and the skew of the saturation histogram.
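As a rough sketch of these two indicators, the snippet below computes the median saturation and a Pearson-style skew as a simple stand-in for the paper's histogram skew. The example saturation values are hypothetical.

```python
import statistics

def lighting_indicators(saturation):
    """Median saturation plus a Pearson-style skew of the saturation
    distribution (a stand-in for the paper's histogram skew)."""
    med = statistics.median(saturation)
    mean = statistics.mean(saturation)
    sd = statistics.pstdev(saturation)
    skew = 3.0 * (mean - med) / sd if sd > 0 else 0.0
    return med, skew

# Hypothetical S channels: a washed-out bright scene has mostly low
# saturation with a long tail (positive skew); a darker scene skews
# the other way. These numbers are illustrative only.
bright = [0.05, 0.1, 0.1, 0.15, 0.2, 0.9]
dark = [0.5, 0.6, 0.7, 0.7, 0.8, 0.85]
print(lighting_indicators(bright))
print(lighting_indicators(dark))
```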
Here’s how the flowers are imaged from different perspectives by the drone:
During segmentation, and after considering the lighting conditions, the paper simply suggests thresholding yellow pixels. It helps, of course, that the flower’s yellow color is very distinguishable. The idea is to keep pixels that are yellow with low saturation while removing very bright pixels from consideration.
Next, morphological operations are performed to eliminate small disconnected regions and “close” holes in otherwise large segments. This creates more coherent image patches in which the algorithm believes it detected a flower.
The last step is classification. The algorithm goes over all the connected components / image patches it extracted during segmentation and cleaned up with morphological operations. Small connected components are discarded. The remaining connected components are considered to be good exemplars of yellow tomato flowers.
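A toy version of this segment-then-filter pipeline might look like the following. The hue/saturation thresholds, minimum component size, and tiny test image are all illustrative (the paper's actual values differ), and a fuller implementation would add the morphological closing step.

```python
import colorsys
from collections import deque

def detect_flowers(image, min_size=3):
    """Threshold yellowish, weakly saturated, not-too-bright pixels,
    then keep connected components above a minimum size."""
    rows, cols = len(image), len(image[0])
    mask = [[False] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            h, s, v = colorsys.rgb_to_hsv(*image[y][x])
            # Yellow hue band, low-ish saturation, not over-exposed.
            if 0.10 <= h <= 0.20 and s <= 0.8 and v <= 0.95:
                mask[y][x] = True
    # Connected components (4-connectivity) via BFS. A fuller version
    # would first apply morphological closing to the mask.
    seen = [[False] * cols for _ in range(rows)]
    flowers = []
    for y in range(rows):
        for x in range(cols):
            if mask[y][x] and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                                   (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(comp) >= min_size:  # discard tiny regions
                    flowers.append(comp)
    return flowers

# A 4x5 toy image: green background, one 4-pixel yellow "flower",
# plus one isolated yellow noise pixel that should be discarded.
yellow, green = (0.9, 0.85, 0.2), (0.2, 0.6, 0.2)
img = [[green] * 5 for _ in range(4)]
for y, x in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    img[y][x] = yellow
img[3][4] = yellow
print(len(detect_flowers(img)))  # 1
```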
Here’s the algorithm’s performance according to the paper:
The plots show the algorithm’s performance is best with a front view of the flower. Of course, that’s not surprising, as this would be the clearest perspective.
I like the fact that this work tackles a practical problem. Solving this has clear applications. What I found a bit missing is a discussion about the more challenging cases of flower arrangement (e.g., flowers overlapping in the image). In addition, I’d be curious to know how this method compares to a machine learning based approach that learns a model of the tomato flower from examples.
All images and data are from the paper.
I highly recommend reading it for a more thorough understanding of the work.
Image processing was for a long time focused on analyzing individual images. With early success and more compute power, researchers began looking into videos and 3D reconstruction of the world. This post will focus on 3D. There are many ways to get depth from visual data, I’ll discuss the most popular categories.
There are many dimensions to compare 3D sensing designs. These include compute, power consumption, manufacturing cost, accuracy and precision, weight, and whether the system is passive or active. For each design choice below, we’ll discuss these dimensions.
The most basic technique for depth reconstruction is having prior knowledge. In a previous post (Position Tracking for Virtual Reality), I showed how this can be done for a Virtual Reality headset. The main idea is simple: if we know the shape of an object, we know what its projected image will look like at a fixed distance. Therefore, if we have the projected image of a known object, it’s straightforward to determine its distance from the camera.
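This is just the pinhole-camera relation. As a minimal sketch, with made-up numbers:

```python
# Pinhole relation: an object of known physical height H appears with
# pixel height p at depth z = f * H / p (f = focal length in pixels).
def depth_from_known_size(focal_px, real_height_m, pixel_height):
    return focal_px * real_height_m / pixel_height

# A 0.3 m tall marker imaged 120 px tall by an 800 px focal-length camera:
z = depth_from_known_size(800.0, 0.3, 120.0)
print(z)  # 2.0 (meters)
```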
This technique might not seem very important at first glance. However, keep in mind how good people are at estimating distance with one eye shut. The reason for that is our incredible amount of prior knowledge about the world around us. If we are building robots that perform many tasks, it’s reasonable to expect them to build models of the world over time. These models would make 3D vision easier than starting from scratch every time.
But, what if we don’t have any prior knowledge? Following are several techniques and sensor categories that are suitable for unknown objects.
Stereo vision is about using two cameras to reconstruct depth. The key is knowledge of the calibration of each camera (intrinsics) and between the cameras (extrinsics). With calibration, when we view a point in both cameras, we can determine its distance using simple geometry.
Here’s an illustration of imaging a point (x,y,z) by two cameras with a known baseline:
We don’t know the coordinate z of the point, but we do know where it gets projected on the plane of each camera. This gives us a ray starting at the plane of each camera with known angles. We also know the distance between the two projection points. What remains is determining where the two rays intersect.
It’s worth noting that to compute a ray’s angle we need to know the intrinsics of the camera, and specifically the focal length.
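For rectified cameras, the ray intersection reduces to the classic disparity relation z = f * b / d. A minimal sketch with illustrative numbers:

```python
def stereo_depth(focal_px, baseline_m, x_left, x_right):
    """Depth of a point seen at column x_left in the left image and
    x_right in the right image of a rectified stereo pair."""
    disparity = x_left - x_right  # in pixels; shrinks as depth grows
    if disparity <= 0:
        raise ValueError("point must have positive disparity")
    return focal_px * baseline_m / disparity

# Illustrative numbers: 700 px focal length, 12 cm baseline,
# and a 42 px disparity between the two views.
z = stereo_depth(focal_px=700.0, baseline_m=0.12, x_left=400.0, x_right=358.0)
print(z)  # 2.0 (meters)
```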
Stereo vision is intuitive and simple. Image sensors are getting cheaper while resolution is going up. And, stereo vision has a lot in common with human vision, which means such systems tend to see well exactly where people expect them to. And finally, stereo systems are passive (they don’t emit a signal into the environment), which makes them easier to build and energy efficient.
But stereo isn’t a silver bullet. First, matching points/features between the two views can be computationally expensive and not very reliable. Second, some surfaces either don’t have features (white walls, a shiny silver spoon, …) or have many features that are indistinguishable from each other. And finally, stereo doesn’t work when it’s too dark or too bright.
Structure from Motion
Structure from motion replaces the fixed second camera in the stereo design with a single moving camera. The idea is that as the camera moves around the world, we can use views from two different perspectives to reconstruct depth, much like we did in the stereo case.
Now’s a great time to say: “wait what?! didn’t you say stereo requires knowledge of the extrinsics?”. Great question! Structure from motion is more difficult in that we not only want to determine the depth of a point, but we also need to compute the camera motion. And no, it’s not an impossible problem to solve.
The key realization is that as the camera is moving, we can try to match many points between two views, not just one. We can assume that most points are part of the static environment (i.e., not moving either). Our camera motion model must therefore provide a consistent explanation for all points. Furthermore, with structure from motion we can easily obtain multiple perspectives (not just two like stereo).
Structure from motion can be understood as a large optimization problem where we have N points in k frames and we’re trying to determine 6*(k-1) camera motion parameters and N depth values for our points from N*k observations. Since each observation is a 2D image point, it contributes two equations. It’s easy to see how with enough points this problem quickly becomes overconstrained and therefore very robust. For example, in the above image N=6 and k=3: we need to solve a system of 36 equations with 18 unknowns.
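The bookkeeping can be sketched directly, counting each 2D image observation as two scalar equations:

```python
def sfm_counts(n_points, k_frames):
    """Unknowns vs. equations in the structure-from-motion problem:
    6 pose parameters per frame after the first, one depth per point;
    each 2D observation of a point contributes two scalar equations."""
    unknowns = 6 * (k_frames - 1) + n_points
    equations = 2 * n_points * k_frames
    return unknowns, equations

print(sfm_counts(6, 3))    # the N=6, k=3 example
print(sfm_counts(50, 10))  # more data: heavily overconstrained
```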
The key advantages of structure from motion vs. stereo include simple hardware (just one camera), simple calibration (we only need intrinsics), and simple point matching (because we can rely on tracking frame-to-frame).
There are two main disadvantages. First, structure from motion is more complex mathematically. Second, structure from motion provides a 3D reconstruction only up to a scale factor. That is, we can always multiply the depth of all our points by some number, similarly scale the camera’s motion, and everything remains consistent. In stereo, this scale ambiguity is resolved through calibration: we know the baseline between the two cameras.
Structure from motion is typically solved as either an optimization problem (e.g., bundle adjustment) or a filtering problem (e.g., an extended Kalman filter). It’s also quite common to mix the two: use bundle adjustment at a lower frequency and the Kalman filter at the (high) measurement frequency.
There are several technologies that try to measure depth directly. In contrast with the above methods, the goal is to recover depth from a single frame. Let’s discuss the most prominent technologies:
We already know the advantages and disadvantages of stereoscopic vision. But what if our hardware could make some disadvantages go away?
In active stereo systems such as https://realsense.intel.com/stereo/, a noisy infrared pattern is projected onto the environment. The system doesn’t know anything about the pattern, but its existence creates lots of features for stereo matching.
In addition, stereo matching is computed in hardware, making it faster and eliminating the processing burden on the host.
However, as its name implies, active stereo is an active sensor. Projecting the IR pattern requires additional power and imposes limitations on range. The addition of a projector also makes the system more complex and expensive to manufacture.
Structured light shares some similarities with active stereo. Here too a pattern is projected. However, in structured light a known pattern is projected onto the scene. Knowledge of how the pattern deforms when hitting surfaces enables depth reconstruction.
Typical structured light sensors (e.g., PrimeSense) project infrared light and have no impact on the visible range. That is, humans cannot see the pattern, and other computer vision tasks are not affected.
The advantages of structured light are that depth can be computed in a single frame using only one camera. The known pattern enables algorithmic shortcuts and results in accurate and precise depth reconstruction.
Structured light, however, has some disadvantages: the system complexity increases because of the projector, computation is expensive and typically requires hardware acceleration, and infrared interference is possible.
Time of Flight
Time of flight sensors derive their name from the physics behind the device: light travels at a known speed through a known medium (e.g., air). The sensor emits a light beam, typically infrared, and measures the time it takes the light to return to the sensor from different surfaces in the scene.
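The underlying arithmetic is just distance = speed × time, halved for the round trip. A sketch with an illustrative round-trip time:

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s (in vacuum; air is nearly the same)

def tof_distance(round_trip_seconds):
    # The beam travels to the surface and back, hence the factor of 2.
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# A pulse returning after roughly 20 nanoseconds:
d = tof_distance(20e-9)
print(round(d, 3))  # just under 3 meters
```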
Kinect One is an example of such sensor:
Time of Flight sensors are more expensive to manufacture and require careful calibration. They also have a higher power consumption compared to the other technologies.
However, time of flight sensors do all depth computation on chip. That is, every pixel measures the time of flight and returns a computed depth. There is practically no computational impact on the host system.
We reviewed different technologies for depth reconstruction. It’s obvious that each has advantages and disadvantages. Ultimately, if you are designing a system requiring depth, you’re going to have to make the right trade-offs for your setup.
A typical set of parameters to consider when choosing your solution is SWaP-C (Size, Weight, Power and Cost). Early on, it’s often better to choose a simple HW solution that requires significant computation and power. As your algorithmic solution stabilizes, it is easy to correct SWaP-C later on with dedicated hardware.