LDV Capital is a thesis-driven early stage venture fund investing in people building visual technology businesses that leverage computer vision, machine learning and artificial intelligence to analyze visual data.
The power of object recognition and the transformative effect of deep learning to analyze scenes and parse content can have a lot of impact in advertising. At the 2016 Annual LDV Vision Summit, Ken Weiner, CTO at GumGum, told us about the impact of image recognition and computer vision in online advertising.
The 2017 Annual Vision Summit is this week, May 24 & 25, in NYC. Come see new speakers discuss the intersection of business and visual tech.
I’m going to talk a little bit about advertising and computer vision and how they go together for us at GumGum. Digital images are basically showing up everywhere you look. You see them when you're reading editorial content. You see them when you're looking at your social feeds. They just can't be avoided these days. GumGum has basically built a platform with computer vision engineers that tries to identify a lot of information about the images that we come across online. We try to do object detection. We look for logos. We detect brand safety, sentiment analysis, all those types of things. We basically want to learn as much as we can about digital photos and images for the benefit of advertisers and marketers.
The question is: what value do marketers get from having this information? Well, for one thing, if you're a brand, you really want to know: how are users out there engaging with your brand? We look at the fire hose of social feeds and search, for example, for brand logos. In this example, Monster Energy drink wants to find all the images out there where their drink appears in the photo. You have to remember that about 80% of the photos out there might have no textual information identifying that Monster is involved in the photo, but it is. You really need computer vision in order to understand that.
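Logo-detection pipelines like GumGum's are proprietary, and production systems use learned detectors, but the core matching idea can be illustrated with a toy sketch. The example below (function names and the threshold are hypothetical) slides a logo template across a grayscale image and flags a match when the normalized cross-correlation score is high enough:

```python
import numpy as np

def match_template(image, template):
    """Slide a logo template over a grayscale image and return the
    normalized cross-correlation score at every valid position."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    scores = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            scores[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return scores

def contains_logo(image, template, threshold=0.9):
    """Flag an image as containing the logo if any window correlates
    strongly enough with the template."""
    return match_template(image, template).max() >= threshold
```

A real system would replace the brute-force scan with a trained detector robust to scale, rotation, and occlusion, but the "score every location, threshold the best match" structure is the same.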
Why do they do that? They want to look at how people engage with them. They want to look at how people are engaging with their competitors. They may want to just understand what is changing over time. What are maybe some associations with their brand that they didn't know about that might come up. For example, what if they start finding out that Monster Energy drinks are appearing in all these mountain biking photos or something? That might give them a clue that they should go out and sponsor a cycling competition. The other thing they can find out with this is who are their main brand ambassadors and influencers out there. Tools like this give them a chance to connect with those people.
What makes [in-image] even more powerful is if you can connect the brand message with that image in a very contextual way and tap into the emotion that somebody’s experiencing when they’re looking at a photo.
Another product that’s been very successful for us is something we call in-image advertising. We came up with this kind of unit about eight years ago. It was really invented to combat what people call banner blindness, which is the notion that, out on a web page, you start to learn to ignore the ads that are showing at the top and the side of the page. If you were to place brand messages right in line with content that people are actively engaged with, you have a much better chance of reaching the consumer. What makes it even more powerful is if you can connect the brand message with that image in a very contextual way and tap into the emotion that somebody’s experiencing when they’re looking at a photo. Just the placement alone for an ad like this receives 10x the performance of traditional advertising because it’s something that a user pays attention to.
Obviously, we can build a big database of information about images and be able to contextually place ads like this, but sometimes situations will come from advertisers that won’t be able to draw upon our existing knowledge. We’ll have to go out and develop custom technology for them. For example, L’Oréal wanted to advertise a product for hair coloring. They asked us if we could identify every image out on different websites and identify the color of the hair of the people in the images so that they could strategically target the products that go along with those hair colors. We ran this campaign for them. They were really, really happy with it.
They liked it so much that they came back to us, and they said, “We had such a good experience with that. Now we want you to go out and find people that have bold lips,” which was a rather strange notion for us. Our computer vision engineers came up with a way to segment the lips and figure out, “What does boldness mean?” L’Oréal was very happy. They ran a lipstick campaign on these types of images.
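GumGum hasn't published how its "boldness" measure works, so the following is purely a hypothetical sketch of the idea: once the lips are segmented, score them on saturation and redness relative to the rest of the face. The mask, the weighting, and the score itself are all assumptions for illustration:

```python
import numpy as np

def boldness_score(rgb, lip_mask):
    """Toy 'boldness' score for a lip region: how saturated and how
    red the masked pixels are relative to the rest of the face.
    `rgb` is an (H, W, 3) float array in [0, 1]; `lip_mask` is a
    boolean (H, W) array marking lip pixels."""
    mx = rgb.max(axis=2)
    mn = rgb.min(axis=2)
    # HSV-style saturation, guarding against division by zero.
    saturation = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0)
    # How much red dominates green and blue at each pixel.
    redness = rgb[..., 0] - 0.5 * (rgb[..., 1] + rgb[..., 2])
    lip = saturation[lip_mask].mean() + redness[lip_mask].mean()
    rest = saturation[~lip_mask].mean() + redness[~lip_mask].mean()
    return lip - rest  # higher = bolder lips relative to the face
```

On a pale face with a saturated red lip region this score is large; on a uniformly pale face it is near zero, which is the kind of contrast a "bold lips" targeting rule would need.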
A couple years ago, we had a very interesting in-image campaign that I think might be the first time that the actual content that you're viewing became part of the advertising creative. For Lifetime TV, who wanted to advertise the TV series Witches of East End, we looked for photos where people were facing forward. When we encountered those photos, we dynamically overlaid green witch eyes onto these people, giving the impression that they become a little witchy for a few seconds. Then that collapses into a traditional in-image ad, where somebody who's intrigued by the eyes can click on it to watch a video lightbox preview of the show.
I just thought this was one of the most interesting ad campaigns I’ve ever seen because it mixes the notion of content and creative into one. What’s coming after this? Naturally, this will extend into video. TV networks are already training you to look at information in the lower third of the screen. It’s only natural that this will get replaced by contextual advertising the same way we’ve done it for images online.
Another thing that I think is coming soon is the ability to really annotate specific products and items inside images at scale. People have tried to do this using crowdsourcing in the past, but it’s just too expensive. When you're looking at millions of images a day like we do, you really need information to come in a more automated way. There’s been a lot of talk about AR. Obviously, advertising’s going to have to fit into this in some way or another. It may be a local direct response advertiser. You're walking down the street. Someone gives you a coupon for McDonald’s. Maybe it’ll be a brand advertiser. You see a car accident, and they’re going to remind you that you need to get car insurance.
Lastly, I wanted to pose the idea of in-hologram ads that I think could come in the future if these things like Siri and Alexa … Now they’re voice, but in the future, who knows? They might be 3D images living in your living room, and advertisers are going to want a way to basically put their name on those holograms. Thank you very much.
At the LDV Vision Summit 2018, Rebecca Kaden of Union Square Ventures shared her insights on investing at the intersection of standout consumer businesses and vertical networks.
Rebecca and Evan Nisselson of LDV Capital discussed the greatest challenges they have seen their portfolio companies endure and how strategic team building is critical to success at the earliest stages of company building. Watch their chat here:
One week until our LDV Vision Summit 2018 - May 23 & 24 in NYC at the SVA Theatre. Limited tickets are still available to see 80 speakers in 40 sessions discuss the cutting edge in visual tech. Register now!
Raquel Urtasun is a recipient of the NVIDIA Pioneers of AI Award, three Google Faculty Research Awards, and several other honors. She lectures at the University of Toronto and the Vector Institute and is the head of Uber ATG Toronto. At our LDV Vision Summit 2017, she spoke about how autonomous vehicles with human perception will make our cities smarter and better to live in.
It's my pleasure to be here today, and I wanted to introduce who I am just in case you guys don't know.
So I have three jobs, which keeps me quite busy. I am still an academic: one day a week I am at the University of Toronto and the Vector Institute, which I co-founded with a whole bunch of people that you see in the picture, including Geoff Hinton. And the latest, greatest news, I guess: as of May 1st, 2017, I'm also heading a new lab of Uber ATG in Toronto, so self-driving cars are in Canada now, and that's really, really exciting.
Today, I'm going to talk about what led to the Uber acquisition [of the University of Toronto team]. Perhaps you have already seen another discussion about why we need self-driving cars, but what is very important for me is that we need to lower the risk of accidents, we need to provide mobility for the many people that right now cannot go to the places they want to go, and we need to think of the future of public transportation and ride sharing. In particular, we need to share resources: ninety-five percent of the time a car is parked, so we are just taking up space on our planet for no real reason.
If we look at what is typically going on in self-driving car companies, we find they're pretty good at localization, path planning, and obstacle avoidance, but there are two things that they do which actually make them not super scalable. The first thing is LIDAR: the prices are dropping, but it is still quite expensive to buy a decent LIDAR. And the other thing, which is the skeleton in the closet, is actually mapping.
What I have been working on for the past seven years is how to make solutions that are scalable, meaning cheap sensors and trying to drive without maps, or with as little prior knowledge as possible.
Now, if you want to do something of this form, we need to think about many different things at once. The first thing that was difficult for us as academics was data, and so many years ago we created what is still the only benchmark for self-driving, which is KITTI. And to my despair, this is still the only benchmark, which I don't understand.
If we want to get rid of the LiDAR, get rid of the maps, one of the things that we need to have is robust, good, and fast stereo 3D reconstruction.
The other thing that is important is learning. Right, one can't just handcraft everything, because we need to be robust with scenarios that we have never seen before. We need holistic models to reason many things. At the end of the day, we have fixed computation for many things, many tasks, and we need to think of hardware at the same time.
If we want to get rid of the LiDAR and get rid of the maps, one of the things that we need to do is apply deep learning to get robust, good, and fast stereo 3D reconstruction. This can run in real time and, up to forty meters, can basically almost replace the LIDAR.
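The geometry behind stereo depth is standard: for a calibrated rig with focal length f (in pixels) and baseline B (in meters), a pixel whose left and right views are offset by disparity d lies at depth Z = f·B/d. A minimal sketch of that conversion (the calibration numbers in the usage example below are made up, not from any real rig):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a stereo disparity map (in pixels) to metric depth
    using Z = f * B / d. Zero disparity means the point is at
    infinity (or the match is invalid)."""
    d = np.asarray(disparity, dtype=float)
    return np.where(d > 0, focal_px * baseline_m / np.maximum(d, 1e-12), np.inf)

# Hypothetical rig: 700 px focal length, 0.5 m baseline.
# A 35 px disparity then corresponds to 700 * 0.5 / 35 = 10 m.
depth = disparity_to_depth([[35.0, 0.0]], focal_px=700.0, baseline_m=0.5)
```

The Z = f·B/d relation also explains the forty-meter remark above: depth error grows quadratically with distance for a fixed disparity error, so stereo is most reliable at close range.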
Another thing that you need to do is work on perception. We spent the past year and a half obsessed with instance segmentation. This is where you're segmenting the image: you have a single image and you are interested in labeling every pixel, not just with the category of car or road, but also estimating that this is one car, this is another car, etc. This is a particularly difficult problem for deep learning because the loss function has to be agnostic to permutations of the instance labels. So we've built some interesting technology lately based on the watershed transform. It scales really well: it's independent of the number of objects, so you can run in real time for anything. And it generalizes: it's trained on one set of cities and tested on another set of cities. You see the prediction in the middle and the ground truth on the right. So even with crowded scenes, [the model] can actually do pretty well.
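The watershed-based model itself isn't detailed in the talk, but the instance-labeling idea, and why instance ids are only defined up to permutation, can be shown with a toy connected-components sketch (a stand-in for illustration, not the actual method):

```python
import numpy as np

def label_instances(mask):
    """Toy instance labeling: split a binary 'car' mask into
    4-connected components, giving each object its own integer id.
    The ids themselves are arbitrary: any permutation of them is an
    equally valid answer, which is exactly what makes designing an
    instance-segmentation loss hard."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=int)
    next_id = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                      # pixel already claimed
        next_id += 1
        stack = [start]
        while stack:                      # flood-fill one component
            y, x = stack.pop()
            if labels[y, x] or not mask[y, x]:
                continue
            labels[y, x] = next_id
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]:
                    stack.append((ny, nx))
    return labels
```

A learned watershed approach predicts per-pixel "distance to object boundary" energies and then carves instances out of the energy landscape, but the output contract is the same as this toy: every pixel of the same object gets the same id, and the ids carry no meaning on their own.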
Now, if you want to do self-driving, labeling pixels is not going to get you there. You need to really estimate what's happening everywhere in the scene. These are our latest, greatest results on detection and tracking. This is actually very interesting technically: you can back-propagate through solvers. And here you see the results of what we have as well.
In general, what you want to do is estimate everything that is in the scene. Here we have some results from a couple of years ago, with a single camera mounted on top of the car. The car is driving in intersections it has never seen before and is able to estimate the local map of the intersection; it is creating the map on the fly. It is estimating where your car is, doing localization, as well as estimating where every other car is in the scene, and the traffic situation that you see on the bottom left, even though it doesn't see the traffic signals or things like that. The cars are color-coded with their estimated intentions: basically, here we are estimating where everybody is going in the next couple of seconds. And this is, as I said, [with a] single camera [in] new scenarios that we haven't trained on.
Another thing that you need to do is localization. Localization is an interesting problem, because typically it is done with maps: you go around and collect how the world looks, and that's really expensive, meaning that basically you need to know the appearance of the world [the cars] are in at every point in time.
It takes thirty-five seconds of driving to actually localize with a precision of 2 meters
We look at a cartographic map of the environment and the motion of the vehicle to estimate really quickly where the vehicle is in the global coordinate system. You see here that you have a probability distribution over the graph of the road. The vehicle is driving, you have a few modes of the distribution, and very quickly we know exactly where this vehicle is.
This is a Manhattan-like scenario: there are two modes of the distribution, but again, soon we are going to get down to a single location. And this is for the whole city of Karlsruhe, which is two thousand kilometers of road. It takes thirty-five seconds of driving to actually localize with a precision of 2 meters, which is the precision of the maps that we use. These maps are available for free online for sixty percent of the world, so you can just download them; you don't need to capture anything; it's free.
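The talk doesn't give the algorithm, but tracking a probability distribution over a road graph is classically done with a Bayes (histogram) filter: a motion update shifts the belief along the road as the vehicle drives, and a map-matching update reweights positions that agree with what is observed, until only one mode survives. A toy one-dimensional sketch, with an entirely hypothetical 10-cell circular road:

```python
import numpy as np

def motion_update(belief, steps=1):
    """Shift the belief along a circular, discretized road by the
    odometry, smoothing a little to account for motion noise."""
    b = np.roll(belief, steps)
    b = 0.8 * b + 0.1 * np.roll(b, 1) + 0.1 * np.roll(b, -1)
    return b / b.sum()

def measurement_update(belief, road_features, observed):
    """Reweight positions whose stored map feature matches what the
    vehicle currently observes (e.g. 'intersection' vs 'straight')."""
    likelihood = np.where(road_features == observed, 0.9, 0.1)
    b = belief * likelihood
    return b / b.sum()

# Hypothetical 10-cell circular road; 1 marks an intersection.
road = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
belief = np.full(10, 0.1)   # start completely uncertain
true_pos = 3
for _ in range(5):          # drive one cell per step and observe
    true_pos = (true_pos + 1) % 10
    belief = motion_update(belief)
    belief = measurement_update(belief, road, road[true_pos])
# After a few steps the belief collapses onto the true position.
```

The multimodal phase, where several road locations are still consistent with the feature sequence seen so far, is exactly the "few modes of the distribution" described above; driving a little longer disambiguates them.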
Now, in terms of mapping: why do car companies, or self-driving car players, use maps? You can think of a map as a sensor which basically tells you the static part of the scene. It gives you robustness and allows you to only look at the dynamic objects.
The problem with the way mapping is done is that you have, say, one of these cars with expensive sensors, and basically you drive around the world, you get your data, and then there is some labeling process where you basically say where the roads are, where the lanes are, where the possible places to park are, etc. That gives you very small coverage, because this is at the vehicle level and is very expensive. As an academic, I asked: can we actually do this by spending zero dollars?
With that in mind, we figured you can use aerial or satellite images. Satellites pass around the earth twice a day, so you have an up-to-date view of the world. And we created methods that can automatically extract HD maps of the form that you see on the top, where you have lanes, parking spots, sidewalks, etc. It takes only 3 seconds on a single computer to estimate this per kilometer of road. Basically, with a very small cluster of computers, you can run the whole world and have up-to-date estimates.
Five and a half years ago, I created KITTI. And one thing that's bugged me about mapping is that it is only the players, the companies, that are actually working on it. So I created TorontoCity, which is about to go online soon. The Greater Toronto Area is twenty percent of the population of Canada; it's huge, and we have all these different views: panoramas, LiDAR, cameras from aerial views, drones, etc.
Now, as an academic, I cannot pay labelers to label [the images]; the aerial images alone would cost twenty to thirty million dollars to label. What I did was go to the government and pull all the information from maps that the government has captured: 3D maps of the city, every single building, etc. Then, with the help of algorithms that can align the sources of information, including all the different sources of imagery as well as the maps, we automatically created ground truth. And here you see the quality of the ground truth is really, really good. Now we have ground truth for the whole Greater Toronto Area, and we're going to put the benchmark online, where these are the tasks that you can participate in, for instance, semantic segmentation.
Something we have built since then is ways to extract these maps automatically. You can do it from aerial images, and one of the things that was interesting is that from the panoramas you can actually get centimeter-accurate maps automatically. That was actually quite interesting. All right, to conclude: for the last seven years, my group has been working on ways to make affordable self-driving cars that scale, with sensing and perception, localization, and mapping. Thank you.
LDV Capital is focused on investing in people building visual technology businesses. Our LDV Vision Summit explores how visual technologies leveraging computer vision, machine learning and artificial intelligence are revolutionizing how humans communicate and do business.
Tickets are available for the LDV Vision Summit 2018, where you can hear from other amazing visual tech researchers, entrepreneurs, and investors.
Raquel Urtasun - Autonomous Vehicles with Human Perception Will Make Our Cities Smarter and Better to Live In - Vimeo
Divyaa Ravichandran was a finalist in the 2016 Entrepreneurial Computer Vision Challenge (ECVC) at the LDV Vision Summit. Her project, “Love & Vision,” used Siamese neural networks to predict kinship between pairs of facial images. It was a major success with the judges and the audience. We asked Divyaa some questions about what she has been up to over the past year since her phenomenal performance:
How have you advanced since the last LDV Vision Summit? After the Vision Summit I began working as an intern at a startup in the Bay Area, PerceptiMed, where I worked on computer vision methods to identify pills. I specifically worked with implementing feature descriptors and testing their robustness in detection tasks. Since October 2016, I’ve been working at Facebook as a software engineer.
What are the 2-3 key steps you have taken to achieve that advancement?
a. Stay on the lookout for interesting opportunities, like the LDV Vision Summit
b. ALWAYS stay up-to-date in the tech industry so you know what counts and who's who
What project(s)/work is your focus right now at or outside of Facebook? Without giving any specifics, I'm working with neural networks, surrounded by some of the brightest minds I have come across, and with the use of Facebook's resources the opportunities to improve are boundless.
What is your proudest accomplishment over the last year? Snagging this gig with Facebook was kind of the highlight of my year; working on projects that have the potential to impact and improve so many lives has me pretty psyched!
What was a key challenge you had to overcome to accomplish that? How did you overcome it? I think visibility was one big point: I wasn't highly visible as a candidate for the Facebook team since I had only just graduated from school and didn't have any compelling publications or such to my name. Fortunately, my attendance at the LDV Vision Summit last year gave me that visibility, and the Facebook team got in touch with me because of that.
Did our LDV Vision Summit help you? If yes, how? Yeah, it was through LDV that I came in contact with my current employer at Facebook! I also met some really interesting people from some far-off places, like Norway, for instance. It put into perspective how the field is growing the world-over.
What was the most valuable aspect of competing in the ECVC for you? The fact that the summit puts the guys with the money (the VCs) in touch with the guys with the tech (all the people making Computer Vision pitches) really bridges the gap between two shores that I think would do very well juxtaposed with each other. Personally, it opened my eyes to new ideas that people in the field were looking at and what problems they were trying to tackle, something that I wouldn't have been able to think up myself.
What recommendation(s) would you make to teams submitting their projects to the ECVC? Stay current, but if you're bringing something entirely new to the table, that would be best! Everybody at ECVC is looking to be blown away (I think) so throwing something totally new and unexpected their way is the best way to get their attention.
What is your favorite Computer Vision blog/website to stay up-to-date on developments in the sector? I generally read Tombone's CV blog, by Tomasz Malisiewicz*, and follow CV conferences like ECCV, ICML, CVPR to look up the bleeding edge in the industry and this usually gives a fair idea of the biggest problems people are looking to tackle in the current age.
*Editor’s Note: Tomasz Malisiewicz was a speaker at the 2016 Vision Summit
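A quick aside on the technique behind “Love & Vision”: Divyaa's exact architecture isn't public, but the defining trait of a Siamese network is a single shared encoder applied to both input images, with a similarity score computed on the resulting embeddings. A toy numpy sketch of that idea (the one-layer encoder and the inputs are hypothetical stand-ins, not her model):

```python
import numpy as np

def embed(face, weights):
    """Shared encoder: both images pass through the SAME weights --
    that weight sharing is what makes the network 'Siamese'."""
    h = np.tanh(face @ weights)      # one toy fully connected layer
    return h / np.linalg.norm(h)     # unit-normalize the embedding

def kinship_score(face_a, face_b, weights):
    """Cosine similarity between the two embeddings; a learned
    threshold on this score would decide 'kin' vs 'not kin'."""
    return float(embed(face_a, weights) @ embed(face_b, weights))

# Hypothetical 16-dim face features and random encoder weights.
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 4))
parent = rng.standard_normal(16)
child = parent + 0.05 * rng.standard_normal(16)  # similar features
stranger = rng.standard_normal(16)
```

In a real system the encoder is a deep convolutional network and the weights are trained with a contrastive or similar pairwise loss, so that related faces land close together in embedding space and unrelated faces land far apart.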
At the LDV Vision Summit 2017, Evan Nisselson had the privilege to sit down with Josh Kopelman, the self-described "accidental VC" and Partner at First Round Capital, to discuss investment trends and what Josh looks for in a founder at the seed stage.
According to Josh, First Round Capital either invests early or way too early. Each technology has a window in which it is investable and you want to avoid funding a company too early. At First Round, they are investing in things that are trying to solve common problems. Watch Josh and Evan's fireside chat to learn more:
Fireside Chat - Josh Kopelman of First Round Capital and Evan Nisselson of LDV Capital - Vimeo
Our fifth annual LDV Vision Summit will be May 23 & 24, 2018 in NYC. Early bird tickets are currently on sale. Sign up to our LDV Vision Summit newsletter for updates and deals on tickets.