Microsoft is privileged to work with leading-edge customers and partners who are taking the power of the cloud and artificial intelligence and applying it to their businesses in novel ways. Our new series, How AI Transforms Business, features insights from a selection of these customers and partners. Join us in these conversations and see how your company and customers may be able to benefit from these solutions and insights.
1. How Can Autonomous Drones Help the Energy and Utilities Industry?
Headquartered in Norway, eSmart Systems develops digital intelligence for the energy industry and for smart communities. When it comes to next-generation grid management systems or efficiently running operations for the connected cities of the future, they are at the forefront of digital transformation. In a conversation with Joseph Sirosh, CTO of AI in Microsoft’s Worldwide Commercial Business, Davide Roverso, Chief Analytics Officer at eSmart Systems, talks about interesting new AI-enabled scenarios in the world of energy, utilities and physical infrastructure.
Welcome to How AI Transforms Business, a new series featuring insights from conversations with Microsoft partners who are combining deep industry knowledge with AI in novel ways and, in doing so, creating leading-edge intelligent business solutions for our digital age.
Our first episode features eSmart Systems, which is in the business of creating solutions to accelerate global progress towards sustainable societies. Headquartered in the heart of Østfold county, Norway, eSmart Systems develops digital intelligence for the energy industry and for smart communities. The company is strategically co-located with the NCE Smart Energy Markets cluster and the Østfold University College and thrives in a very innovative environment. When it comes to next-generation grid management systems, or efficiently running operations for the connected cities of the future or driving citizen engagement, the company is at the forefront of digital transformation.
We recently caught up with Davide Roverso, Chief Analytics Officer at eSmart Systems. Davide has many interesting things to share about where and how AI is being applied in the infrastructure industry. Among other things, he talks about how utility companies today are forced to fly manned helicopter missions over live electrical power lines just to perform routine inspections, and how, using AI, it is possible to have safer and more effective inspections that do not expose humans to this sort of risk.
Davide Roverso, Chief Analytics Officer, eSmart Systems, in conversation with Joseph Sirosh, Chief Technology Officer of Artificial Intelligence in Microsoft’s Worldwide Commercial Business.
Video and podcast versions of this session are available via the links below. Alternatively, just continue reading the transcript of their conversation below.
Joseph Sirosh: Davide, would you tell a little about eSmart Systems and yourself?
Davide Roverso: eSmart Systems is a small Norwegian startup, was established in 2013. The main area in which we work is building SaaS for the energy and utilities sector. So basically, it was founded by a group of people that had been working together for over 20 years in the energy and utilities space. They were first working a lot on power exchange software, and delivered power exchange to California, among others. And then, about 2012, they went for a kind of exploration trip to the US, to Silicon Valley and that area, and they visited Google and Amazon and Microsoft and Cloudera and tried to find what were the new biggest trends. And they came back home with a clear idea that they had to focus on cloud and AI. And of course, they used that in their core business and that was power and utilities.
So that’s how eSmart Systems started.
JS: And so, you have an analytics team, or now is it an AI team?
DR: We have 10 data scientists, so more than 10% of the company is data scientists, so we have a big focus on AI. When I started in eSmart Systems about three years ago we were just two, so I built quite a good group since then. And we use machine learning in a lot of different areas. Two main areas are specifically time series analysis and predictions, and the other is more on analyzing images – we use that for inspecting, for instance, power lines with drones.
JS: You must have a lot of interesting projects. So, tell me, in the power and utilities industry, where is AI used?
DR: Well, we mainly work with the DSOs, distribution system operators, which are kind of responsible for distributing power to end users. Up to a few years ago they were basically operating blind because the last lowest voltage network is not instrumented. But since the introduction of smart meters, every home now – well in most of the European countries they are rolling out smart meters and the same in most of the US – every home now basically has a sensor. So now, suddenly they have much more data they can use to more intelligently steer the grid. So, there AI we use mostly to make predictions of loads and consumption from different types of customers, both household and industry customers.
And this is very important information, especially now, with the large introduction of distributed energy resources – all the renewables that are coming online. A lot of people are installing solar panels on the roofs. A lot of end users are now what we call prosumers, so they both produce and consume electricity, so there’s a two-way flow of power and data. So, there are lots of opportunities to optimize this new kind of smart grid that is becoming more and more widespread now.
JS: Very interesting. So, what are some of the most exciting AI applications that you have seen now in the power industry and in what you are doing?
DR: We are developing some very exciting applications in the space of inspections. We are combining AI with drones. Of course, the electrical infrastructure is relatively old and requires quite a lot of maintenance and inspections. And, so far, these inspections have been mostly done manually, so periodically people actually walk along the lines and climb up the poles and check infrastructure. And the last few years they have started using helicopters, and they fly helicopters – quite dangerous missions because they have to be quite close to the power lines, and every year there are reports of near incidents. So, it is quite an expensive process, but it is, of course, necessary, and even more necessary as the infrastructure ages even more.
So, the idea here is to use drones to have a cheaper, more effective inspection. And here, it is very exciting to use all the new technology that we have today for this kind of image intelligence that we have, with deep networks and convolutional neural networks. So, recognizing infrastructure, recognizing different types of faults and anomalies.
“It is very exciting to use all the new technology that we have today… with deep networks and convolutional neural networks, [for] recognizing infrastructure, recognizing different types of faults and anomalies.”
JS: And so, how do you use the cloud?
DR: Our systems are basically deployed in the cloud. So, the smart meter / smart grid systems, they collect data from smart meters and upload everything in the cloud. And all the analysis – all the machine learning and AI – happens in the cloud. And the same for the drones. Well, there are different missions. If it’s kind of a periodic inspection, then time is not the big issue, you can analyze the images in batch, and then we use cloud for that. So, we upload – it can be hundreds of thousands of images – and process them in the cloud.
JS: So, what is the advantage that cloud brings you, cloud and AI together?
DR: It is scalability. Regardless of how many drones or how many pictures our customers are sending to the systems, we are able to serve those.
JS: Near instantly being able to provision as many resources as you want. Okay, that’s very good.
DR: Also, edge is very important, it’s not just the cloud, the intelligent…
JS: Intelligent cloud and intelligent edge.
DR: Because if you’re on a mission for finding a fault or outage as quickly as possible then you need intelligence on the edge. And you also need that if you want to have autonomous drones, of course. Because today, we still don’t have fully autonomous drones – we still have pilots that remotely pilot the drones – but of course, the longer-term vision is to have fully autonomous drones.
JS: So, have you developed a prototype of autonomous drones that can follow power lines?
DR: Yes, to follow power lines and then position itself in the optimum spots to take the correct pictures for the detailed inspection. So the drone is not doing the detailed inspection – that happens in the cloud – but is using edge AI to localize the components, the assets that we need to inspect and take the right pictures and then move on to the next.
JS: Is AI scary?
DR: Not today. But it can be, in the future, you know. You probably read Bostrom’s book “Superintelligence” that came out in 2014, I think. So, he envisioned like a superintelligence that will take over, and we will not even notice that because it will come so fast we won’t realize. But this is a long time away. But anyway, today there are philosophical and ethical questions that are important to ask ourselves. And there are big institutes both in the UK and in the US that focus on that, so that’s important. But today’s technologies can be weaponized in a way, so there is that kind of scary side of it, of using AI without ethical controls, for autonomous weapons. So, there are some initiatives there. In my opinion, there should be an international agreement on how to control autonomy.
JS: But all technologies are the same way, I would think.
DR: Of course.
JS: What are some of the most exciting AI developments you have seen recently?
DR: Well, of course, all the developments around visual intelligence as I call it – so all the analysis of images, segmentation, detecting objects, and things like that with deep neural networks, and convolutional neural networks – it’s very exciting. And one very exciting development is, of course, self-driving cars. That, for me, is very exciting, and I use it a lot as an example in my presentations because it both showcases vision development / technological development but also it’s an application that basically touches almost everyone. Everyone drives a car, at least in the developed world, so it’s one of the applications that will come – that we will feel – much more quickly than other ones. But, of course, all the developments around language and speech recognition, and all these new intelligent systems and bots that are coming, it’s very exciting developments. From the research point of view, I like a lot of what is happening around the games and gaming in AI. You know, we both started working on AI in the nineties, and at that time, well since the beginning, AI has been applied to games – from checkers, and then chess, Deep Blue beating Kasparov in ’97, and then, more recently, of course, AlphaGo, and AlphaZero, even more exciting and now the latest one with OpenAI playing Dota 2 – so, it’s a very nice way of developing new concepts. It doesn’t have direct applications in the real world, but it develops kind of fundamental capabilities that real world systems are going to need.
JS: Any thoughts about the applications of AI outside of the power industry, some of the most exciting other areas that you might be able to go into?
DR: Yeah, well – basically all the work that we are doing both around images and inspections is applicable to other…
JS: … all types of inspections. Yeah, one thing I heard sometime recently was about inspecting for lightning strikes on aircraft. And they were looking to see if you can use AI to identify, because today again somebody has to climb the airplane and go look at spots and see if there has been a lightning strike.
DR: Or inspecting like pipelines, or railways – any kind of infrastructure.
JS: Or even assets, even just counting assets, is one thing I heard, which was interesting.
DR: Almost limitless amount of applications.
JS: Very exciting. Any concluding thoughts on AI and its applications?
DR: Well, it’s very exciting times. I’ve been working in AI for 30 years and finally we see a lot of traction, and we see an explosion of applications and interest and money nonetheless coming into AI. And real applications that are both helpful and exciting.
JS: And do you think AI is being democratized – made available to software developers much more easily?
DR: Yeah, definitely. Today, basically anyone can experiment with AI. Maybe it’s still difficult to make an application that is production-ready if you are not a data scientist because you can fall in many places – you can make a lot of mistakes if you don’t know what you’re doing. But you can experiment and generate something useful in a much easier way than before. So, there’s been a lot of progress around that and there is going to be more progress – I cannot even say in the years to come, just weeks!
JS: Wonderful, it’s been a pleasure talking to you.
DR: Thank you, it’s been a pleasure.
“It’s very exciting times. I’ve been working in AI for 30 years and finally we see a lot of traction, and we see an explosion of applications and interest…”
We hope you enjoyed this post. This being our first episode in the series, we are eager to hear your feedback, so please share your thoughts and ideas below.
This post is authored by Anusua Trivedi, Senior Data Scientist at Microsoft.
This post builds on the MRC Blog where we discussed how machine reading comprehension (MRC) can help us “transfer learn” any text. In this post, we introduce the notion of and the need for machine reading at scale, and for transfer learning on large text corpuses.
Machine reading for question answering has become an important testbed for evaluating how well computer systems understand human language. It is also proving to be a crucial technology for applications such as search engines and dialog systems. The research community has recently created a multitude of large-scale datasets over text sources including:
Wikipedia (WikiReading, SQuAD, WikiHop).
News and newsworthy articles (CNN/Daily Mail, NewsQA, RACE).
Fictional stories (MCTest, CBT, NarrativeQA).
General web sources (MS MARCO, TriviaQA, SearchQA).
These new datasets have, in turn, inspired an even wider array of new question answering systems.
In the MRC blog post, we trained and tested different MRC algorithms on these large datasets. We were able to successfully transfer learn smaller text excerpts using these pretrained MRC algorithms. However, when we tried creating a QA system for the Gutenberg book corpus (English only) using these pretrained MRC models, the algorithms failed. MRC usually works on text excerpts or documents but fails for larger text corpuses. This leads us to a newer concept – machine reading at scale (MRS). Building machines that can perform machine reading comprehension at scale would be of great interest for enterprises.
Machine Reading at Scale (MRS)
Instead of focusing on only smaller text excerpts, Danqi Chen et al. came up with a solution to a much bigger problem which is machine reading at scale. To accomplish the task of reading Wikipedia to answer open-domain questions, they combined a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs.
MRC is about answering a query about a given context paragraph. MRC algorithms typically assume that a short piece of relevant text is already identified and given to the model, which is not realistic for building an open-domain QA system.
In sharp contrast, methods that use information retrieval over documents must employ search as an integral part of the solution.
MRS strikes a balance between the two approaches. It is focused on simultaneously maintaining the challenge of machine comprehension, which requires the deep understanding of text, while keeping the realistic constraint of searching over a large open resource.
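To make the retrieval half of this idea concrete, here is a minimal sketch of a search component that hashes unigrams and bigrams into buckets and ranks documents by a TF-IDF dot product. It is an illustration only, not the DrQA implementation: the names tokenize, hashed_ngrams, build_index and retrieve, the NUM_BUCKETS value, the CRC32 hash and the exact weighting formula are all assumptions made for this example, and the neural reader that would extract the answer span from the retrieved text is omitted.

```python
import math
import re
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20  # hypothetical hash-space size, chosen for illustration


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_ngrams(text):
    """Map every unigram and bigram of the text to a hash bucket."""
    tokens = tokenize(text)
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams]


def build_index(docs):
    """Return per-document bucket counts and an IDF weight per bucket."""
    counts = [Counter(hashed_ngrams(d)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n = len(docs)
    idf = {b: math.log((n + 1) / (df + 0.5)) for b, df in doc_freq.items()}
    return counts, idf


def retrieve(question, counts, idf, k=1):
    """Rank documents by the TF-IDF dot product with the question."""
    q = Counter(hashed_ngrams(question))
    scores = []
    for i, c in enumerate(counts):
        score = sum(
            math.log1p(qtf) * math.log1p(c[b]) * idf[b] ** 2
            for b, qtf in q.items()
            if b in c
        )
        scores.append((score, i))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scores[:k]]
```

In a full MRS system, the paragraphs returned by retrieve would be handed to a reading comprehension model, which is where the "deep understanding of text" part of the problem lives.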
Why is MRS Important for Enterprises?
The adoption of enterprise chatbots has been rapidly increasing in recent times. To further advance these scenarios, research and industry have turned toward conversational AI approaches, especially in use cases such as banking, insurance and telecommunications, where there are large corpuses of text logs involved.
One of the major challenges for conversational AI is to understand complex sentences of human speech in the same way humans do. The challenge becomes more complex when we need to do this over large volumes of text. MRS can address both of these concerns, because it can answer objective questions from a large corpus with high accuracy. Such approaches can be used in real-world applications like customer service.
In this post, we evaluate how well the MRS approach provides automatic QA capability across different large corpuses.
Training MRS – DrQA Model
DrQA is a system for reading comprehension applied to open-domain question answering. DrQA is specifically targeted at the task of machine reading at scale. In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (which may not be redundant). Thus, the system must combine the challenges of document retrieval (i.e. finding relevant documents) with that of machine comprehension of text (identifying the answers from those documents).
We use a Deep Learning Virtual Machine (DLVM) as the compute environment, with two NVIDIA Tesla P100 GPUs and the CUDA and cuDNN libraries. The DLVM is a specially configured variant of the Data Science Virtual Machine (DSVM) that makes it more straightforward to use GPU-based VM instances for training deep learning models. It is supported on Windows 2016 and the Ubuntu Data Science Virtual Machine. It shares the same core VM images – and hence the same rich toolset – as the DSVM, but is configured to make deep learning easier. All the experiments were run on a Linux DLVM. We use the PyTorch backend to build the models, and we pip installed all the dependencies in the DLVM environment.
We forked the Facebook Research GitHub repository for our blog work and trained the DrQA model on the SQuAD dataset. We use the pre-trained MRS model for evaluating our large Gutenberg corpuses using transfer learning techniques.
Children’s Gutenberg Corpus
We created a Gutenberg corpus consisting of about 36,000 English books. We then created a subset of the Gutenberg corpus consisting of 528 children’s books.
Pre-processing the children’s Gutenberg dataset:
Download books with filter (e.g. children, fairy tales etc.).
Clean the downloaded books.
Extract text data from book content.
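The three pre-processing steps above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the function names are hypothetical, the keyword filter is a stand-in for whatever catalog filter is used, and the cleaning step assumes that each plain-text book wraps its content between "*** START OF ..." and "*** END OF ..." marker lines, which is the usual Project Gutenberg convention but may not hold for every file.

```python
import re

# Assumed marker lines around the book content in Project Gutenberg text files.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def matches_filter(subjects, keywords=("children", "fairy tales")):
    """Step 1: keep a book if any subject tag mentions a keyword."""
    joined = " ".join(subjects).lower()
    return any(k in joined for k in keywords)


def strip_gutenberg_boilerplate(text):
    """Step 2: drop the license header and footer around the book body."""
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish]


def extract_text(raw):
    """Step 3: return the body as paragraphs with normalized whitespace."""
    body = strip_gutenberg_boilerplate(raw.replace("\r\n", "\n"))
    paragraphs = re.split(r"\n\s*\n", body)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

If a file has no marker lines, the sketch falls back to keeping the whole text, so a real pipeline would want to log and inspect those cases.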
How to Create a Custom Corpus for DrQA to Work?
We follow the instructions available here to create a compatible document retriever for the Gutenberg Children’s books.
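As a rough sketch of what a compatible input looks like, the snippet below writes one JSON object per line, each with an "id" and a "text" field. We are assuming here that this is the layout the DrQA retriever scripts expect, so please check it against the DrQA README before relying on it; the function name books_to_jsonl and the book identifiers are hypothetical.

```python
import json


def books_to_jsonl(books, out_path):
    """Write {book_id: text} as JSON lines with 'id' and 'text' fields.

    The field names are an assumption about the retriever's input format.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for book_id, text in books.items():
            record = {"id": book_id, "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(books)
```

The resulting file would then be passed to the database and TF-IDF build scripts described in the linked instructions.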
To execute the DrQA model:
Insert a query in the UI and click the search button.
This calls the demo server (flask server running in the backend).
For environment setup, please follow ReadMe.md in GitHub to download the code and install dependencies. For all code and related details, please refer to our GitHub link here.
MRS Using DLVM
Please follow similar steps listed in this notebook to test the DrQA model on DLVM.
Learnings from Our Evaluation Work
In this post, we investigated the performance of the MRS model on our own custom dataset. We tested the performance of the transfer learning approach for creating a QA system for 528 children’s books from the Project Gutenberg Corpus using the pretrained DrQA model. Our evaluation results are captured in the exhibits below and in the explanation that follows. Note that these results are particular to our evaluation scenario – results will vary for other documents or scenarios.
In the above examples, we tried questions beginning with What, How, Who, Where and Why – and there’s an important aspect about MRC that is worth noting, namely:
MRC is best suited for “factoid” questions. Factoid questions are about providing concise facts, e.g. “Who is the headmaster of Hogwarts?” or “What is the population of Mars?”. Thus, for the What, Who and Where types of questions above, MRC works well.
For non-factoid questions (e.g. Why), MRC does not do a very good job.
The green box represents the correct answer for each question. As we see here, for factoid questions, the answers chosen by the MRC model are in line with the correct answer. In the case of the non-factoid “Why” question, however, the correct answer is the third one, and it’s the only one that makes any sense.
Overall, our evaluation scenario shows that for generic large document corpuses, the DrQA model does a good job of answering factoid questions.
@anurive | Email Anusua at firstname.lastname@example.org for questions pertaining to this post.
A special guest post by cricket legend and founder of Spektacom Technologies, Anil Kumble. This post was co-authored by Tara Shankar Jana, Senior Technical Product Marketing Manager at Microsoft.
While cricket is an old sport with a dedicated following of fans across the globe, the game has been revolutionized in the 21st century with the advent of the Twenty20 format. This shorter format has proven to be very popular, resulting in a massive growth of interest in the game and a big fan following worldwide. This has, in turn, led to increased competitiveness and the desire on the part of both professionals and amateurs alike to take their game quality to the next level.
As the popularity of the game has increased, so have innovative methods of improving batting techniques. This has resulted in a need for data-driven assistance for players, information that will allow them to digitally assess their quality of game.
Spektacom was born from the idea of using non-intrusive sensor technology to harness data from “power bats” and using that data to power insights driven by the cloud and artificial intelligence.
Before we highlight how Spektacom built this solution using Microsoft AI, there are a couple of important questions we must address first.
What Differentiation Can Technology and AI Create in the Sports Industry?
In the last several years, the industry has realized the value of data across almost every sport and found ways to collect and organize that data. Even though data collection has become an essential part of the sporting industry, it is insufficient to drive future success unless coupled with the ability to derive intelligent insights that can then be put to use. Data, when harnessed strategically with the help of intelligent technologies such as machine learning and predictive analytics, can help teams, leagues and governing bodies transform their sports through better insights and decision-making in three critical areas of the business, namely:
Fan engagement and management.
Team and player performance.
Team operations and logistics.
For many professional sports teams and governing bodies, the factors that led to past success will not necessarily translate into future victories. The things that gave teams a competitive advantage in the past have now become table stakes. People are consuming sports in new ways and fans have come to expect highly personalized experiences, being able to track the sports content they want, whenever and wherever they want it. AI can help transform the way sports are played, teams are managed, and sports businesses are run. New capabilities such as machine learning, chatbots and more are helping improve even very traditional sports in unexpected new areas.
What Impact will Spektacom’s Technology Have on the Game?
Spektacom’s technology will help in a few critical areas of the game. It will:
Enhance the fan experience and fan engagement with the sport.
Enable broadcasters to use insights for detailed player analysis.
Allow grassroots players and aspiring cricketers to increase their technical proficiency.
Allow coaches to provide more focused guidance to individual players.
Allow professional cricketers to further enhance their performance.
This technology has the capability to change the face of cricket as we know it. Prior to the introduction of this technology, there has not been an objective data-driven way to analyze a batsman’s performance.
Let’s take a closer look at the solution Spektacom has built, in partnership with Microsoft.
The Spektacom Solution
Introducing the “power bat”, an innovative sensor-enabled bat that measures the quality of a player’s shot by capturing data and analyzing impact characteristics through wireless technology and cloud analytics.
This unique non-intrusive sensor platform weighs less than 5 grams and is applied behind the bat just like any ordinary sticker. Once affixed, it can measure the quality, speed, twist and swing of the bat as it comes in contact with the ball, and it can also compute the power transferred from the ball to the bat at impact. These parameters are used to compute the quality of the shot – information that both professional and amateur players (as well as their coaches and other stakeholders) can use to improve player performance via instant feedback.
The Spektacom solution is powered by Azure Sphere, Azure IoT Hub, Azure Event Hub, Azure Functions, Azure Cosmos DB and Azure Machine Learning.
The data from the power bats gets analyzed with powerful AI models developed in Azure and gets transferred to the edge for continuous player feedback. In the case of the professional game, the sticker communicates using Bluetooth Low Energy (BLE) with an edge device called the Stump Box, which is buried behind the wicket. The data from the Stump Box is transferred to and analyzed in Azure, and shot characteristics are provided to broadcasters in real time. Since cricket stadiums have wireless access restrictions and stringent security requirements, the Stump Box is powered by Microsoft’s Azure Sphere-based hardware platform to ensure secure communication between the bat, the edge device and Azure. In the case of amateur players, the smart bat pairs with the Spektacom mobile app to transfer and analyze sensor data in Azure.
Microsoft makes it easy for you to get going on your own innovative AI-based solutions – start today at the Microsoft AI Lab.
Announcing new open source contributions to the Apache Spark community for creating deep, distributed, object detectors – without a single human-generated label
This post is authored by members of the Microsoft ML for Apache Spark Team – Mark Hamilton, Minsoo Thigpen, Abhiram Eswaran, Ari Green, Courtney Cochrane, Janhavi Suresh Mahajan, Karthik Rajendran, Sudarshan Raghunathan, and Anand Raman.
In today’s day and age, if data is the new oil, labelled data is the new gold.
Here at Microsoft, we often spend a lot of our time thinking about “Big Data” issues, because these are the easiest to solve with deep learning. However, we often overlook the much more ubiquitous and difficult problems that have little to no data to train with. In this work we will show how, even without any data, one can create an object detector for almost anything found on the web. This effectively bypasses the costly and resource-intensive processes of curating datasets and hiring human labelers, allowing you to jump directly to intelligent models for classification and object detection completely in silico.
We apply this technique to help monitor and protect the endangered population of snow leopards.
Bing on Spark: Makes it easier to build applications on Spark using Bing search.
LIME on Spark: Makes it easier to deeply understand the output of Convolutional Neural Networks (CNN) models trained using SparkML.
High-performance Spark Serving: Innovations that enable ultra-fast, low latency serving using Spark.
We illustrate how to use these capabilities using the Snow Leopard Conservation use case, where machine learning is a key ingredient towards building powerful image classification models for identifying snow leopards from images.
Use Case – The Challenges of Snow Leopard Conservation
Snow leopards are facing a crisis. Their numbers are dwindling as a result of poaching and mining, yet little is known about how to best protect them. Part of the challenge is that there are only about four thousand to seven thousand individual animals within a potential 1.5 million square kilometer range. In addition, snow leopard territory is in some of the most remote, rugged mountain ranges of central Asia, making it nearly impossible to get there without backpacking equipment.
Figure 1: Our team’s second flat tire on the way to snow leopard territory.
To truly understand the snow leopard and what influences its survival rates, we need more data. To this end, we have teamed up with the Snow Leopard Trust to help them gather and understand snow leopard data. “Since visual surveying is not an option, biologists deploy motion-sensing cameras in snow leopard habitats that capture images of snow leopards, prey, livestock, and anything else that moves,” explains Rhetick Sengupta, Board President of Snow Leopard Trust. “They then need to sort through the images to find the ones with snow leopards in order to learn more about their populations, behavior, and range.” Over the years these cameras have produced over 1 million images. The Trust can use this information to establish new protected areas and improve their community-based conservation efforts.
However, the problem with camera-trap data is that the biologists must sort through all the images to distinguish photos of snow leopards and their prey from photos which have neither. “Manual image sorting is a time-consuming and costly process,” Sengupta says. “In fact, it takes around 300 hours per camera survey. In addition, data collection practices have changed over the years.”
We have worked to help automate the Trust’s snow leopard detection pipeline with Microsoft Machine Learning for Apache Spark (MMLSpark). This includes both classifying snow leopard images and extracting detected leopards to identify them and match them against a large database of known leopard individuals.
Step 1: Gathering Data
Gathering data is often the hardest part of the machine learning workflow. Without a large, high-quality dataset, a project is unlikely ever to get off the ground. However, for many tasks, creating a dataset is incredibly difficult, time consuming, or downright impossible. We were fortunate to work with the Snow Leopard Trust, who have already gathered 10 years of camera trap data and have meticulously labelled thousands of images. However, the Trust cannot release this data to the public, due to risks from poachers who use photo metadata to pinpoint leopards in the wild. As a result, if you are looking to create your own snow leopard analysis, you need to start from scratch.
Figure 2: Examples of camera trap images from the Snow Leopard Trust’s dataset.
Announcing: Bing on Spark
Confronted with the challenge of creating a snow leopard dataset from scratch, it’s hard to know where to start. Amazingly, we don’t need to go to Kyrgyzstan and set up a network of motion sensitive cameras. We already have access to one of the richest sources of human knowledge on the planet – the internet. The tools that we have created over the past two decades that index the internet’s content not only help humans learn about the world but can also help the algorithms we create do the same.
Today we are releasing an integration between the Azure Cognitive Services and Apache Spark that enables querying Bing and many other intelligent services at massive scales. This integration is part of the Microsoft ML for Apache Spark (MMLSpark) open source project. The Cognitive Services on Spark make it easy to integrate intelligence into your existing Spark and SQL workflows on any cluster using Python, Scala, Java, or R. Under the hood, each Cognitive Service on Spark leverages Spark’s massive parallelism to send streams of requests up to the cloud. In addition, the integration between SparkML and the Cognitive Services makes it easy to compose services with other models from the SparkML, CNTK, TensorFlow, and LightGBM ecosystems.
Figure 3: Results for Bing snow leopard image search.
We can use Bing on Spark to quickly create our own machine learning datasets featuring anything we can find online. Creating a custom snow leopard dataset takes only two distributed queries. The first query creates the “positive class” by pulling the first 80 pages of the “snow leopard” image results. The second query creates the “negative class” to compare our leopards against. We can perform this second search in two different ways, and we plan to explore them both in upcoming posts. Our first option is to search for images that look like what we would see out in the wild, such as empty mountainsides, mountain goats, foxes, and grass. Our second option draws inspiration from Noise Contrastive Estimation, a mathematical technique used frequently in the word embedding literature. The basic idea behind noise contrastive estimation is to classify our snow leopards against a large and diverse dataset of random images. Our algorithm should be able to tell a snow leopard not only from an empty photo, but also from a wide variety of other objects in the visual world. Unfortunately, Bing Images does not have a random image API we could use to make this dataset. Instead, we can use random queries as a surrogate for random sampling from Bing. Generating thousands of random queries is surprisingly easy with one of the multitude of online random word generators. Once we generate our words, we just need to load them into a distributed Spark DataFrame and pass them to Bing Image Search on Spark to grab the first 10 images for each random query.
With these two datasets in hand, we can add labels, stitch them together, dedupe, and download the image bytes to the cluster. SparkSQL parallelizes this process and can speed up the download by orders of magnitude. It only takes a few seconds on a large Azure Databricks cluster to pull thousands of images from around the world. Additionally, once the images are downloaded, we can easily preprocess and manipulate them with tools like OpenCV on Spark.
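The label, stitch, and dedupe step can be sketched in plain Python. This is a single-machine stand-in for illustration, not the MMLSpark API: the URL lists are hypothetical placeholders for the two query results, and `build_labelled_dataset` is a made-up helper name. Letting the positive label win when a URL shows up in both classes is one possible design choice, not something the original pipeline is known to do.

```python
def build_labelled_dataset(positive_urls, negative_urls):
    """Label two lists of image URLs, merge them, and drop duplicate URLs.

    A URL that appears in both classes keeps the positive label, so that a
    snow leopard returned by a random query is not marked as a negative.
    """
    labelled = {}
    # Insert negatives first so positives overwrite any overlap.
    for url in negative_urls:
        labelled[url] = 0
    for url in positive_urls:
        labelled[url] = 1
    return sorted(labelled.items())


# Hypothetical search results; real URLs would come from the image search.
snow_leopards = ["http://example.com/a.jpg", "http://example.com/b.jpg",
                 "http://example.com/a.jpg"]
random_images = ["http://example.com/c.jpg", "http://example.com/b.jpg"]

dataset = build_labelled_dataset(snow_leopards, random_images)
for url, label in dataset:
    print(label, url)
```

On a cluster, the same three operations would run as DataFrame transformations, with the image download happening in parallel afterwards.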
Figure 4: Diagram showing how to create a labelled dataset for snow leopard classification using Bing on Spark.
Step 2: Creating a Deep Learning Classifier
Now that we have a labelled dataset, we can begin thinking about our model. Convolutional neural networks (CNNs) are today’s state-of-the-art statistical models for image analysis. They appear in everything from driverless cars to facial recognition systems and image search engines. To build our deep convolutional network, we used MMLSpark, which provides easy-to-use distributed deep learning with the Microsoft Cognitive Toolkit on Spark.
MMLSpark makes it especially easy to perform distributed transfer learning, a deep learning technique that mirrors how humans learn new tasks. When we learn something new, like classifying snow leopards, we don’t start by re-wiring our entire brain. Instead, we rely on a wealth of prior knowledge gained over our lifetimes. We only need a few examples, and we quickly become high-accuracy snow leopard detectors. Amazingly, transfer learning creates networks with similar behavior. We begin by using a Deep Residual Network that has been trained on millions of generic images. Next, we cut off a few layers of this network and replace them with a SparkML model, like Logistic Regression, to learn a final mapping from deep features to snow leopard probabilities. As a result, our model leverages its previous knowledge in the form of intelligent features and can adapt itself to the task at hand with the final SparkML model. Figure 5 shows a schematic of this architecture.
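The idea of a frozen feature extractor with a small trainable head can be sketched in plain Python. Everything here is a toy assumption: `frozen_features` is a hypothetical stand-in for the truncated pre-trained network (it computes two hand-written features rather than deep ones), the 2x4 "images" are made up, and the logistic regression is trained with simple stochastic gradient descent rather than SparkML.

```python
import math


def frozen_features(image):
    """Hypothetical stand-in for the truncated pre-trained network.

    Maps a tiny grayscale image (a list of pixel rows) to two fixed features:
    mean brightness and mean horizontal contrast. It is never trained here.
    """
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    contrast = sum(abs(row[i + 1] - row[i]) for row in image
                   for i in range(len(row) - 1)) / len(pixels)
    return [mean, contrast]


def train_logistic_head(features, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights on top of the frozen features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the log loss for this single example.
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
            b -= lr * (p - y)
    return w, b


def predict(image, w, b):
    """Probability of the positive class for one image."""
    x = frozen_features(image)
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up training set: "spotted" images are positives, flat images negatives.
images = [
    [[0, 1, 0, 1], [1, 0, 1, 0]],
    [[0.2, 0.9, 0.1, 0.8], [0.9, 0.2, 0.8, 0.1]],
    [[0.5] * 4, [0.5] * 4],
    [[0.3] * 4, [0.3] * 4],
]
labels = [1, 1, 0, 0]

w, b = train_logistic_head([frozen_features(im) for im in images], labels)
print(round(predict([[1, 0, 1, 0], [0, 1, 0, 1]], w, b), 3))
print(round(predict([[0.4] * 4, [0.4] * 4], w, b), 3))
```

Only the head's weights change during training, which is why a handful of examples is enough in this sketch.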
It’s important to remember that our algorithm can learn from data sourced entirely from Bing. It did not need hand labeled data, and this method is applicable to almost any domain where an image search engine can find your images of interest.
Figure 5: A diagram of transfer learning with ResNet50 on Spark.
Step 3: Creating an Object Detection Dataset with Distributed Model Interpretability
At this point, we have shown how to create a deep image classification system that leverages Bing to eliminate the need for labelled data. Classification systems are incredibly useful for counting the number of sightings. However, classifiers tell us nothing about where the leopard is in the image; they only return a probability that a leopard is in an image. What might seem like a subtle distinction can really make a difference in an end-to-end application. For example, knowing where the leopard is can help humans quickly determine whether the label is correct. It can also be helpful for situations where there might be more than one leopard in the frame. Most importantly for this work, to understand how many individual leopards remain in the wild, we need to cross-match individual leopards across several cameras and locations. The first step in this process is cropping the leopard photos so that we can use wildlife matching algorithms like HotSpotter.
Ordinarily, we would need labels to train an object detector, that is, painstakingly drawn bounding boxes around the leopard in each image. We could then train an object detection network to reproduce these labels. Unfortunately, the images we pull from Bing have no such bounding boxes attached to them, making this task seem impossible.
At this point we are so close, yet so far. We can create a system to determine whether a leopard is in the photo, but not where the leopard is. Thankfully, our bag of machine learning tricks is not yet empty. It would be preposterous if our deep network could not locate the leopard. How could anything reliably know that there is a snow leopard in the photo without seeing it directly? Sure, the algorithm could focus on aggregate image statistics like the background or the lighting to make an educated guess, but a good leopard detector should know a leopard when it sees it. If our model understands and uses this information, the question is “How do we peer into our model’s mind to extract this information?”.
Thankfully, Marco Tulio Ribeiro and a team of researchers at the University of Washington have created a method called LIME (Local Interpretable Model-Agnostic Explanations) for explaining the classifications of any image classifier. This method allows us to ask our classifier a series of questions that, when studied in aggregate, tell us where the classifier is looking. What’s most exciting about this method is that it makes no assumptions about the kind of model under investigation. You can explain your own deep network, a proprietary model like those found in the Microsoft Cognitive Services, or even a (very patient) human classifier. This makes it widely applicable not just across models, but also across domains.
Figure 6: Diagram showing the process for interpreting an image classifier.
Figure 6 shows a visual representation of the LIME process. First, we take our original image and break it into “interpretable components” called “superpixels”. More specifically, superpixels are clusters of pixels that share a similar color and location. We then take our original image and randomly perturb it by “turning off” random superpixels. This results in thousands of new images which have parts of the leopard obscured. We can then feed these perturbed images through our deep network to see how our perturbations affect our classification probabilities. These fluctuations in model probabilities help point us to the superpixels of the image that are most important for the classification. More formally, we can fit a linear model to a new dataset where the inputs are binary vectors of superpixel on/off states, and the targets are the probabilities that the deep network outputs for each perturbed image. The learned linear model weights then show us which superpixels are important to our classifier. To extract an explanation, we just need to look at the most important superpixels. In our analysis, we use those that are in the top ~80% of superpixel importances.
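The perturb-and-fit loop described above can be sketched in plain Python. This is a simplified illustration, not the MMLSpark or reference LIME implementation: it works directly on on/off masks instead of real images, it omits the proximity weighting that the published LIME method applies to samples, and `toy_classifier` is a hypothetical stand-in for the deep network. The linear model is fitted by solving the normal equations with a tiny ridge term for numerical stability.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def lime_weights(classifier, n_superpixels, n_samples=500, seed=0):
    """Fit a linear model from superpixel on/off states to classifier output."""
    rng = random.Random(seed)
    # Each sample is a binary mask: 1 keeps a superpixel, 0 turns it off.
    masks = [[rng.randint(0, 1) for _ in range(n_superpixels)]
             for _ in range(n_samples)]
    probs = [classifier(mask) for mask in masks]
    # Design matrix with a leading 1 for the intercept term.
    rows = [[1.0] + [float(v) for v in mask] for mask in masks]
    d = n_superpixels + 1
    # Normal equations: (X^T X + ridge * I) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d):
        xtx[i][i] += 1e-6
    xty = [sum(r[i] * p for r, p in zip(rows, probs)) for i in range(d)]
    weights = solve_linear_system(xtx, xty)
    return weights[1:]  # drop the intercept, keep one weight per superpixel


def toy_classifier(mask):
    """Hypothetical network: only superpixels 2 and 3 contain the leopard."""
    return 0.1 + 0.4 * mask[2] + 0.4 * mask[3]


weights = lime_weights(toy_classifier, n_superpixels=6)
ranked = sorted(range(6), key=lambda i: weights[i], reverse=True)
print(sorted(ranked[:2]))
```

With a real classifier, each mask would first be applied to the image (for example by greying out the switched-off superpixels) before the network is called.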
LIME gives us a way to peer into our model and determine the exact pixels it is leveraging to make its predictions. For our leopard classifier, these pixels often directly highlight the leopard in the frame. This not only gives us confidence in our model, but also provides us with a way to generate richer labels. LIME allows us to refine our classifications into bounding boxes for object detection by drawing rectangles around the important superpixels. From our experiments, the results were strikingly close to what a human would draw around the leopard.
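Turning the selected superpixels into a rectangle is a small computation. The sketch below assumes a segmentation stored as a 2D grid of superpixel ids, which is a hypothetical representation chosen for illustration; `bounding_box` is likewise an illustrative name.

```python
def bounding_box(segments, selected):
    """Return (left, top, right, bottom) enclosing all selected superpixels.

    `segments` is a 2D grid giving the superpixel id of each pixel.
    Coordinates are inclusive pixel indices; returns None if nothing matches.
    """
    rows = [y for y, row in enumerate(segments) for s in row if s in selected]
    cols = [x for row in segments for x, s in enumerate(row) if s in selected]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# Hypothetical 4x6 segmentation with superpixel ids 0..3.
segments = [
    [0, 0, 1, 1, 1, 0],
    [0, 2, 2, 1, 1, 0],
    [0, 2, 2, 3, 3, 0],
    [0, 0, 0, 3, 3, 0],
]
print(bounding_box(segments, {1, 2}))  # prints (1, 0, 4, 2)
```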
Figure 7: LIME pixels tracking a leopard as it moves through the mountains
Announcing: LIME on Spark
LIME has amazing potential to help users understand their models and even automatically create object detection datasets. However, LIME’s major drawback is its steep computational cost. To create an interpretation for just one image, we need to sample thousands of perturbed images, pass them all through our network, and then train a linear model on the results. If it takes 1 hour to evaluate your model on a dataset, then it could take at least 50 days of computation to convert these predictions to interpretations. To help make this process feasible for large datasets, we are releasing a distributed implementation of LIME as part of MMLSpark. This will enable users to quickly interpret any SparkML image classifier, including those backed by deep network frameworks like CNTK or TensorFlow. This makes complex workloads like the one described possible in only a few lines of MMLSpark code. If you would like to try the code, please see our example notebook for LIME on Spark.
Figure 8: Left: Outline of most important LIME superpixels. Right: example of human-labeled bounding box (blue) versus the LIME output bounding box (yellow)
Step 4: Transferring LIME’s Knowledge into a Deep Object Detector
By combining our deep classifier with LIME, we have created a dataset of leopard bounding boxes. Furthermore, we accomplished this without having to manually classify any images or label them with bounding boxes. Bing Images, Transfer Learning, and LIME have done all the hard work for us. We can now use this labelled dataset to learn a dedicated deep object detector capable of approximating LIME’s outputs at a 1000x speedup. Finally, we can deploy this fast object detector as a web service, phone app, or real-time streaming application for the Snow Leopard Trust to use.
To build our object detector, we used the TensorFlow Object Detection API. We again used deep transfer learning to fine-tune a pre-trained Faster R-CNN object detector. This detector was pre-trained on the Microsoft Common Objects in Context (COCO) object detection dataset. Just like transfer learning for deep image classifiers, working with an already intelligent object detector dramatically improves performance compared to learning from scratch. In our analysis we optimized for accuracy, so we decided to use a Faster R-CNN network with an Inception ResNet v2 feature extractor. Figure 9 shows the speed and accuracy of several network architectures; Faster R-CNN + Inception ResNet v2 models tend to cluster toward the high-accuracy side of the plot.
Figure 9: Speed/accuracy tradeoffs for modern convolutional object detectors. (Source: Google Research.)
We found that Faster R-CNN was able to reliably reproduce LIME’s outputs in a fraction of the time. Figure 10 shows several standard images from the Snow Leopard Trust’s dataset. On these images, Faster R-CNN’s outputs directly capture the leopard in the frame and match human-curated labels nearly perfectly.
Figure 10: A comparison of human labeled images (left) and the outputs of the final trained Faster-RCNN on LIME predictions (right).
Figure 11: A comparison of difficult human labeled images (left) and the outputs of the final trained Faster-RCNN on LIME predictions (right).
However, some images still pose challenges to this method. In Figure 11, we examine several mistakes made by the object detector. In the top image, there are two discernible leopards in the frame, but Faster R-CNN detects only the larger one. This is due to the method used to convert LIME outputs to bounding boxes. More specifically, we use a simple method that bounds all selected superpixels with a single rectangle. As a result, our bounding box dataset has at most one box per image. To refine this procedure, one could potentially cluster the superpixels to identify whether there is more than one object in the frame, then draw a bounding box around each cluster. Furthermore, some leopards are difficult to spot due to their camouflage, and they slip by the detector. Part of this effect might be due to anthropic bias in Bing Search. Namely, Bing Image Search returns only the clearest pictures of leopards, and these photos are much easier than your average camera trap photo. To mitigate this effect, one could engage in rounds of hard negative mining, augment the Bing data with hard-to-see leopards, and upweight those examples that show difficult-to-spot leopards.
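One way to realize the clustering refinement mentioned above is to group the selected pixels into connected regions and draw one box per region. The sketch below is an assumption about how that could look, not the method used in this work: it treats "clustering" as 4-connected component labelling on a boolean mask of important pixels, and both the mask and the helper name are made up.

```python
from collections import deque


def boxes_per_component(mask):
    """Return one (left, top, right, bottom) box per 4-connected True region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from this unvisited selected pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            left, top, right, bottom = x0, y0, x0, y0
            while queue:
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical mask of important pixels containing two separate regions.
mask = [
    [True, True, False, False, False, False],
    [True, True, False, False, True, True],
    [False, False, False, False, True, True],
]
boxes = boxes_per_component(mask)
print(boxes)  # prints [(0, 0, 1, 1), (4, 1, 5, 2)]
```

Two leopards that overlap in the frame would still merge into one region, so this only helps when the animals are spatially separated.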
Step 5: Deployment as a Web Service
The final stage in our project is to deploy our trained object detector so that the Snow Leopard Trust can get model predictions from anywhere in the world.
Announcing: Sub-millisecond Latency with Spark Serving
Today we are excited to announce a new platform for deploying Spark Computations as distributed web services. This framework, called Spark Serving, dramatically simplifies the serving process in Python, Scala, Java and R. It supplies ultra-low latency services backed by a distributed and fault-tolerant Spark Cluster. Under the hood, Spark Serving takes care of spinning up and managing web services on each node of your Spark cluster. As part of the release of MMLSpark v0.14, Spark Serving saw a 100-fold latency reduction and can now handle responses within a single millisecond.
Figure 12: Spark Serving latency comparison.
We can use this framework to take our deep object detector trained with
This post is authored by Tara Shankar Jana, Senior Technical Product Marketing Manager at Microsoft.
What if we could infuse AI into the everyday tools we use, to delight everyday users? With just a little bit of creativity – and the power of the Microsoft AI platform behind us – it’s now become easier than ever to create AI-enabled apps that can wow users.
Introducing Snip Insights!
An open source cross-platform AI tool for intelligent screen capture, Snip Insights is a step change in terms of how users can generate insights from their screen captures. The initial prototype of Snip Insights, built for Windows OS and released at Microsoft Build 2018 in May, was created by Microsoft Garage interns based out of Vancouver, Canada.
Our team at Microsoft AI Lab, in collaboration with the Microsoft AI CTO team, took Snip Insights to the next level by giving the tool an intuitive new user interface, adding cross-platform support (for MacOS, Linux, and Windows), and offering free download and usage under the MSA license.
Snip Insights taps into Microsoft Azure Cognitive Services APIs and helps increase user productivity by automatically providing them with intelligent insights on their screen captures.
Snip Insights taps into cloud AI services and – depending on the image that was screen-captured – can convert it into translated text, automatically detect and tag images, and provide smart image suggestions that improve the user workflow. This simple act of combining a familiar everyday desktop tool with Azure Cognitive Services has helped us create a one-stop shop for image insights.
For instance, imagine that you’ve scanned a textbook or work report. Rather than having to manually type out the information in it, snipping it will now provide you with editable text, thanks to the power of OCR. Or perhaps you’re scrolling through your social media feed and come across someone wearing a cool pair of shoes – you can now snip that to find out where to purchase them. Snip Insights can even help you identify famous people and popular landmarks.
In the past, you would have to take the screenshot, save the picture, upload it to an image search engine, and then draw your conclusions and insights from there.
This is so much smarter, isn’t it?
Celebrity Search: Snip a celebrity image and the tool will provide you with relevant information about them.
Object Detection and Bing Visual Search: You dig that T-shirt your friend is wearing in their latest social media post and want to know where you can buy it from. No problem! Just use Snip Insights and you can see matching product images and where to buy them from – all in a matter of seconds!
OCR, Language Translation and Cross-Platform Support: You find a quotation or phrase in English and wish to convert that to French or another language. Just use Snip Insights and you can do so effortlessly. What’s more, the tool is free and works on Windows, Linux and MacOS, so everybody’s covered!
Snip Insights is available on these three platforms:
Universal Windows Platform (UWP)
Xamarin.Forms enables you to build native UIs for iOS, Android, macOS, Linux, and Windows from a single, shared codebase.
You can dive into app development with Xamarin.Forms by following our free self-guided learning from Xamarin University. Xamarin.Forms has preview support for GTK# apps. GTK# is a graphical user interface toolkit that links the GTK+ toolkit and a variety of GNOME libraries, allowing the development of fully native GNOME graphics apps using Mono and .NET. Learn more here: Xamarin.Forms GTK#.
To add the keys to Snip Insights, a Microsoft Garage Project, start the application. Once running, click/tap the Settings icon in the toolbar. Scroll down until you find the “Cognitive Services, Enable AI assistance” toggle, and toggle it to the On position. You should now see the Insight Service Keys section.
Entity Search: Create a new Entity Search Cognitive Service. Once created, you can display the keys. Select one and paste it into “Settings”.
Image Analysis: In Azure, create a Computer Vision API Cognitive Service and use its key.
Image Search: In Azure, create a Bing Search v7 API Cognitive Service and use its key.
Text Recognition: You can use the same key used in Image Analysis above. Both Image Analysis and Text Recognition use the Computer Vision API.
Translator: Use the Translator Text API Cognitive Service.
Content Moderator: Use the Content Moderator API Cognitive Service.
For the LUIS App ID and Key, you will need to create a Language Understanding application in the Language Understanding Portal at https://www.luis.ai. Use the following steps to create your LUIS App and retrieve an App ID:
Click the Create new app button.
Provide an app name. Leave Culture (English) and Description as defaults.
In the left navigation pane, click Entities.
Click Manage prebuilt entities.
Select datetimeV2 and email.
Click the Train button at the top of the page.
Click the Publish tab.
Click the Publish to production slot button.
At the bottom of the screen, you will see a list with a Key String field. Click the Copy button and paste that key value into the LUIS Key field in settings for Snip Insights.
Click the Settings tab (at the top).
Copy the Application ID shown and paste into the LUIS App Id field in Settings for Snip Insights.
You can now paste each key in the settings panel of the application. Remember to click the Save button after entering all the keys.
NOTE: For each key entered there is a corresponding Service Endpoint. Some default endpoints are included (you can use these as examples), but when you copy each key, also check and replace the Service Endpoint for each service you are using. You will find the service endpoint for each Cognitive Service on its Overview page. Remember to click the Save button after updating all the Service Endpoints.
If you made it this far, and followed the above steps, you will have a fully working application to get started. Congratulations! We hope you have fun testing the project and thanks in advance for your contribution! You can find the code, solution development process and other details on GitHub.
We hope this post inspires you to get started with AI today, and motivates you to become an AI developer.
By Joseph Sirosh, Corporate Vice President and CTO of AI, and Sumit Gulwani, Partner Research Manager, at Microsoft.
There are an estimated 250 million “knowledge workers” in the world, a term that encompasses anybody engaged in professional, technical or managerial occupations. These are individuals who, for the most part, perform non-routine work that requires handling information and exercising intellect and judgement. We, the authors of this blog post, count ourselves among them. A majority of you reading this post are knowledge workers too, regardless of whether you’re a developer, data scientist, business analyst or manager.
Although a majority of knowledge work tends to be non-routine, there are, nevertheless, many situations in which we knowledge workers find ourselves doing tedious, repetitive tasks as part of our day jobs, especially tasks that involve manipulating data.
In this blog post, we take a look at Microsoft PROSE, an AI technology that can automatically produce software code snippets at just the right time and in just the right situations to help knowledge workers automate routine tasks that involve data manipulation. These are generally tasks that most users would otherwise find exceedingly tedious or too time consuming to even contemplate.
Examples of Tedious Everyday Knowledge Worker Tasks
Let’s take a couple of examples from the familiar world of spreadsheets to motivate this problem.
Figures 1a (above), 1b (below): A couple of examples of “data cleaning” tasks, and how Excel “Flash Fill” saves the user a ton of tedious manual data entry.
Look at the task being performed by the user in the Excel screen in Figure 1a above. If you see the text the user is entering in cell B2, it looks like they have modified the data in the corresponding column A, to fit a certain desired format for phone numbers. You can also see them starting to attempt an identical transformation manually in the next cell below, i.e. cell B3.
Similarly, in cell E2 in Figure 1b above, it seems like the user is transforming the first and last name fields available in columns C and D, changing them into a format with just the last name followed by a comma and a capitalized first initial. They next attempt to accomplish an identical transformation, manually, in cell E3, which is right below it.
Excel recognizes that the user-entered data in cells B2 and B3 represents their desired “output” (i.e. for a certain format of telephone numbers) and that it corresponds to the “input” data available in column A. Similarly, Excel recognizes that the user-entered data in cells E2 and E3 represents a transformed output of the corresponding input data present in columns C and D. Having recognized the desired transformation pattern, Excel is able to display the [likely] desired user output – shown in gray font in the images above – in all the cells of columns B and E, in these two examples.
Regular Excel users among you will readily recognize this as Excel Flash Fill – a feature that we released five years ago and which has collectively saved our users millions of tedious hours of data grunge work.
Introduction to Microsoft PROSE
PROSE is short for Program Synthesis using Examples, and it’s the technology underpinning Excel Flash Fill.
PROSE has been through many major enhancements since its initial release in Excel. These new capabilities have since been released in many other products including Power BI, PowerShell and SQL Server Management Studio and are increasingly finding their way into many scenarios that involve big data and AI, including in Azure Log Analytics and Azure Machine Learning, where PROSE-generated scripts can be executed on very large datasets, including via the Azure Spark runtime.
In this post, we describe how PROSE works and some of the exciting new scenarios where it’s being applied. In many cases, PROSE delivers productivity gains that are well in excess of 100x.
How Does Microsoft PROSE Work?
PROSE works by automatically generating software programs based on input-output examples that are provided at runtime, usually by a user who is just going about their everyday tasks.
Given such input-output examples, PROSE generates a ranked set of software programs that are consistent with the examples provided. It then applies the output of its “best” program, with a view to help the user complete their broader task. This workflow is illustrated below.
Figure 2: How Microsoft PROSE works, under the covers.
To go back to the examples in Figure 1, what Excel is doing is displaying the output of the best PROSE-generated program using the gray colored font. The Excel user can accept these suggestions simply by hitting the Enter key. At this point, the user could provide additional examples, such as a correction they may apply to one of the auto-generated outputs. In such a situation, PROSE will try to further refine its final program, adapting it to the newest example provided. It will once again update the entire output column to reflect the updated ‘best program’.
A key technical challenge for PROSE is to search for programs in an underlying domain-specific language that are consistent with the user-provided examples. Our real-time search methodology leverages logical reasoning techniques and neural-guided heuristics to solve this issue.
Another challenge is to resolve the ambiguity that may be present in the user-provided examples, since many programs can satisfy a few examples. Our machine learning-based ranking techniques often help us select an intended program from among the many that satisfy the examples. We also use active learning-based user interaction models that resemble an interactive conversation with the user, to iterate and arrive at the desired output.
The Microsoft PROSE SDK exposes these generic search and ranking algorithms, allowing advanced developers to construct PROSE capabilities for new task domains.
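The generate, filter, rank, and refine loop can be illustrated with a toy synthesizer in plain Python. This is not the PROSE SDK and makes several simplifying assumptions: the domain-specific language is a made-up one that only concatenates transformed words, candidate programs are found by brute-force enumeration rather than logical reasoning or neural-guided search, and ranking uses a hand-picked cost instead of a learned model. The three-word name in the second example is invented to show how an extra example resolves ambiguity.

```python
from itertools import product

# Word-level transforms in the toy DSL: (name, cost, function).
TRANSFORMS = [
    ("word", 1, lambda w: w),
    ("initial", 2, lambda w: w[0]),
    ("lower", 2, lambda w: w.lower()),
]
INDICES = [0, 1, -1]          # which whitespace-separated token to read
SEPARATORS = [" ", ", ", ""]  # constant strings allowed between two atoms


def make_atom(index, transform):
    """Build one atom: apply a transform to the token at a fixed index."""
    name, cost, fn = transform

    def run(tokens):
        return fn(tokens[index])

    return (f"{name}(token[{index}])", cost, run)


ATOMS = [make_atom(i, t) for i, t in product(INDICES, TRANSFORMS)]


def enumerate_programs():
    """Yield (description, cost, function) for each one- and two-atom program."""
    for desc, cost, run in ATOMS:
        yield desc, cost, (lambda s, run=run: run(s.split()))
    for (d1, c1, r1), sep, (d2, c2, r2) in product(ATOMS, SEPARATORS, ATOMS):
        yield (f"{d1} + {sep!r} + {d2}", c1 + c2 + 1,
               lambda s, r1=r1, r2=r2, sep=sep:
               r1(s.split()) + sep + r2(s.split()))


def synthesize(examples):
    """Return programs consistent with all examples, cheapest first."""
    consistent = []
    for desc, cost, fn in enumerate_programs():
        try:
            if all(fn(inp) == out for inp, out in examples):
                consistent.append((cost, desc, fn))
        except IndexError:
            continue  # the program reads a token that some input lacks
    return sorted(consistent, key=lambda p: (p[0], p[1]))


# One example leaves two candidates: token[1] and token[-1] both give "Sirosh".
one_example = synthesize([("Joseph Sirosh", "Sirosh, J")])
print(len(one_example))

# A second (invented) example with three words rules out token[1].
two_examples = synthesize([("Joseph Sirosh", "Sirosh, J"),
                           ("Mary Ann Smith", "Smith, M")])
best = two_examples[0][2]
print(two_examples[0][1])
print(best("Sumit Gulwani"))
```

The second call mirrors the refinement step described earlier: when the user corrects one output, the set of consistent programs shrinks and the best remaining program is re-applied to every row.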
In the rest of this post, we look at a few additional scenarios where data scientists, developers, and knowledge workers can use PROSE technology to get their tasks done faster and in a much more enjoyable manner. You can also look at a video overview of these scenarios.
Customer Use Cases and Microsoft PROSE Benefits
In this section, we highlight the benefit of using PROSE in the following scenarios:
In data preparation, for use by data scientists.
In Python Code Accelerator, for use by data scientists.
For generating code snippets, for use by software developers.
In code transformation, for use by software developers.
For table extraction from PDF files, for use by knowledge workers.
Scenario 1. Data Preparation
Although it may still be the sexiest job of the 21st century, being a data scientist sure involves spending lots of time on mundane data organization and analysis. In fact, it is estimated that data scientists end up spending as much as 80% of their time transforming data into formats that are more suitable for machine learning and AI.
This is where PROSE comes to the rescue. PROSE can automate several data manipulation tasks, including string transformations (already seen in the Excel example above), column splitting, field extraction from log files and web pages, and normalizing semi-structured data into structured data. To take one example, consider the dataset in Figure 3a below, which reports raw temperature measurements.
Figure 3a: Raw temperature measurements
Rather than using these as-is, a data scientist may want to map these temperatures to different bins as part of a featurization exercise. Unlike in the world of Excel, doing so manually in the world of big data is nigh impossible, so their best bet is to write a complex custom script.
They now have a much easier and faster alternative, which is to use PROSE to derive the new column based on a user-provided example, as shown in Figure 3b below.
Figure 3b: Transforming raw temperature measurements into interval bands via the power of Microsoft PROSE plus a couple of user-provided examples.
As seen in the figure, as soon as the user types their desired output (or example) in the second column of row 2, PROSE determines the user’s intent, automatically generates the relevant code snippet, and uses it to correctly populate all the remaining rows, with the output of the PROSE-generated code snippet shown in gray colored font. Voila!
Scenario 2. Python Code Accelerator in Notebooks
PROSE, in general, requires user intent and sample data to generate code. Notebooks, because of their partial execution capability, are great platforms for interactive program synthesis using PROSE. A user typically develops a script in a notebook one cell at a time, executing and evaluating the cell, and deciding on the next steps as she goes. After each cell executes, new states are created or old states are updated. At that point, the user may decide to write code for the next cell on her own or invoke PROSE Code Accelerator, which takes the user’s intent and the current state of the notebook to synthesize code on the user’s behalf. The code is readable and modifiable, much like what the user might have written herself, perhaps after spending much more time.
Figure 4a: Microsoft PROSE -powered Python Code Accelerator generating code to load a CSV file.
Notice in the above figure how PROSE analyzes the content of the file and generates Python code using libraries that the user may already be familiar with. By using PROSE, the user has saved several minutes of frustration and effort that she can now spend on more useful tasks.
Figure 4b: Microsoft PROSE -powered Python Code Accelerator generating code to fix the datatypes in a Python DataFrame.
Python users often struggle with wrong data types in data frames. PROSE intelligently analyzes the data and generates code to parse the values to the right data types and handle exception cases. Depending on the number of columns, it can be a huge time saver for data scientists.
Scenario 3. Generation of Code Snippets for Text Transformations
Consider a developer who needs to write a function to transform text inputs, but – rather than writing code – they want to just show the desired transformation via an example. Say, for instance, that they need to transform names from the format [First name] [Last name] to [Last name], [First initial]. E.g. if “Joseph Sirosh” was the input provided, they would want “Sirosh, J” as the desired output.
We did a fun implementation of this scenario in partnership with Stack Overflow where we created a chatbot for developers, one that uses PROSE behind the scenes to generate lots of different programs and figures out the best fit for a given example provided by the developer. Figure 5 below shows a Stack Overflow chatbot session that captures such an interaction.
Figure 5: Stack Overflow bot, powered by Microsoft PROSE. The bot provides code snippets in response to requested input/output transformation patterns.
This example showed pseudocode, but we could just as easily emit Python or Java.
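For a sense of what an emitted Python version of this transformation might look like, here is a hand-written sketch. It is not actual PROSE or chatbot output; the function name is illustrative, and it assumes the input has exactly two whitespace-separated words.

```python
def last_name_first_initial(full_name):
    """Convert '[First name] [Last name]' into '[Last name], [First initial]'."""
    # Unpacking raises ValueError if the input is not exactly two words.
    first, last = full_name.split()
    return f"{last}, {first[0].upper()}"


print(last_name_first_initial("Joseph Sirosh"))  # prints Sirosh, J
```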
Scenario 4. For Large Scale Code Transformation
PROSE has extensive applicability in scenarios that involve repetitive code transformations, including code reformatting and refactoring. In certain application migration scenarios, it is estimated that developers could end up spending as much as 40% of their entire time refactoring old code.
Take the example in Figure 6a below, where a SQL query written by another developer happens to use a different convention for naming a column than the one your organization prefers (this is called aliasing). For instance, the column aliasing for ExpectedShipDate is done using the “=” (equals to) operator, but your preference is to use “AS” for the same.
Figure 6a: Old code that needs to be reformatted.
Fortunately, you have the PROSE extension in your IDE (Integrated Development Environment) and, by giving a single example of the SQL transformation you wish to perform, i.e. by correcting just the one line of code with ExpectedShipDate as below:
DATEADD(DAY, 15, OrderDate) AS ExpectedShipDate,
… the IDE calls PROSE to take care of the rest, as shown in Figure 6b.
Figure 6b: Transformed code. Microsoft PROSE has correctly interpreted the developer’s intent, transforming all the column aliases to use AS instead of the “=” (equals) operator.
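To give a feel for the rewrite PROSE infers from that single example, here is a hand-coded equivalent in Python. It is only a sketch of the transformation's effect, not how PROSE works internally. The regular expression and the function name `use_as_alias` are assumptions, and the sketch assumes each aliased column sits on its own line of the select list.

```python
import re

# Matches one select-list line of the form "<alias> = <expression>[,]".
ALIAS_LINE = re.compile(r"^(\s*)(\w+)\s*=\s*(.+?)(,?)\s*$")

def use_as_alias(line):
    """Rewrite 'Alias = expr,' as 'expr AS Alias,'; return other lines unchanged."""
    match = ALIAS_LINE.match(line)
    if match is None:
        return line
    indent, alias, expression, comma = match.groups()
    return f"{indent}{expression} AS {alias}{comma}"

old_line = "    ExpectedShipDate = DATEADD(DAY, 15, OrderDate),"
print(use_as_alias(old_line))
```

A hand-written pattern like this has to be applied only to the select list, since a comparison standing alone on a line in a WHERE clause would also match. Avoiding that kind of brittle, case-by-case rule writing is the appeal of learning the transformation from an example.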
Scenario 5. Table Extraction from Images and PDF Files
As knowledge workers, we frequently encounter tabular data that is rendered as an image or appears in a PDF file, rendering it useless for any fresh data analysis.
Luckily for us, PROSE is not limited to text and can take a variety of input formats, including images and PDFs.
Figure 7a: Table in a PDF file.
PROSE supports OCR which allows it to process this sort of scenario seamlessly. All the user needs to do is perform a selection operation to indicate the bounds of the table, and, using a technique called predictive synthesis, PROSE extracts the table into a corresponding “live” spreadsheet, as shown in Figure 7b. This is a capability provided by the PDF connector in Microsoft Power BI. It allows users to perform computations and analysis that were either inaccessible or would have required tedious manual data reentry.
Figure 7b: Table in Figure 7a extracted using the PDF connector in Microsoft Power BI.
Microsoft PROSE, or Program Synthesis by Example, is a suite of technologies applicable in a variety of tasks, including the cleaning and pre-processing of data into formats that are amenable to analysis.
The Microsoft PROSE SDK includes:
The Flash Fill example described above, currently available in Excel and PowerShell.
Data extraction from text files by examples, available in PowerShell and Azure Log Analytics.
Data extraction and transformation of JSON, by examples.
Predictive file-splitting technology, which splits a text file into structured columns without any examples.
As humans, we thrive in tasks that exercise our creativity and intellect and prefer avoiding tasks that are exceedingly tedious and repetitive. By successfully predicting user intent and automatically generating code snippets to automate everyday tasks involving data, Microsoft PROSE has saved our users millions of hours of manual work.
We may have named it PROSE, but for the knowledge workers who are saving tons of time and boosting their productivity, this AI technology is more like sweet poetry!
Their discussion initially focused on a new low-cost 3D-printed prosthetic arm that can “see” and which connects to cloud AI services to generate customized behaviors, such as different types of grips needed to grasp nearby objects. But the conversation soon pivoted into a discussion about the unlimited set of possibilities that open up when devices such as this are embedded with low-cost sensors, take advantage of cloud connectivity and sophisticated cloud services such as AI, and link to other datasets and other things in the world around them.
True digital transformation is not about running a neural network or just about AI, as Joseph observes. It is about this ability to tap into software running as a service in the cloud, with the connectivity and global access that it brings. That can endow unexpected and almost magical new powers to ordinary everyday things.
Joseph draws the parallel between this gadget and the digital transformation that every company and every piece of software is going through. Eventually, nearly everything of some value in this world will be backed by a cloud service, will rely on similar connectivity and the ability to pool data to synthesize new behaviors – behaviors that are learned in the cloud and which can be tailored to each individual or situation.
That, along with the ability to improve continuously, is what sets apart this current wave of digital disruption.
Joseph concludes with the latter observation, i.e. that the key differentiator of this AI-powered platform of today is that – while traditional software does not improve (of its own accord) – this new software constantly improves, and that makes all the difference.
You can watch their full interview below:
Using AI to create inexpensive and intelligent prosthetics with Joseph Sirosh (Microsoft) - YouTube
This post is co-authored by Chun Ming Chin, Technical Program Manager, and Max Kaznady, Senior Data Scientist, of Microsoft, with Luyi Huang, Nicholas Kao and James Tayali, students at University of California at Berkeley.
This blog post is about the UC Berkeley Virtual Tutor project and the speech recognition technologies that were tested as part of that effort. We share best practices for machine learning and artificial intelligence techniques in selecting models and engineering training data for speech and image recognition. These speech recognition models, which are integrated with immersive games, are currently being tested at middle schools in California.
The University of California, Berkeley has a new program founded by alum and philanthropist Coleman Fung called the Fung Fellowship. In this program, students develop technology solutions to address education challenges such as enabling underserved children to help themselves in their learning. The solution involves building a Virtual Tutor that listens to what children say and interacts with them when playing educational games. The games were developed by a technology company founded by Coleman named Blue Goji. This work is being done in collaboration with the Partnership for a Healthier America, a nonprofit organization chaired by Michelle Obama.
GoWings Active Learning: Virtual Tutor Demo - YouTube
GoWings Safari, a safari-themed educational game, enabled with a Virtual Tutor that interacts with the user.
One of the students working on the project is a first-generation UC Berkeley graduate from Malawi named James Tayali. James said: “This safari game is important for kids who grow up in environments that expose them to childhood trauma and other negative experiences. Such kids struggle to pay attention and excel academically. Combining the educational experience with interactive, immersive games can improve their learning focus.”
As an orphan from Malawi who struggled to focus in school, this is an area that James can relate to. James had to juggle family issues and work on part-time jobs to support himself. Despite humble beginnings, James worked hard and attended UC Berkeley with scholarship support from the MasterCard Foundation. He is now paying it forward to the next generation of children. “This project can help children who share similar stories as me by helping them to let go of their past traumatic experiences, focus on their present education and give them hope for their future,” James added.
James Tayali (left), UC Berkeley Public Health major class of 2017 alum and Coleman
Fung (right), posing with the safari game shown on the monitor screen.
The fellowship program was taught by a Microsoft Search and Artificial Intelligence program manager, Chun Ming, who is also a UC Berkeley alum. He also advised the team that built the Virtual Tutor, which includes James Tayali, who majored in Public Health and served as the team’s product designer; Luyi Huang, an Electrical Engineering & Computer Science (EECS) student who led the programming tasks; and Nicholas Kao, an Applied Math and Data Science student, who drove the data collection and analysis. Much of this work was done remotely across three locations – Redmond, WA, Berkeley, CA, and Austin, TX.
UC Berkeley Fung Fellowship students Luyi Huang (left) and Nicholas Kao (right).
Chun Ming teaching speech recognition and artificial intelligence lectures to UC Berkeley students.
Insights from the Virtual Tutor Project
This article shares technical insights from the team in a couple of areas:
Model selection strategy and engineering considerations for eventual real-world deployment, so others who are doing something similar have more confidence investing in a model upfront that fits their scenario right from the start.
Training data engineering techniques that are useful references not only for speech recognition, but also for other scenarios such as image recognition.
Model Selection – Selecting the Speech Recognition Model
We explored speech recognition technologies from Carnegie Mellon University (CMU), Google, Amazon, and Microsoft, and eventually zeroed in on the following options:
1. Bing Speech Recognition Service
Microsoft Bing’s paid speech recognition service showed an accuracy of 100% although there was a 4-second wait to get results back from Bing’s remote servers. While the accuracy is impressive, we did not have the flexibility to adapt this black box model to other speech accents and background noise. One potential workaround is to process the output from the black box (i.e. post-processing).
2. CMU Open Source Statistical Model
We also explored other free, faster speech recognition models that are accessed locally rather than over a remote server. Eventually, we chose an open source statistical model, Sphinx, which had an initial accuracy of 85% and improved the latency from Bing Speech API’s 4 seconds down to 3 seconds. Unlike Bing’s black box solution, we can look inside the model to improve accuracy. For instance, we can reduce the search space of words needed for lookup in the dictionary or adapt the model with more speech training data. Sphinx has a 30-year-old legacy, originally developed by CMU researchers who are, coincidentally, now at Microsoft Research (MSR) – they include Xuedong Huang, Microsoft Technical Fellow, Fil Alleva, Partner Research Manager, and Hsiao-Wuen Hon, Corporate Vice President for MSR Asia.
Pre-defined, human features and linguistic structure in CMU’s open source speech recognition model.
3. Azure Deep Learning Model
The students were also connected to the Boston-based Microsoft Azure team at the New England Research & Development (NERD) center. With access to NERD’s work on an Azure AI product known as the Data Science Virtual Machine notebook, the Fellowship students achieved a Virtual Tutor speech recognition accuracy of 91.9%. Moreover, the average model execution time is the same for NERD’s and CMU’s models, at 0.5 seconds per input speech file. An additional prototype deep learning model was developed by NERD based on a winning solution of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. This model can push classification accuracy up even further and scales to larger training datasets.
Machine learned features with Azure’s deep learning model.
Plot of NERD’s model accuracy on the y-axis against the number of passes through the full
training data (i.e. epochs) on the x-axis. The final accuracy evaluated on a measurement set is 91.9%.
Training Data Engineering
The lack of audio training data was a hurdle in maximizing the potential of the deep learning model. More training data can also improve CMU’s speech recognition model.
1. Solve Training and Testing Data Mismatch Problems
We downloaded synthetic speaker audio files from the public web and collected audio files from UC Berkeley volunteers at a sampling rate of 16kHz. Initially, we observed that more training data did not increase test accuracy on the Oculus microphone. This problem was due to a sampling rate mismatch between the training data (16kHz) and the Oculus microphone input (48kHz). Once the input was downsampled, the improved Sphinx model had better accuracy.
Spectrogram (i.e. a visual representation of the spectrum of sound frequencies varying with time) comparison between 16kHz sampling (top) and 48kHz sampling (bottom). Note that the bottom spectrogram, sampled at 48kHz, has finer resolution.
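As an illustration of the downsampling step, the sketch below converts 48kHz samples to 16kHz by averaging each group of three consecutive samples. This is a deliberately crude stand-in and not the team's actual pipeline. The function name `downsample_by_3` and the synthetic test tone are assumptions, and a real system would more likely use a proper resampling filter such as `scipy.signal.resample_poly`.

```python
import math

SOURCE_RATE = 48_000  # microphone sampling rate from the text
TARGET_RATE = 16_000  # training data sampling rate from the text

def downsample_by_3(samples):
    """Reduce 48kHz audio to 16kHz by averaging each group of 3 samples.

    Averaging acts as a very simple low-pass filter before decimation.
    Trailing samples that do not fill a complete group are dropped.
    """
    usable = len(samples) - len(samples) % 3
    return [sum(samples[i:i + 3]) / 3 for i in range(0, usable, 3)]

# One second of a 440 Hz tone, standing in for real microphone input.
tone_48k = [math.sin(2 * math.pi * 440 * n / SOURCE_RATE) for n in range(SOURCE_RATE)]
tone_16k = downsample_by_3(tone_48k)
```

After this step, one second of audio contains 16,000 samples, so the live input is on the same footing as the training data.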
2. Synthetic Speaker Audio
Data biases due to speakers’ gender and accents must be balanced out by increasing the quality and quantity of training data. To address this, we imported synthesized male/female audio samples from translators such as Bing. We could train our model using these new synthesized audio samples in conjunction with our current data. However, we found that synthesized audio lacked the naturally occurring zigzag feature variations of the human voice. It was “too clean” to represent a natural human voice accurately in a live setting.
Spectrogram comparison between synthesized voice (top) and human speaker voice (bottom)
3. Background Noise and Speaker Signal Combination
Another problem is that there are loud background noises of varying kinds that the Oculus microphone is unable to denoise automatically. This interferes with the model’s ability to differentiate background noise from the speaker’s signal. To solve this, we combined an audio sample with multiple background noises.
The y axis “Amplitude” is the normalized amplitude scale (where 0 is no signal
and -1 or +1 is the strongest signal) representing the loudness in the audio.
This provided a large number of audio samples and allowed us to customize the model towards the Virtual Tutor’s live environment. With the additional synthesized samples, we trained a more accurate model as shown in the Confusion Matrix below. This matrix shows test examples where the model is confused due to a mismatch between the predicted class column and the ground truth row. Correct predictions are shown along the diagonal line of the matrix. The confusion matrix is a good way to visualize which classes require targeted training data improvements.
Confusion matrix before combining background noise and speaker signal as new training data for CMU’s Sphinx model. Model accuracy is 93%.
Confusion matrix after combining background noise and speaker signal as new training data for CMU’s Sphinx model. Model accuracy is 96%.
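For readers unfamiliar with confusion matrices, the following minimal sketch builds one with ground truth as rows and predictions as columns, matching the layout described above, and computes accuracy from the diagonal. The class labels and predictions are made-up illustrative values, not data from the Virtual Tutor project.

```python
from collections import Counter

def confusion_matrix(truth, predicted, labels):
    """Rows are ground-truth classes, columns are predicted classes."""
    counts = Counter(zip(truth, predicted))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Fraction of examples on the diagonal, i.e. correct predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical spoken-word classes and model outputs.
labels = ["lion", "zebra", "giraffe"]
truth = ["lion", "lion", "zebra", "zebra", "giraffe", "giraffe"]
predicted = ["lion", "zebra", "zebra", "zebra", "giraffe", "lion"]

matrix = confusion_matrix(truth, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>8}: {row}")
```

Any non-zero count off the diagonal points to a pair of classes the model mixes up, which is where extra training data is most useful.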
The synthesized noise had some problems. We discovered a few outliers when we overlaid (on time scale) the clean signal and synthesized noise without any signal adjustments. These outliers occurred because the noise was more prominent than the speaker’s signal.
4. Signal-to-Noise Ratio Optimization
To compensate for the above issue, we adjusted the relative decibel (dB) level ratios between the two overlaid audio files. Using Root Mean Square (RMS) to estimate the dB levels of each audio file, we were able to suppress the noisy audio, allowing the speaker’s voice to take precedence when training and predicting. Through a series of tests, we determined that the average dB level of noise should be about 70% of the average dB level of our clean audio. This allows us to maintain 95% accuracy when tested on a redundant training and testing set. A noise level higher than 80% decreases accuracy at an increasing rate.
Waveform plot showing noise at 70% (top) and 100% (bottom). The y axis “Amplitude” is the normalized
amplitude scale (where 0 is no signal and -1 or +1 is the strongest signal) representing the loudness in the audio.
Spectrogram plot showing noise at 70% (top) and 100% (bottom).
Note the bottom noise at 100% has more blue and pink regions.
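The noise scaling described in this section can be sketched as follows. This is a simplified illustration, not the team's code. It assumes the 70% figure is a ratio of linear RMS amplitudes (the text does not say whether the ratio was applied in dB or in linear terms), and the function names and synthetic signals are hypothetical.

```python
import math
import random

def rms(samples):
    """Root mean square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def scale_noise(speech, noise, noise_ratio=0.7):
    """Scale noise so that its RMS equals noise_ratio times the speech RMS.

    Assumes the noise clip is not silent (its RMS must be non-zero).
    """
    gain = noise_ratio * rms(speech) / rms(noise)
    return [gain * n for n in noise]

def mix(speech, noise, noise_ratio=0.7):
    """Overlay the scaled noise on the speech, sample by sample."""
    scaled = scale_noise(speech, noise, noise_ratio)
    return [s + n for s, n in zip(speech, scaled)]

# Synthetic stand-ins: a 220 Hz tone as "speech" and uniform random noise.
rate = 16_000
speech = [0.5 * math.sin(2 * math.pi * 220 * n / rate) for n in range(rate)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(rate)]

mixed = mix(speech, noise)
```

Repeating the mix with several different noise clips per speech sample yields many training examples from one recording, while the fixed ratio keeps the speaker's voice dominant.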
The story of the UC Berkeley Virtual Tutor project began in the Fall of 2016. We first tested a variety of speech recognition technologies and then explored a range of training data engineering techniques. Currently, our speech recognition models have been integrated with the game and are being tested at middle schools in California.
For those of you looking to add speech recognition capabilities to your projects, you should consider the following options based on our findings:
For ease of integration and high accuracy, try the Bing Speech API. It lets you use 5000 free transactions per month.
Earlier this week, MIT, in collaboration with Boston Consulting Group, released their second global study looking at AI adoption in industry. A top finding of this report is that the leading companies in AI adoption are now convinced of the value of AI and face the challenge of moving beyond individual point solutions toward broad, systematic use of AI across the company and at scale.
In the report, Joseph Sirosh, CTO of AI at Microsoft, discusses how Microsoft is building a complete AI platform that empowers enterprises to implement these AI-first business models and do so at scale. Scaling AI across an entire business requires companies to look far beyond just building that initial model.
As Joseph says, companies need an “AI Oriented Architecture capable of constantly running AI experiments reliably, with continuous integration and development, and then learning from those experiments and continuing to improve its operations.”