Machine Learning Yearning is about structuring the development of machine learning projects. The book contains practical insights that are difficult to find elsewhere, in a format that is easy to share with teammates and collaborators. Most technical AI courses explain how the different ML algorithms work under the hood, but this book teaches you how to actually use them. If you aspire to be a technical leader in AI, this book will help you on your way. Historically, the only way to learn how to make strategic decisions about AI projects was to participate in a graduate program or to gain experience working at a company. Machine Learning Yearning helps you acquire this skill quickly, enabling you to become better at building sophisticated AI systems.
Table of Contents
About the Author
Concept 1: Iterate, iterate, iterate…
Concept 2: Use a single evaluation metric
Concept 3: Error analysis is crucial
Concept 4: Define an optimal error rate
Concept 5: Work on problems that humans can do well
Concept 6: How to split your dataset
About the Author
Andrew Ng is a computer scientist, executive, investor, entrepreneur, and one of the leading experts in Artificial Intelligence. He is the former Vice President and Chief Scientist of Baidu, an adjunct professor at Stanford University, the creator of one of the most popular online courses for machine learning, a co-founder of Coursera, and a former head of Google Brain. At Baidu, he was significantly involved in expanding the AI team to several thousand people.
The book starts with a little story. Imagine you want to build the leading cat detector system as a company. You have already built a prototype, but unfortunately your system's performance is not that great. Your team comes up with several ideas on how to improve the system, but you are confused about which direction to follow. You could build the world's leading cat detector platform, or waste months of your time following the wrong direction.
This book is there to tell you how to decide and prioritize in a situation like this. According to Andrew Ng, most machine learning problems will leave clues about the most promising next steps and about what you should avoid doing. He goes on to explain that learning to "read" those clues is a crucial skill in our domain.
In a nutshell, ML Yearning is about giving you a deep understanding of setting the technical direction of machine learning projects.
Since your team members might react skeptically when you propose new ways of doing things, he made the chapters very short (1–2 pages), so that team members can read them in a few minutes and understand the idea behind the concepts. If you are interested in reading this book, note that it is not suited for complete beginners, since it requires basic familiarity with supervised learning and deep learning.
In this post, I will share six concepts from the book, in my own words and based on my understanding.
Concept 1: Iterate, iterate, iterate…
Ng emphasizes throughout the book that it is crucial to iterate quickly, since machine learning is an iterative process. Instead of thinking about how to build the perfect ML system for your problem, you should build a simple prototype as fast as you can. This is especially true if you are not an expert in the problem's domain, since it is hard to correctly guess the most promising direction.
You should build a first prototype in just a few days and then clues will pop up that show you the most promising direction to improve the performance of the prototype. In the next iteration, you will improve the system based on one of these clues and build the next version of it. You will do this again and again.
He goes on to explain that the faster you can iterate, the more progress you will make. Other concepts of the book build upon this principle. Note that this advice is meant for people who want to build an AI-based application, not do research in the field.
Concept 2: Use a single evaluation metric
This concept builds upon the previous one, and the reason you should choose a single-number evaluation metric is very simple: it enables you to quickly evaluate your algorithm, and therefore you are able to iterate faster. Using multiple evaluation metrics simply makes it harder to compare algorithms.
Imagine you have two algorithms. The first has a precision of 94% and a recall of 89%. The second has a precision of 88% and a recall of 95%.
Here, neither classifier is obviously superior, so without a single evaluation metric you would have to spend time figuring out which one to prefer. The problem is that you lose time on this at every iteration, and it adds up over the long run. You will try a lot of ideas about architecture, parameters, features, and so on. A single-number evaluation metric (such as precision or the F1 score) enables you to sort all your models by performance and quickly decide which one works best. Another way of improving the evaluation process is to combine several metrics into a single one, for example by averaging multiple error metrics.
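For example, combining the two numbers into an F1 score (the harmonic mean of precision and recall, a common choice but not the only one) makes the comparison immediate. Here is a minimal sketch in Python, using the hypothetical precision/recall numbers from the example above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: one number to sort models by."""
    return 2 * precision * recall / (precision + recall)

classifier_a = f1_score(0.94, 0.89)  # precision 94%, recall 89%
classifier_b = f1_score(0.88, 0.95)  # precision 88%, recall 95%

print(f"Classifier A: F1 = {classifier_a:.4f}")  # ~0.9143
print(f"Classifier B: F1 = {classifier_b:.4f}")  # ~0.9137
```

With one number per model, the first classifier wins by a hair, and no discussion is needed at each iteration.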
Nevertheless, some ML problems need to satisfy more than one metric: you might, for example, have to take running time into consideration. Ng explains that you should define an "acceptable" running time, which lets you quickly discard the algorithms that are too slow and compare the remaining ones using your single-number evaluation metric.
In short, a single-number evaluation metric enables you to quickly evaluate algorithms, and therefore to iterate faster.
Concept 3: Error analysis is crucial
Error analysis is the process of looking at examples where your algorithm's output was incorrect. For example, imagine that your cat detector mistakes birds for cats, and you already have several ideas on how to solve that issue.
With proper error analysis, you can estimate how much an idea for improvement would actually increase the system's performance, without investing months of your time implementing it only to realize that it wasn't crucial to your system. This enables you to decide which idea is best to spend your resources on. If you find out that only 9% of the misclassified images are birds, then it doesn't matter how much you improve your algorithm's performance on bird images: it can't eliminate more than 9% of your errors.
Furthermore, it enables you to quickly judge several ideas for improvement in parallel. You just need to create a spreadsheet and fill it out while examining, for example, 100 misclassified dev set images. On the spreadsheet, you create a row for every misclassified image and columns for every idea that you have for improvement. Then you go through every misclassified image and mark with which idea the image would have been classified correctly.
Afterward, you know exactly that, for example, with idea-1 the system would have classified 40% of the misclassified images correctly, with idea-2 12%, and with idea-3 only 9%. You then know that idea-1 is the most promising improvement for your team to work on.
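The spreadsheet tally described above is easy to automate. Here is a small sketch with made-up image names and ideas (the file names and idea labels are purely hypothetical):

```python
from collections import Counter

# Hypothetical error-analysis spreadsheet: one row per misclassified
# dev set image, listing which proposed ideas would have fixed it.
rows = [
    {"image": "img_001.jpg", "fixed_by": ["idea-1"]},
    {"image": "img_002.jpg", "fixed_by": ["idea-1", "idea-2"]},
    {"image": "img_003.jpg", "fixed_by": ["idea-3"]},
    {"image": "img_004.jpg", "fixed_by": []},  # none of the ideas would help
    {"image": "img_005.jpg", "fixed_by": ["idea-1"]},
]

counts = Counter(idea for row in rows for idea in row["fixed_by"])
total = len(rows)

for idea, n in counts.most_common():
    print(f"{idea}: would fix {n}/{total} = {n / total:.0%} of the errors")
```

Sorting ideas by the fraction of errors they would fix gives you a prioritized to-do list in one pass over the misclassified examples.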
Also, once you start looking through these examples, you will probably find new ideas on how to improve your algorithm.
Concept 4: Define an optimal error rate
The optimal error rate is helpful to guide your next steps. In statistics, it is also often called the Bayes error rate.
Imagine that you are building a speech-to-text system and you find out that 19% of the audio files you expect users to submit have such dominant background noise that even humans can't recognize what was said in them. If that's the case, you know that even the best system would probably have an error rate of around 19%. In contrast, if you work on a problem with an optimal error rate of nearly 0%, you can hope that your system will do just as well.
It also helps you to detect whether your algorithm is suffering from high bias or high variance, which in turn helps you define the next steps to improve it.
But how do we know what the optimal error rate is? For tasks that humans are good at, you can compare your system's performance to that of humans, which gives you an estimate of the optimal error rate. In other cases, it is often hard to define the optimal rate, which is one reason why you should work on problems that humans can do well, as we will discuss in the next concept.
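The bias/variance reading mentioned above can be made concrete. The following sketch applies the rule of thumb from the book: the gap between the optimal error and the training error is avoidable bias, and the gap between the training and dev error is variance (the specific error numbers here are hypothetical):

```python
def diagnose(optimal_err: float, train_err: float, dev_err: float):
    """Rough diagnosis in the spirit of the book: optimal-to-train gap
    is avoidable bias; train-to-dev gap is variance."""
    avoidable_bias = train_err - optimal_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# Speech example from the text: humans get about 19% of the clips wrong.
bias, var, focus = diagnose(optimal_err=0.19, train_err=0.21, dev_err=0.30)
print(f"avoidable bias ~ {bias:.2f}, variance ~ {var:.2f}; focus on {focus}")
```

In this example the model is close to human performance on the training set but much worse on the dev set, so techniques that reduce variance (more data, regularization) are the promising direction.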
Concept 5: Work on problems that humans can do well
Throughout the book, Ng explains several times why it is recommended to work on machine learning problems that humans can do well themselves, such as speech recognition, image classification, and object detection. There are several reasons for this.
First, it is easier to get or to create a labeled dataset, because it is straightforward for people to provide high accuracy labels for your learning algorithm if they can solve the problem by themselves.
Second, you can use human performance as the optimal error rate that you want to reach with your algorithm. NG explains that having defined a reasonable and achievable optimal error helps to accelerate the team’s progress. It also helps you to detect if your algorithm is suffering from high bias or variance.
Third, it enables you to do error analysis based on your human intuition. If you are building, for example, a speech recognition system and your model misclassifies its input, you can try to understand what information a human would use to get the correct transcription, and then modify the learning algorithm accordingly. Although algorithms are surpassing humans at more and more tasks, you should still try to avoid problems that humans can't do well themselves.
To summarize, you should avoid these tasks because they make it harder to obtain labels for your data, you can no longer count on human intuition, and it is hard to know what the optimal error rate is.
Concept 6: How to split your dataset
Ng also proposes a way to split your dataset. He recommends the following:
Train Set: With it, you train your algorithm and nothing else.
Dev Set: This set is there to do hyperparameter tuning, to select and create proper features and to do error analysis. It is basically there to make decisions about your algorithm.
Test Set: The test set is used to evaluate the performance of your system, but not to make decisions. It’s just there for evaluation, and nothing else.
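A minimal sketch of such a three-way split is shown below. Note the assumption: all examples come from one pool, so shuffling once and slicing guarantees that dev and test share a distribution. When your deployment data differs from your training data (the mobile-app scenario discussed next), you would instead draw dev and test from the deployment-like data:

```python
import random

def train_dev_test_split(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once, then carve dev and test sets out of the same
    shuffled pool, so both sets share a distribution."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```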
The dev set and test set allow your team to quickly evaluate how well your algorithm is performing. Their purpose is to guide you to the most important changes that you should make to your system.
He recommends choosing the dev and test sets so that they reflect the data you want to do well on in the future, once your system is deployed. This is especially important if you expect that data to be different from the data you are training on right now. For example, you might be training on regular camera images while your deployed system, as part of a mobile app, will only receive pictures taken by phones; this can happen if you don't have access to enough mobile phone photos for training. Therefore, you should pick test set examples that reflect what you want to perform well on later in reality, rather than the data that you used for training.
Also, you should choose dev and test sets that come from the same distribution. Otherwise, there is a chance that your team will build something that does well on the dev set, only to find that it performs extremely poorly on the test data, which is what you care about the most.
In this post, you've learned about six concepts of Machine Learning Yearning. You now know why it is important to iterate quickly, why you should use a single-number evaluation metric, and what error analysis is and why it is crucial. Also, you've learned about the optimal error rate, why you should work on problems that humans can do well, and how you should split your data. Furthermore, you learned that you should pick dev and test set data that reflects the data you want to do well on in the future, and that the dev and test sets should come from the same distribution. I hope this post gave you an introduction to some of the book's concepts; I can definitely say that it is worth reading.
This moment has been a long time coming. The technology behind speech recognition has been in development for over half a century, going through several periods of intense promise — and disappointment. So what changed to make ASR viable in commercial applications? And what exactly could these systems accomplish, long before any of us had heard of Siri?
The story of speech recognition is as much about the application of different approaches as the development of raw technology, though the two are inextricably linked. Over a period of decades, researchers would conceive of myriad ways to dissect language: by sounds, by structure — and with statistics.
Human interest in recognizing and synthesizing speech dates back hundreds of years (at least!) — but it wasn’t until the mid-20th century that our forebears built something recognizable as ASR.
1961 — IBM Shoebox
Among the earliest projects was a “digit recognizer” called Audrey, created by researchers at Bell Laboratories in 1952. Audrey could recognize spoken numerical digits by looking for audio fingerprints called formants — the distilled essences of sounds.
In the 1960s, IBM developed Shoebox — a system that could recognize digits and arithmetic commands like “plus” and “total”. Better yet, Shoebox could pass the math problem to an adding machine, which would calculate and print the answer.
Meanwhile, researchers in Japan built hardware that could recognize the constituent parts of speech like vowels; other systems could evaluate the structure of speech to figure out where a word might end. And a team at University College in England could recognize 4 vowels and 9 consonants by analyzing phonemes, the discrete sounds of a language.
But while the field was taking incremental steps forward, it wasn’t necessarily clear where the path was heading. And then: disaster.
October 1969 — The Journal of the Acoustical Society of America
A Piercing Freeze
The turning point came in the form of a letter written by John R. Pierce in 1969.
Pierce had long since established himself as an engineer of international renown; among other achievements he coined the word transistor (now ubiquitous in engineering) and helped launch Echo I, the first-ever communications satellite. By 1969 he was an executive at Bell Labs, which had invested extensively in the development of speech recognition.
In an open letter³ published in The Journal of the Acoustical Society of America, Pierce laid out his concerns. Citing a “lush” funding environment in the aftermath of World War II and Sputnik, and the lack of accountability thereof, Pierce admonished the field for its lack of scientific rigor, asserting that there was too much wild experimentation going on:
“We all believe that a science of speech is possible, despite the scarcity in the field of people who behave like scientists and of results that look like science.” — J.R. Pierce, 1969
Pierce put his employer’s money where his mouth was: he defunded Bell’s ASR programs, which wouldn’t be reinstated until after he resigned in 1971.
Thankfully there was more optimism elsewhere. In the early 1970s, the U.S. Department of Defense’s ARPA (the agency now known as DARPA) funded a five-year program called Speech Understanding Research. This led to the creation of several new ASR systems, the most successful of which was Carnegie Mellon University’s Harpy, which could recognize just over 1000 words by 1976.
Meanwhile, efforts from IBM and AT&T’s Bell Laboratories pushed the technology toward possible commercial applications. IBM prioritized speech transcription in the context of office correspondence, and Bell was concerned with ‘command and control’ scenarios: the precursors to the voice dialing and automated phone trees we know today.
Despite this progress, by the end of the 1970s ASR was still a long way from being viable for anything but highly specific use cases.
The ‘80s: Markovs and More
A key turning point came with the popularization of Hidden Markov Models (HMMs) in the mid-1980s. This approach represented a significant shift “from simple pattern recognition methods, based on templates and a spectral distance measure, to a statistical method for speech processing” — which translated to a leap forward in accuracy.
A large part of the improvement in speech recognition systems since the late 1960s is due to the power of this statistical approach, coupled with the advances in computer technology necessary to implement HMMs.
HMMs took the industry by storm — but they were no overnight success. Jim Baker first applied them to speech recognition in the early 1970s at CMU, and the models themselves had been described by Leonard E. Baum in the ‘60s. It wasn’t until 1980, when Jack Ferguson gave a set of illuminating lectures at the Institute for Defense Analyses, that the technique began to disseminate more widely.
The success of HMMs validated the work of Frederick Jelinek at IBM’s Watson Research Center, who since the early 1970s had advocated for the use of statistical models to interpret speech, rather than trying to get computers to mimic the way humans digest language: through meaning, syntax, and grammar (a common approach at the time). As Jelinek later put it: “Airplanes don’t flap their wings.”
These data-driven approaches also facilitated progress that had as much to do with industry collaboration and accountability as individual eureka moments. With the increasing popularity of statistical models, the ASR field began coalescing around a suite of tests that would provide a standardized benchmark to compare to. This was further encouraged by the release of shared data sets: large corpuses of data that researchers could use to train and test their models on.
In other words: finally, there was an (imperfect) way to measure and compare success.
November 1990, Infoworld
Consumer Availability — The ‘90s
For better and worse, the 90s introduced consumers to automatic speech recognition in a form we’d recognize today. Dragon Dictate launched in 1990 for a staggering $9,000, touting a dictionary of 80,000 words and features like natural language processing (see the Infoworld article above).
These tools were time-consuming (the article claims otherwise, but Dragon became known for requiring users to ‘train’ the dictation software to their own voice). And they required users to speak in a stilted manner: Dragon could initially recognize only 30–40 words a minute; people typically talk around four times faster than that.
But it worked well enough for Dragon to grow into a business with hundreds of employees, and customers spanning healthcare, law, and more. By 1997 the company introduced Dragon NaturallySpeaking, which could capture words at a more fluid pace — and, at $150, a much lower price-tag.
Even so, there may have been as many grumbles as squeals of delight: to the degree that there is consumer skepticism around ASR today, some of the credit should go to the over-enthusiastic marketing of these early products. But without the efforts of industry pioneers James and Janet Baker (who founded Dragon Systems in 1982), the productization of ASR may have taken much longer.
November 1993, IEEE Communications Magazine
Whither Speech Recognition — The Sequel
25 years after J.R. Pierce’s paper was published, the IEEE published a follow-up titled Whither Speech Recognition: the Next 25 Years⁵, authored by two senior employees of Bell Laboratories (the same institution where Pierce worked).
The latter article surveys the state of the industry circa 1993, when the paper was published — and serves as a sort of rebuttal to the pessimism of the original. Among its takeaways:
The key issue with Pierce’s letter was his assumption that in order for speech recognition to become useful, computers would need to comprehend what words mean. Given the technology of the time, this was completely infeasible.
In a sense, Pierce was right: by 1993 computers had a meager understanding of language — and in 2018, they're still notoriously bad at discerning meaning.
Pierce’s mistake lay in his failure to anticipate the myriad ways speech recognition can be useful, even when the computer doesn’t know what the words actually mean.
The Whither sequel ends with a prognosis, forecasting where ASR would head in the years after 1993. The section is couched in cheeky hedges (“We confidently predict that at least one of these eight predictions will turn out to have been incorrect”) — but it’s intriguing all the same. Among their eight predictions:
“By the year 2000, more people will get remote information via voice dialogues than by typing commands on computer keyboards to access remote databases.”
“People will learn to modify their speech habits to use speech recognition devices, just as they have changed their speaking behavior to leave messages on answering machines. Even though they will learn how to use this technology, people will always complain about speech recognizers.”
The Dark Horse
In a forthcoming installment in this series, we’ll be exploring more recent developments and the current state of automatic speech recognition. Spoiler alert: neural networks have played a starring role.
But neural networks are actually as old as most of the approaches described here — they were introduced in the 1950s! It wasn’t until the computational power of the modern era (along with much larger data sets) that they changed the landscape.
But we’re getting ahead of ourselves. Stay tuned for our next post on Automatic Speech Recognition by following Descript on Medium, Twitter, or Facebook.
Connectionist Temporal Classification (CTC) is a valuable operation for tackling sequence problems where timing is variable, like Speech and Handwriting Recognition. Without CTC, you would need an aligned dataset, which in the case of Speech Recognition would mean that every character of a transcription would need to be aligned to its exact location in the audio file. CTC therefore makes training such a system a lot easier.
Table of Contents:
Recurrent Neural Networks
Basic RNNs and Speech Recognition
Connectionist Temporal Classification
Recurrent Neural Networks
This post requires knowledge about Recurrent Neural Networks. If you aren't familiar with this kind of network, I encourage you to check out my article about them first.
Nevertheless, I will give you a little recap:
RNNs are the state-of-the-art algorithm for sequential data. They have an internal, time-dependent state (memory) due to a so-called feedback loop. Because of that, they are good at predicting what's coming next based on what happened before. This makes them well suited for problems that involve sequential data, like speech recognition, handwriting recognition, and so on.
To better understand RNNs, we have to look at what makes them different than a usual (feedforward) Neural Network. Take a look at the following example.
Imagine that we put the word "SAP" into a feedforward Neural Network (FFNN) and into a Recurrent Neural Network (RNN), and that they process it one character after the other. By the time the FFNN reaches the letter "P", it has already forgotten about "S" and "A", but an RNN hasn't. This is due to the different ways they process information, as the two images below illustrate.
In a FFNN, the information only moves in one direction (from input, through the hidden layers, through the output layers) and therefore, the information never touches a node twice. This is the reason why feedforward Neural Networks can’t remember what they received as input – they only remember the data they are trained upon.
In contrast, a RNN cycles information through a loop. When making a decision, a RNN takes into consideration the current input and also what it has learned from the previous inputs.
Basic RNNs and Speech Recognition
Speech Recognition is the task of translating spoken language into text by a computer. The problem is that you have an acoustic observation (some recording as an audio file) and you want to have a transcription of it, without manually creating it.
You know that Recurrent Neural Networks are well suited for tasks that involve sequential data. And because speech is sequential data, in theory you could train an RNN with a dataset of acoustic observations and their corresponding transcriptions. But as you probably already guessed, it isn't that easy. That is because basic (also called canonical) Recurrent Neural Networks require aligned data. This means that each character (not each word!) needs to be aligned to its location in the audio file. Just take a look at the image below and imagine that every character would need to be aligned to its exact location.
Of course, this is tedious work that no one wants to do and only a few organizations could afford to hire enough people to do that, which is the reason why there are very few datasets like this out there.
Connectionist Temporal Classification
Fortunately, we have Connectionist Temporal Classification (CTC), which is a way around not knowing the alignment between the input and the output. CTC is simply a loss function used to train Neural Networks, just like Cross-Entropy. It is used for problems where obtaining aligned data is an issue, like Speech Recognition.
Like I said, with CTC there is no need for aligned data, because it can assign a probability to any label given an input. This means it only requires an audio file as input and a corresponding transcription. But how can CTC assign a probability to any label given just an input? CTC is "alignment-free": it works by summing over the probabilities of all possible alignments between the input and the label. To understand that, take a look at this naive approach:
Here we have an input of size 9, and the correct transcription is "Iphone". We force our system to assign an output character to each input step and then collapse the repeats, which results in the output. But this approach has two problems:
1.) It does not make sense to force every input step to be aligned to some output, because we also need to account for silence within the input.
2.) There is no way to produce words that have two identical characters in a row, like the word "Kaggle". With this approach, we could only produce "Kagle" as output.
There is a way around that, called the "blank token". The blank token does not mean anything by itself and simply gets removed before the final output is produced. The algorithm can now assign a character or the blank token to each input step, with one rule: to allow a repeated character in the output, a blank token must appear between the two occurrences. With this simple rule, we can also produce outputs with two identical characters in a row.
Here is an illustration of how this works:
1.) The CTC network assigns the character with the highest probability (or no character) to every input sequence.
2.) Repeats without a blank token in between get merged.
3.) and lastly, the blank token gets removed.
The CTC network can then give you the probability of a label for a given input by summing the probabilities of all alignments (paths) that collapse to that label.
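The merge-and-remove steps above can be sketched in a few lines of Python. This is only the collapse function for a single alignment (a real CTC implementation sums probabilities over all alignments efficiently with dynamic programming); the `-` symbol stands in for the blank token:

```python
def ctc_collapse(path: str, blank: str = "-") -> str:
    """Turn a per-time-step character path into the final label:
    merge repeated characters not separated by a blank, then drop blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Two different alignments that collapse to the same label:
print(ctc_collapse("ka-ggg-gle"))  # kaggle
print(ctc_collapse("kkag-ggle-"))  # kaggle
# Without a blank between them, repeats get merged:
print(ctc_collapse("kagggle"))     # kagle
```

Note how the blank token between the two g-runs is what allows the double "gg" to survive the collapse, exactly as the rule above describes.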
In this post, we’ve reviewed the main facts about Recurrent Neural Networks and we learned why canonical RNNs aren’t really suited for sequence problems. Then we discussed what Connectionist Temporal Classification is, how it works and why it enables RNNs to tackle tasks like Speech Recognition.
Software project management is the practice of planning and executing software projects. Its concepts need to be understood by every team member to ensure a smooth project flow. There are different methodologies, which can broadly be divided into structured and flexible approaches. The most common approach, which gained a lot of popularity in recent years, is called "Agile". This is a flexible approach based on delivering requirements iteratively and incrementally throughout the project life cycle. This post will give you a gentle introduction to agile and non-agile project management approaches, with a focus on the Scrum Methodology.
Table of Contents:
The Waterfall Methodology
Agile & the Agile Manifesto
The Scrum Methodology
The Waterfall Methodology
The Waterfall Methodology is one of the oldest and most traditional methods for managing the development of software applications. It splits the Software Development Lifecycle (SDLC) into six different stages. Waterfall is a linear approach, where you can only proceed to the next stage once the current stage is completely finished; this is why it is called the Waterfall Methodology. If you wanted to go back to a previous stage, for example to change the design during the deployment phase, you would need to go through every stage that comes after the design stage again.
Although the Agile approach is more commonly applied in the industry, the Waterfall Methodology is still used because implementing it is a straightforward process, thanks to its step-by-step nature.
You can see the six stages below:
I. Requirement Analysis
Here, the requirements for the application are analyzed and documented. These documents are the baseline for creating the software, and the requirements are split into functional and non-functional requirements. As at each of the six stages, the results get reviewed and signed off before proceeding to the next phase.
II. System Design
At this stage, the design team creates a blueprint for the software, using the requirement documents from stage one. They create high-level and low-level design documents, which also get reviewed and signed off before proceeding to the next phase.
III. Implementation
In the implementation stage, developers convert the designs into actual software. This stage results in a working software program.
IV. Testing
Here, the software gets tested to ensure that all requirements are fulfilled and that it works flawlessly. This also helps to identify errors and missing functionality by checking the software against the actual requirements from stage one. In addition, testing provides stakeholders with information about the quality of the software.
V. Deployment
In this stage, the software gets turned into a real-world application. Software deployment covers all of the activities that make a software system available for use. This involves preparing the application to run and operate in a specific environment, making it scalable, and optimizing its performance. Deployment can be done either manually or through automated systems. Because every piece of software is unique, this is a general process that has to be customized according to the project's specific requirements and characteristics.
VI. Maintenance
In the maintenance stage, the software gets passed to the maintenance team, which regularly updates it and fixes issues. They also develop new functionality enhancements to further improve performance or other attributes.
This is how the Waterfall Model works. As mentioned, the problem with the Waterfall Methodology is that if any requirements change, the team has to move back to the requirement analysis stage and go through all subsequent stages again. The same happens if the design or something else changes, which makes it difficult and inconvenient to make changes later on. In practice, design and development are rarely sequential, because new requirements can surface at any point, necessitating changes in design, which in turn results in new development tasks. Another disadvantage is that no working software is produced until the mid or late stages. This is a problem because if the investors decide to shut down the project partway through, the team may have completed much of the planned work without having any working software to show for it. This would be different with the agile approach, which is discussed in the next section. Because of these issues, Waterfall can be a risky and uncertain approach, and it is not well suited for large and complex projects.
Its advantages are that it is easy to understand and implement, that its phases don't overlap, that it works well for relatively small projects and that it is easy to manage.
Agile Software Development
What is Agile Development?
The term "Agile" comprises different software development methods that all work in a similar way and in accordance with the Agile Manifesto. Note that Agile and its corresponding methods are not as prescriptive as they may appear. Most of the time, individual teams figure out what works for them and adjust the Agile system accordingly. Teams that work in an Agile way are cross-functional, self-organized and take working software as their measure of progress.
Agile methods are all iterative approaches that build software incrementally instead of trying to deliver everything at once near the end. Agile works by breaking project requirements down into little bits of user functionality, called user stories. These are then prioritized and delivered continuously in short cycles called iterations. To achieve the highest customer satisfaction, working Agile means putting a focus on quality. In Agile teams, development team members have the responsibility to solve problems, organize and assign tasks and create a code architecture that is modular, flexible, extensible and suits the nature of the team.
The Agile Manifesto
In 2001, software and development experts created a statement of values for successful software projects, which is known as the Agile Manifesto. Agile methods are all based on the four main values of this manifesto, which you can see below.
We are uncovering better ways of developing software by doing it and helping others do it. Through this work, we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.
Scrum is one of the most widely used and best known Agile frameworks out there. At its foundation, Scrum can be applied to nearly any project.
First of all, in Scrum the project team is separated into three different roles: the Product Owner, who is responsible for the product that will be created; the Scrum Master, who is responsible for the implementation of and compliance with the Scrum rules and processes within the company; and lastly, the Development Team, the actual programmers and experts who create the product.
Secondly, Scrum has five events that ensure transparency, inspection and adaptation. The first of these is the Sprint, which is at the heart of Scrum. A Sprint is a cycle that should always be equally long and should not take longer than four weeks. Every Sprint has clear goals about what should be accomplished during its timeframe, and the Sprint's progress is tracked on the Scrum board. This board is split into "To Do", "Build", "Test" and "Done", and each task is placed on it as a sticker and moved from "To Do" onward until it reaches "Done". You can see such a board below.
At the end of every Sprint, a new piece of functionality for the final product should be finished and delivered to the customer. Then there is the Sprint Planning, where the goals for each Sprint are set; this is of course done before a Sprint starts. There is also the Daily Scrum, a roughly 15-minute meeting where the last 24 hours are discussed and the next 24 hours are planned. The Sprint Review, held at the end of every Sprint, evaluates the last Sprint and tests the developed functionality (the product increment). Lastly, there is the Sprint Retrospective, also held after every Sprint, which has the goal of improving the Sprints in general.
To make this clearer, here is the general procedure of a Sprint:
Each Sprint starts with a Sprint Planning meeting, which is facilitated by the Scrum Master and attended by the Product Owner, the Development Team and (optionally) other stakeholders. They sit down together and select high-priority items from the Product Backlog that the Development Team can commit to delivering in a single Sprint. The selected items are known as the Sprint Backlog. The Development Team works only on the items in the Sprint Backlog for the duration of the Sprint; new issues usually have to wait for the next Sprint. The Daily Scrum is a short stand-up meeting attended by the Scrum Master, the Product Owner and the Development Team. The Sprint Review often includes a demo to stakeholders, along with an examination of what went well and what could be improved; the goal is to make each Sprint more efficient and effective than the last. At the end of the Sprint, completed items are packaged for release (note that some teams release more often than this) and any incomplete items are returned to the Product Backlog.
1.) Product Owner
The Product Owner is always a single person, who is fully responsible for the product. He writes the requirements that the final product has to fulfill into the so-called "Product Backlog", which we will discuss later on.
The Product Owner represents, to some degree, the customers of the product and therefore their needs. But this is only one part of his role. He is also the bridge between the actual customer and the Development Team, and therefore responsible for quality assurance of the final product. He is also responsible for considering the needs and ideas of the Development Team, so that everyone feels comfortable and knows that his opinion is taken into account with regard to the Product Backlog.
2.) Scrum Master
It is the job of the Scrum Master to help the Product Owner and the Development Team develop and maintain good habits that are in line with the Scrum methodology.
The Scrum Master is responsible for the implementation of and compliance with the Scrum rules and processes within the company. This involves making sure that every person understands the Scrum framework and works in line with it; basically, he improves the overall understanding of Scrum within the company. If the company's people understood Scrum completely, he would be nearly unnecessary. The Scrum Master also acts as a bridge between the Scrum team and the stakeholders. He isn't part of any hierarchy within the company and therefore has a good general overview. He also helps the Product Owner with the planning and execution of Scrum events, and he can support the Development Team with the so-called "Sprint Backlog", which we will discuss later on. In addition, he moderates the Sprint Planning, which is also attended by the Product Owner and the Development Team, as well as the Daily Scrum meeting.
3.) The Development Team
The Development Team consists of the developers and technical experts who are responsible for creating the product increment of a Sprint (we will also discuss this later on). As mentioned, in Agile and therefore also in Scrum, teams are self-organized. The Development Team is also responsible for creating the "Sprint Backlog" and updating it daily. During a Sprint, the Development Team works constantly toward the goals of the current Sprint. The team meets daily for the so-called "Daily Scrum", a roughly 15-minute meeting where the last 24 hours are discussed and the next 24 hours are planned. It is also the responsibility of the Development Team to present the new product increment during the Sprint Review. If an expectation is not fulfilled, which only the Product Owner can decide, the whole team is responsible for it. The Development Team also takes part in the Sprint Retrospective.
Scrum is based on empirical process management, which has three main parts. These are the following:
Transparency: Everyone involved in an agile project knows what is going on at the moment and how the overall progress is going; everyone is completely transparent about what he is doing.
Inspection: Everyone involved in the project should frequently evaluate the product. For example, the team openly and transparently shows the product at the end of each sprint.
Adaptation: This means having the ability to adapt based on the results of the inspection.
Now we will discuss the three different Artifacts of Scrum. An Artifact is a tangible by-product produced during the product development.
1.) Product Backlog
The Product Backlog is an ordered list of all the possible requirements for the product. The Product Owner is responsible for the Product Backlog: for its access, its content, the order of its contents and its prioritization. Note that every product has only one Product Backlog and only one Product Owner, even if there are many Development Teams. It is important to understand that a Product Backlog can never be "complete", since it constantly evolves along with the product. Typical entries are the description, order and priority of a functionality and the value that the functionality provides for the product. Since markets and technology in general are constantly evolving, the requirements for the product, and therefore the Product Backlog, are constantly changing. The Scrum Team (Product Owner and Development Team) improves the Backlog together, which is of course a continuous process.
The priority of a product requirement within the Product Backlog plays an important role: the higher it is, the more clearly and in more detail the requirement is formulated. This is because the high-priority requirements are the ones that will soon move to the Sprint Backlog and therefore need to be clear enough.
2.) Sprint Backlog
The Sprint Backlog consists of the Product Backlog entries with the highest priority, selected with regard to the currently available capacities within the company.
It is the plan of which functionalities are included in the next product increment: basically, the goal plan of the things that the Development Team needs to achieve within the next Sprint. It therefore includes the required tasks that need to be done and the criteria the functionalities need to fulfill. It should also be clear enough to make the team's progress visible during the Daily Scrum meeting, which is why it is constantly evolving and gets updated regularly. The members of the Development Team are the only ones allowed to work on the Sprint Backlog, which they do during their normal work. During the Sprint Review, the Sprint Backlog is used to verify the goals.
3.) Product Increment
At the end of every Sprint Review, the product increment gets evaluated. It has to fulfill all the requirements from all previous Sprints, which means it has to be compatible with them. This is because every product increment is just one part of the whole final product and therefore has to fit together with the other parts.
Natural language processing (NLP) is an area of computer science and artificial intelligence that is concerned with the interaction between computers and humans in natural language. The ultimate goal of NLP is to enable computers to understand language as well as we do. It is the driving force behind things like virtual assistants, speech recognition, sentiment analysis, automatic text summarization, machine translation and much more. In this post, you will learn the basics of natural language processing, dive into some of its techniques and also learn how NLP benefited from the recent advances in Deep Learning.
Table of Contents:
Why NLP is difficult
Syntactic and Semantic Analysis
Named entity recognition
Deep Learning and NLP
Natural Language Processing (NLP) is the intersection of Computer Science, Linguistics and Machine Learning that is concerned with the communication between computers and humans in natural language. NLP is all about enabling computers to understand and generate human language. Applications of NLP techniques are Voice Assistants like Alexa and Siri but also things like Machine Translation and text-filtering. NLP is one of the fields that heavily benefited from the recent advances in Machine Learning, especially from Deep Learning techniques. The field is divided into the three following parts:
Speech Recognition – The translation of spoken language into text.
Natural Language Understanding – The computer's ability to understand what we say.
Natural Language Generation – The generation of natural language by a computer.
II. Why NLP is difficult
Human language is special for several reasons. It is specifically constructed to convey the speaker's or writer's meaning. It is a complex system, although little children can learn it pretty quickly. Another remarkable thing about human language is that it is all about symbols. According to Chris Manning (Machine Learning professor at Stanford University), it is a discrete, symbolic, categorical signaling system. This means that you can convey the same meaning in different ways, such as speech, gestures or signs. The encoding by the human brain, however, is a continuous pattern of activation, and the symbols are transmitted via continuous signals of sound and vision.
Understanding human language is considered a difficult task due to its complexity. For example, there is an infinite number of different ways to arrange words in a sentence. Also, words can have several meanings, and contextual information is necessary to correctly interpret sentences. Every language is more or less unique and ambiguous. Just take a look at the following newspaper headline: "The Pope's baby steps on gays". This sentence clearly has two very different interpretations, which is a pretty good example of the challenges in NLP.
Note that a perfect understanding of language by a computer would result in an AI that can process the whole information that is available on the internet, which in turn would probably result in artificial general intelligence.
III. Syntactic & Semantic Analysis
Syntactic Analysis (syntax) and Semantic Analysis (semantics) are the two main techniques that lead to the understanding of natural language. Language is a set of valid sentences, but what makes a sentence valid? You can break validity down into two things: syntax and semantics. The term syntax refers to the grammatical structure of the text, whereas the term semantics refers to the meaning conveyed by it. However, a sentence that is syntactically correct does not have to be semantically correct. Just take a look at the following example: the sentence "cows flow supremely" is grammatically valid (subject – verb – adverb) but does not make any sense.
Syntactic Analysis, also called syntax analysis or parsing, is the process of analyzing natural language according to the rules of a formal grammar. Grammatical rules are applied to categories and groups of words, not individual words. Syntactic Analysis basically assigns a syntactic structure to text.
For example, a sentence includes a subject and a predicate, where the subject is a noun phrase and the predicate is a verb phrase. Take a look at the following sentence: "The dog (noun phrase) went away (verb phrase)". Note that we can combine every noun phrase with a verb phrase. As already mentioned, sentences formed like this don't really have to make sense, even though they are syntactically correct.
For us as humans, the way we understand what someone has said is an unconscious process that relies on our intuition and our knowledge about language itself. Therefore, the way we understand language is heavily based on meaning and context. Since computers cannot rely on these techniques, they need a different approach. The word "semantic" is a linguistic term and means something related to meaning or logic.
Therefore, Semantic Analysis is the process of understanding the meaning and interpretation of words, signs and sentence structure. This partly enables computers to understand natural language the way humans do, involving meaning and context. I say partly because Semantic Analysis is one of the toughest parts of NLP and is not fully solved yet. For example, Speech Recognition has become very good and works almost flawlessly, but we are still lacking this kind of proficiency in Natural Language Understanding (i.e. semantics). Your phone basically understands what you have said, but often can't do anything with it because it doesn't understand the meaning behind it. Also, note that some of the technologies out there only make you think they understand the meaning of a text. An approach based on keywords or statistics, or even pure machine learning, may be using a matching or frequency technique for clues as to what a text is "about". These methods are limited because they are not looking at the real underlying meaning.
IV. Techniques to understand Text
In the following, we will discuss many of the most popular techniques used for Natural Language Processing. Note that some of them are closely intertwined and only serve as subtasks of larger problems.
What is parsing? First of all, let's look into the dictionary:
“resolve a sentence into its component parts and describe their syntactic roles.”
That actually nails it, but it could be a little more comprehensive. Parsing refers to the formal analysis of a sentence by a computer into its constituents, resulting in a parse tree that shows their syntactic relation to each other in visual form and that can be used for further processing and understanding.
Below you can see a parse tree of the sentence "The thief robbed the apartment", along with a description of the three different information types conveyed by it.
If we look at the letters directly above the single words, we can see that they show the part of speech of each word (noun, verb and determiner). If we look one level higher, we see a hierarchical grouping of words into phrases. For example, "the thief" is a noun phrase, "robbed the apartment" is a verb phrase, and together they form a sentence, which is marked one level higher.
But what is actually meant by a noun or verb phrase? Let's explain this with the example of the noun phrase. Noun phrases are phrases of one or more words that contain a noun and possibly descriptive words, verbs or adverbs. The idea is to group nouns together with the words that relate to them.
A parse tree also provides us with information about the grammatical relationships of the words due to the structure of their representation. For example, we can see in the structure that „the thief“ is the subject of „robbed“.
By structure I mean that we have the verb ("robbed"), which is marked with a "V" above it and a "VP" above that, which is linked via an "S" to the subject ("the thief"), which has an "NP" above it. This is like a template for a subject–verb relationship, and there are many others for other types of relationships.
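To make this concrete, here is a minimal sketch of the parse tree described above, represented as nested tuples of `(label, children...)`. The tree structure and the small helper functions are illustrative only, not the output of a real parser:

```python
# The parse tree of "The thief robbed the apartment" as nested tuples:
# each node is (label, child, child, ...) and leaves are plain strings.
tree = ("S",
        ("NP", ("DT", "The"), ("N", "thief")),
        ("VP", ("V", "robbed"),
               ("NP", ("DT", "the"), ("N", "apartment"))))

def leaves(node):
    """Collect the words under a tree node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def subject(sentence_tree):
    """The subject is the NP that appears directly under S."""
    for child in sentence_tree[1:]:
        if not isinstance(child, str) and child[0] == "NP":
            return " ".join(leaves(child))
    return None

print(subject(tree))  # prints "The thief"
```

Walking the structure like this is exactly what "further processing" on a parse tree means: the subject–verb relationship is recovered from the shape of the tree, not from the raw word order.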
Stemming is a technique that comes from morphology and information retrieval and is used in NLP for preprocessing and efficiency purposes. But let us first look into the dictionary to see what stemming actually means:
– “originate in or be caused by.”
Basically, stemming is the process of reducing words to their word stem. But what is actually meant by "stem"? A stem is the part of a word that remains after the removal of all affixes. For example, if you take a look at the word "touched", its stem would be "touch". "Touch" is also the stem of "touching", and so on.
You may be asking yourself why we even need the stem. The stem is needed because you will encounter different variations of words that actually have the same stem and the same meaning. Let's take a look at two example sentences:
# I was taking a ride in the car
# I was riding in the car.
These two sentences mean the exact same thing and the use of the word is identical.
Now, imagine all the English words in the vocabulary with all their different affixations at the end of them. Storing them all would require a huge database containing many words that actually mean the same thing. This is solved by focusing only on a word's stem, through stemming. One popular algorithm is the Porter Stemming Algorithm from 1979, which still works pretty well.
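As a toy illustration of the idea, here is a deliberately simplified suffix-stripping stemmer. It is far cruder than the Porter algorithm, which applies ordered rule sets with conditions on the remaining stem, but it shows both the mechanism and its pitfalls:

```python
# A simplified suffix-stripping stemmer (illustrative only, not Porter).
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    """Strip the longest matching suffix, keeping at least 3 letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("touched"))   # "touch"
print(stem("touching"))  # "touch"
print(stem("riding"))    # "rid"  (not "ride": naive stripping makes mistakes)
```

Note the last example: "riding" becomes "rid" rather than "ride", which is exactly the kind of error that the condition rules in real stemmers try to avoid.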
Text segmentation in NLP is the process of transforming text into meaningful units, which can be words, sentences, different topics, the underlying intent and much more. Mostly, the text is segmented into its component words, which can be a difficult task depending on the language. This is again due to the complexity of human language. For example, it works relatively well in English to separate words by spaces, except for words like "ice box" that belong together but are separated by a space. The problem is that people sometimes also write it as "ice-box".
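A rough word-segmentation sketch for English shows the limitation just described: splitting on whitespace and punctuation (wrongly) breaks "ice box" into two tokens, while the hyphenated spelling survives as one. The regular expression here is a simple illustrative heuristic, not a production tokenizer:

```python
import re

# Naive English word segmentation: keep letter/digit runs, allow
# internal hyphens so "ice-box" stays one token.
def segment(text):
    return re.findall(r"[A-Za-z0-9']+(?:-[A-Za-z0-9']+)*", text)

print(segment("Put the milk in the ice box."))
# ['Put', 'the', 'milk', 'in', 'the', 'ice', 'box']
print(segment("Put the milk in the ice-box."))
# ['Put', 'the', 'milk', 'in', 'the', 'ice-box']
```

The two outputs differ only because of the hyphen, even though the intended meaning is identical; handling such multi-word units is what makes segmentation harder than it first appears.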
Named Entity Recognition
Named Entity Recognition (NER) concentrates on determining which items in a text ("named entities") can be located and classified into predefined categories. These categories range from the names of persons, organizations and locations to monetary values and percentages.
Just take a look at the following example:
Before NER: "Martin bought 300 shares of SAP in 2016."
After NER: "[Martin]Person bought 300 shares of [SAP]Organization in [2016]Time."
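The example above can be reproduced with a toy gazetteer-based tagger. Real NER systems use statistical models trained on annotated text; a dictionary lookup plus a simple year pattern is only a sketch of the input/output behavior, and the `GAZETTEER` entries are assumptions for this one sentence:

```python
import re

# Toy NER: tag known names from a hand-made dictionary ("gazetteer")
# and tag four-digit years as Time expressions.
GAZETTEER = {"Martin": "Person", "SAP": "Organization"}

def tag_entities(text):
    tagged = text
    for name, label in GAZETTEER.items():
        tagged = tagged.replace(name, f"[{name}]{label}")
    # Years between 1000 and 2099 get tagged as Time.
    tagged = re.sub(r"\b(1\d{3}|20\d{2})\b", r"[\1]Time", tagged)
    return tagged

print(tag_entities("Martin bought 300 shares of SAP in 2016."))
# [Martin]Person bought 300 shares of [SAP]Organization in [2016]Time.
```

Note that "300" is left untagged: the patterns are deliberately narrow, which is also why dictionary-based NER breaks down on unseen names and is replaced by learned models in practice.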
Relationship Extraction takes the named entities of „Named Entity Recognition“ and tries to identify the semantic relationships between them. This could mean for example finding out who is married to whom, that a person works for a specific company and so on. This problem can also be transformed into a classification problem where you can train a Machine Learning model for every relationship type.
With Sentiment Analysis, we want to determine the attitude (i.e. the sentiment) of, for example, a speaker or writer with respect to a document, interaction or event. It is therefore a natural language processing problem where text needs to be understood to predict the underlying intent. The sentiment is mostly categorized into positive, negative and neutral. With the use of Sentiment Analysis, we can, for example, predict a customer's opinion and attitude about a product based on a review they wrote. Because of that, Sentiment Analysis is widely applied to things like reviews, surveys, documents and much more.
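The simplest possible baseline for this is a lexicon-based approach: count positive and negative words and compare. The word lists below are tiny illustrative assumptions; the sketch also makes the method's main weakness visible, since it has no notion of negation or context (a point we will return to in the Deep Learning section):

```python
# Minimal lexicon-based sentiment: count signed words.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text):
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this great product"))      # positive
print(sentiment("terrible quality, bad support"))  # negative
```

A trained model replaces these hand-written word lists with weights learned from labeled reviews, but the input/output shape of the task is exactly this.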
Now we know a lot about Natural Language Processing, but the question that remains is: how do we use Deep Learning in NLP?
Central to Deep Learning and natural language is "word meaning", where a word, and especially its meaning, is represented as a vector of real numbers. With these vectors that represent words, we place words in a high-dimensional space. The interesting thing about this is that the words, represented by vectors, form a semantic space. This simply means that words that are similar and have a similar meaning tend to cluster together in this high-dimensional vector space. You can see a visual representation of word meaning below:
You can find out what a group of clustered words means by applying Principal Component Analysis (PCA) or dimensionality reduction with t-SNE, but this can sometimes be misleading because these techniques oversimplify and leave a lot of information out. So this is a good way to start (like logistic or linear regression in data science), but it isn't cutting-edge and it is possible to do much better.
We can also think of parts of words as vectors that represent their meaning. Imagine the word "undesirability". Using a morphological approach, which considers the different parts a word is made of, we would think of it as being built out of morphemes (word parts) like this: "un + desire + able + ity". Every morpheme gets its own vector. This allows us to build a neural network that composes the meaning of a larger unit out of all of these morphemes.
Deep Learning can also make sense of the structure of sentences through syntactic parsers that figure out this structure. Google uses dependency parsing techniques like this, although in a more complex and larger manner, in its Parsey McParseface model and SyntaxNet.
By knowing the structure of sentences, we can start trying to understand their meaning. As already discussed, we start off with the meaning of words being vectors, but we can also do this with whole phrases and sentences, whose meaning is likewise represented as a vector. And if we want to know the relationship between sentences, we can train a neural network to make that decision for us.
Deep Learning also works well for Sentiment Analysis. Just look at the following movie review: "This movie does not care about cleverness, wit or any other kind of intelligent humor." A traditional approach would have fallen into the trap of thinking this is a positive review, because "cleverness or any other kind of intelligent humor" sounds like positive intent, but a neural network would recognize its real meaning. Other applications are chatbots, Machine Translation, Siri, Google Inbox's suggested replies and so on.
There also have been huge advancements in Machine Translation through the rise of Recurrent Neural Networks, about which I also wrote a blog-post.
In Machine Translation done by Deep Learning algorithms, language is translated by starting with a sentence and generating vector representations that represent it. Then the model starts to generate words in another language that convey the same information.
To summarize, NLP in combination with Deep Learning is all about vectors that represent words, phrases etc. and also to some degree their meanings.
In this post, you’ve learned a lot about Natural Language Processing. Now you know why NLP is such a difficult thing and why a perfect language understanding would probably result in artificial general intelligence. We’ve discussed the difference between syntactic and semantic analysis and learned about some NLP techniques that enable us to analyze and generate language. To summarize, the techniques we’ve discussed were parsing, stemming, text segmentation, named entity recognition, relationship extraction, and sentiment analysis. On top of that, we’ve discussed how deep learning managed to accelerate NLP by the concept of representing words, phrases, sentences and so on as numeric vectors.
Logistic Regression is one of the most used Machine Learning algorithms for binary classification. It is a simple algorithm that you can use as a performance baseline; it is easy to implement and will do well enough in many tasks. Therefore every Machine Learning engineer should be familiar with its concepts. The building-block concepts of Logistic Regression are also helpful in deep learning when building neural networks. In this post, you will learn what Logistic Regression is, how it works, what its advantages and disadvantages are, and much more.
Table of contents:
What is Logistic Regression?
How it works
Logistic VS. Linear Regression
Advantages / Disadvantages
When to use it
Other Classification Algorithms
What is Logistic Regression?
Like many other machine learning techniques, it is borrowed from the field of statistics, and despite its name, it is not an algorithm for regression problems, where you want to predict a continuous outcome. Instead, Logistic Regression is the go-to method for binary classification. It gives you a discrete binary outcome, either 0 or 1. To put it in simpler words, its outcome is either one thing or another.
A simple example of a Logistic Regression problem would be an algorithm used for cancer detection that takes a screening picture as input and should tell whether a patient has cancer (1) or not (0).
How it works
Logistic Regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function.
The logistic function, also called the sigmoid function, is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, though never exactly at those limits. These probabilities must then be transformed into binary values in order to actually make a prediction, which is done by a threshold classifier.
The picture below illustrates the steps that logistic regression goes through to give you your desired output.
Below you can see what the logistic function (sigmoid function) looks like:
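The two steps just described can be sketched directly in a few lines. The 0.5 threshold is the usual default; it is a tunable assumption, not a fixed part of the algorithm:

```python
import math

# Step 1: the sigmoid squashes any real number into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 2: a threshold turns that probability into a hard 0/1 prediction.
def predict(z, threshold=0.5):
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))   # 0.5
print(predict(-2))  # 0
print(predict(3))   # 1
```

Shifting the threshold trades off the two kinds of error: in the cancer-screening example, you might lower it below 0.5 so that borderline cases are flagged for review rather than missed.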
We want to maximize the likelihood that a random data point gets classified correctly; this is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating the parameters of a statistical model. You can maximize the likelihood using different optimization algorithms. Newton's Method is one such algorithm and can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Instead of Newton's Method, you could also use Gradient Descent.
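As a bare-bones sketch of this, the snippet below fits a one-feature logistic regression by gradient ascent on the log-likelihood, whose gradient has the simple form sum of (y − p)·x for the weight and sum of (y − p) for the bias. The toy data, learning rate and step count are all illustrative assumptions:

```python
import math

# Toy 1-D data: x below ~2.5 is labeled 0, above is labeled 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0,   0,   0,   0,   1,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    # Log-likelihood gradient: residual (y - p) times the feature.
    grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
    grad_b = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
    w += lr * grad_w   # ascent: move *up* the gradient
    b += lr * grad_b

predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
print(predictions)  # matches ys on this separable toy set
```

Newton's Method would replace the fixed learning rate with a step scaled by the inverse Hessian, converging in far fewer iterations; the gradient itself is the same.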
Logistic VS. Linear Regression
You may be asking yourself what the difference between logistic and linear regression is. Logistic regression gives you a discrete outcome, but linear regression gives a continuous outcome. A good example of a continuous outcome would be a model that predicts the value of a house. That value will always be different based on parameters like its size or location. A discrete outcome will always be one thing (you have cancer) or another (you have no cancer).
Advantages / Disadvantages
It is a widely used technique because it is very efficient, does not require too many computational resources, it’s highly interpretable, it doesn’t require input features to be scaled, it doesn’t require any tuning, it’s easy to regularize, and it outputs well-calibrated predicted probabilities.
Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. Therefore Feature Engineering plays an important role in regards to the performance of Logistic and also Linear Regression. Another advantage of Logistic Regression is that it is incredibly easy to implement and very efficient to train. I typically start with a Logistic Regression model as a benchmark and try using more complex algorithms from there on.
Because of its simplicity and the fact that it can be implemented relatively easy and quick, Logistic Regression is also a good baseline that you can use to measure the performance of other more complex Algorithms.
A disadvantage is that we can't solve non-linear problems with Logistic Regression, since its decision surface is linear. Just take a look at the example below with its two binary features.
It is clearly visible that we can't draw a line that separates these two classes without a huge error. A simple decision tree would be a much better choice here.
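You can verify this claim mechanically for an XOR-style layout of the two binary features (the classic non-linearly-separable case, assumed here to match the figure): a brute-force search over a grid of candidate lines finds none that classifies all four points correctly.

```python
import itertools

# XOR-like layout: diagonal corners share a class.
points = [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]

def separates(w1, w2, b):
    """Does the line w1*x + w2*y + b = 0 classify every point correctly?"""
    return all((w1 * x + w2 * y + b >= 0) == (label == 1)
               for (x, y), label in points)

grid = [i / 2 for i in range(-8, 9)]  # candidate coefficients -4.0 ... 4.0
found = any(separates(w1, w2, b)
            for w1, w2, b in itertools.product(grid, repeat=3))
print(found)  # False: no linear boundary gets all four points right
```

The grid search is of course no proof, but for XOR the impossibility also holds exactly: any line putting both diagonal pairs on opposite sides leads to a contradiction, which is why a non-linear model such as a decision tree is needed.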
Logistic Regression is also not one of the most powerful algorithms out there and can easily be outperformed by more complex ones. Another disadvantage is its high reliance on a proper presentation of your data: logistic regression is not a useful tool unless you have already identified all the important independent variables. Since its outcome is discrete, Logistic Regression can only predict a categorical outcome. It is also an algorithm known for its vulnerability to overfitting.
When to use it
As already mentioned, Logistic Regression separates your input into two "regions" with a linear boundary, one for each class. It therefore requires that your data is linearly separable, like the data points in the image below:
In other words: you should think about using Logistic Regression when your Y variable takes on only two values (i.e. when you are facing a binary classification problem). Note that you can also use Logistic Regression for multiclass classification, which will be discussed in the next section.
There are algorithms that can deal with multiple classes by themselves, like Random Forest classifiers or the Naive Bayes classifier. There are also algorithms that can't, like Logistic Regression, but with some tricks you can predict multiple classes with them too.
Let's discuss the most common of these "tricks" using the example of the MNIST dataset, which contains images of handwritten digits ranging from 0 to 9. This is a classification task where our algorithm should tell us which digit is in an image.
1) one-versus-all (OvA)
With this strategy, you train 10 binary classifiers, one for each digit. This simply means training one classifier to detect 0s, one to detect 1s, one to detect 2s and so on. When you then want to classify an image, you just look at which classifier has the highest decision score.
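A hedged sketch of OvA with sklearn (using its small built-in digits dataset as a stand-in for MNIST; OneVsRestClassifier is sklearn's name for this strategy):

```python
# One-versus-all: one binary Logistic Regression per digit class.
# sklearn's small built-in digits dataset stands in for MNIST here.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trains 10 binary classifiers and predicts the class with the highest score.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ova.fit(X_train, y_train)
print(len(ova.estimators_))      # 10 binary classifiers, one per digit
print(ova.score(X_test, y_test))
```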
2) one-versus-one (OvO)
Here you train a binary classifier for every pair of digits. This means training a classifier that can distinguish between 0s and 1s, one that can distinguish between 0s and 2s, one that can distinguish between 1s and 2s, and so on. If there are N classes, you need to train N × (N - 1) / 2 classifiers, which is 45 in the case of the MNIST dataset.
When you then want to classify an image, you run each of these 45 classifiers and choose the class that wins the most of these pairwise duels. This strategy has one big advantage: each classifier only needs to be trained on the part of the training set that contains the 2 classes it distinguishes between. Algorithms like Support Vector Machine classifiers don't scale well to large datasets, which is why for them the OvO strategy does better, because it is faster to train a lot of classifiers on small datasets than one classifier on a large dataset.
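The OvO strategy can be sketched the same way with sklearn's OneVsOneClassifier, again on the built-in digits dataset:

```python
# One-versus-one: one binary classifier per pair of digit classes,
# 10 * 9 / 2 = 45 classifiers in total for the 10 digits.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovo.estimators_))  # 45
```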
For most algorithms, sklearn recognizes when you use a binary classifier for a multiclass classification task and automatically applies the OvA strategy. There is an exception: when you use a Support Vector Machine classifier, it automatically runs the OvO strategy instead.
Other Classification Algorithms
Other common classification algorithms are Naive Bayes, Decision Trees, Random Forests, Support Vector Machines, k-nearest neighbors and many others. We will also discuss them in future blog posts, but don't feel overwhelmed by the number of Machine Learning algorithms that are out there. It is better to know 4 or 5 algorithms really well and to concentrate your energy on feature engineering, but this is also a topic for future posts.
In this post, you have learned what Logistic Regression is and how it works. You now have a solid understanding of its advantages and disadvantages and know when you can use it. Also, you have discovered ways to use Logistic Regression to do multiclass classification with sklearn and why it is a good baseline to compare other Machine Learning algorithms with.
Transfer Learning is the reuse of a pre-trained model on a new problem. It is currently very popular in the field of Deep Learning because it enables you to train Deep Neural Networks with comparatively little data. This is very useful since most real-world problems typically do not have millions of labeled data points to train such complex models. This blog post is intended to give you an overview of what Transfer Learning is, how it works, why you should use it and when you can use it. It will introduce you to the different approaches of Transfer Learning and provide you with some resources on already pre-trained models.
Table of Contents:
What is it?
How it works
Why is it used?
When you should use it
Approaches to Transfer Learning
Training a Model to Reuse it
Using a Pre-Trained Model
What is it?
In Transfer Learning, the knowledge of an already trained Machine Learning model is applied to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses.
With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in another. We transfer the weights that a network has learned in Task A to a new Task B.
The general idea is to use knowledge, that a model has learned from a task where a lot of labeled training data is available, in a new task where we don’t have a lot of data. Instead of starting the learning process from scratch, you start from patterns that have been learned from solving a related task.
Transfer Learning is mostly used in Computer Vision and Natural Language Processing Tasks like Sentiment Analysis, because of the huge amount of computational power that is needed for them.
Strictly speaking, Transfer Learning is not a Machine Learning technique in itself. It can be seen as a "design methodology" within Machine Learning, like, for example, active learning. It is also not an exclusive part or area of study within Machine Learning. Nevertheless, it has become quite popular in combination with Neural Networks, since they require huge amounts of data and computational power.
How it works
For example, in computer vision, Neural Networks usually try to detect edges in their earlier layers, shapes in their middle layers and task-specific features in their later layers. With transfer learning, you keep the early and middle layers and only re-train the later layers. This helps us leverage the labeled data of the task the model was initially trained on.
Let's go back to the example of a model trained to recognize a backpack in an image, which we now want to use to identify sunglasses. In its earlier layers, the model has learned to recognize objects in general, so we will only re-train the later layers, so that it learns what separates sunglasses from other objects.
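As a rough sketch of this idea (assuming TensorFlow/Keras is installed; the tiny network below is hypothetical, not a real backpack model), freezing the earlier layers and re-training only the last one looks like this:

```python
# Hedged sketch: freeze the earlier layers of a (hypothetical) trained
# network and leave only the task-specific last layer trainable.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64,)),
    layers.Dense(32, activation="relu"),    # "early" layer: generic features
    layers.Dense(16, activation="relu"),    # "middle" layer
    layers.Dense(2, activation="softmax"),  # task-specific output layer
])

# Freeze everything except the last layer before re-training on the new task.
for layer in model.layers[:-1]:
    layer.trainable = False

for layer in model.layers:
    print(layer.name, layer.trainable)
```

Only the last layer's weights would now be updated when you compile and fit the model on the new task; the frozen layers keep the knowledge from the original task.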
In Transfer Learning, we try to transfer as much knowledge as possible from the previous task, the model was trained on, to the new task at hand. This knowledge can be in various forms depending on the problem and the data. For example, it could be how models are composed which would allow us to more easily identify novel objects.
Why is it used?
Using Transfer Learning has several benefits that we will discuss in this section. The main advantages are that you save training time, that your Neural Network performs better in most cases, and that you don't need a lot of data.
Usually, you need a lot of data to train a Neural Network from scratch, but you don't always have access to enough of it. That is where Transfer Learning comes into play: with it, you can build a solid Machine Learning model with comparatively little training data, because the model is already pre-trained. This is especially valuable in Natural Language Processing (NLP), because expert knowledge is usually required to create large labeled datasets. You also save a lot of training time, because it can sometimes take days or even weeks to train a deep Neural Network from scratch on a complex task.
According to Demis Hassabis, the CEO of DeepMind Technologies, Transfer Learning is also one of the most promising techniques that could someday lead us to Artificial General Intelligence (AGI):
When you should use it
As is always the case in Machine Learning, it is hard to form rules that are generally applicable, but I will provide you with some guidelines.
You would typically use Transfer Learning when (a) you don’t have enough labeled training data to train your network from scratch and/or (b) there already exists a network that is pre-trained on a similar task, which is usually trained on massive amounts of data. Another case where its use would be appropriate is when Task-1 and Task-2 have the same input.
If the original model was trained using TensorFlow, you can simply restore it and re-train some layers for your task. Note that Transfer Learning only works if the features learned from the first task are general, meaning that they can be useful for another related task as well. Also, the input of the model needs to have the same size as it was initially trained with. If you don’t have that, you need to add a preprocessing step to resize your input to the needed size.
Approaches to Transfer Learning
Now we will discuss different approaches to Transfer Learning. Note that these have different names throughout literature but the overall concept is mostly the same.
1. Training a Model to Reuse it
Imagine you want to solve Task A but don't have enough data to train a Deep Neural Network. One way around this issue would be to find a related Task B, where you have an abundance of data. You could then train a Deep Neural Network on Task B and use this model as a starting point to solve your initial Task A. Whether you use the whole model or only a few layers of it depends heavily on the problem you are trying to solve.
If you have the same input in both tasks, you might be able to reuse the model as-is and make predictions for your new input. Alternatively, you could change and re-train the task-specific layers and the output layer.
2. Using a Pre-Trained Model
Approach 2 would be to use an already pre-trained model. There are a lot of these models out there, so you have to do a little bit of research. How many layers you reuse and how many you re-train depends, as I already said, on your problem, and it is therefore hard to form a general rule.
Keras, for example, provides nine pre-trained models that you can use for Transfer Learning, prediction, feature extraction and fine-tuning. You can find these models, along with a brief tutorial on how to use them, here.
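For example, loading one of these models as a frozen base looks roughly like this (a sketch: `weights=None` only builds the architecture without downloading anything, whereas in real Transfer Learning you would pass `weights="imagenet"` to get the pre-trained weights):

```python
# Load the VGG16 architecture as a frozen convolutional base.
# weights=None builds the network without downloading ImageNet weights;
# for actual Transfer Learning you would use weights="imagenet".
from tensorflow.keras.applications import VGG16

base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

print(base.output_shape)  # (None, 7, 7, 512)
```

You would then stack your own task-specific layers on top of `base` and train only those.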
There are also many research institutions that released models they have trained. This type of Transfer Learning is most commonly used throughout Deep Learning.
3. Feature Extraction
Another approach is to use Deep Learning to discover the best representation of your problem, which means finding the most important features. This approach is also known as Representation Learning and can often result in much better performance than can be obtained with hand-designed representations.
Most of the time in Machine Learning, features are manually hand-crafted by researchers and domain experts. Fortunately, Deep Learning can extract features automatically. Note that this does not mean that Feature Engineering and domain knowledge aren't important anymore, because you still have to decide which features you put into your network. But Neural Networks have the ability to learn which of the features you put into them are really important and which ones aren't. A representation learning algorithm can discover a good combination of features within a very short timeframe, even for complex tasks which would otherwise require a lot of human effort.
The learned representation can then be used for other problems as well. You simply use the first layers to find the right representation of features, but you don't use the output of the network, because it is too task-specific. Instead, feed data into your network and use one of the intermediate layers as the output layer. This layer can then be interpreted as a representation of the raw data.
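In Keras this idea can be sketched as follows (assuming TensorFlow/Keras is installed; the small network and the layer name "features" are hypothetical, for illustration only):

```python
# Hedged sketch: reuse an intermediate layer of a (hypothetical) network
# as a feature extractor, ignoring the task-specific output layer.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64,)),
    layers.Dense(32, activation="relu", name="features"),
    layers.Dense(10, activation="softmax"),  # too task-specific to reuse
])

# Build a new model that stops at the intermediate "features" layer.
extractor = keras.Model(inputs=model.inputs,
                        outputs=model.get_layer("features").output)
features = extractor(np.random.rand(5, 64).astype("float32"))
print(features.shape)  # (5, 32)
```

The 32-dimensional output per sample is the learned representation, which you could feed into any other model, including a traditional algorithm.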
This approach is mostly used in Computer Vision because it can reduce the size of your dataset, which decreases computation time and makes it more suitable for traditional algorithms as well.
Popular Pre-Trained Models
There are some pre-trained Machine Learning models out there that have become quite popular. One of them is the Inception-v3 model, which was trained for the ImageNet "Large Visual Recognition Challenge". In this challenge, participants had to classify images into 1,000 classes, like "Zebra", "Dalmatian", and "Dishwasher".
Here you can see a very good tutorial from TensorFlow on how to retrain image classifiers.
Other quite popular models are ResNet and AlexNet. I also encourage you to visit pretrained.ml which is a sortable and searchable compilation of pre-trained deep learning models, along with demos and code.
In this post, you have learned what Transfer Learning is and why it matters. You also discovered how it is done along with some of its benefits. We talked about why it can reduce the size of your dataset, why it decreases training time and why you also need less data when you use it. We discussed when it is appropriate to do Transfer Learning and what are the different approaches to it. Lastly, I provided you with a collection of models that are already pre-trained.
Deep Learning enjoys a massive hype at the moment. People want to use Neural Networks everywhere, but are they always the right choice? That will be discussed in the following sections, along with why Deep Learning is so popular right now. After reading it, you will know the main disadvantages of Neural Networks and you will have a rough guideline when it comes to choosing the right type of algorithm for your current Machine Learning problem. You will also learn about what I think is one of the major problems in Machine Learning we are facing right now.
Table of Contents:
Why Deep Learning is so hyped
Neural Networks vs. traditional Algorithms
Duration of Development
Amount of Data
Why Deep Learning is so hyped
Deep Learning enjoys its current hype for four main reasons: data, computational power, the algorithms themselves, and marketing. We will discuss each of them in the following sections.
1. Data
One of the things that increased the popularity of Deep Learning is the massive amount of data that is available in 2018, which has been gathered over the last years and decades. This enables Neural Networks to really show their potential, since they get better the more data you feed into them.
In comparison, traditional Machine Learning algorithms will certainly reach a level where more data doesn't improve their performance. The chart below illustrates that perfectly:
2. Computational Power
Another very important reason is the computational power that is available nowadays, which enables us to process more data. According to Ray Kurzweil, a leading figure in Artificial Intelligence, computational power is multiplied by a constant factor for each unit of time (e.g., doubling every year) rather than just being added to incrementally. This means that computational power is increasing exponentially.
3. Algorithms
The third factor that increased the popularity of Deep Learning is the advances that have been made in the algorithms themselves. These recent breakthroughs are mostly due to making the algorithms run much faster than before, which makes it possible to use more and more data.
4. Marketing
Marketing was also important. Neural Networks have been around for decades (first proposed in 1944) and have already experienced several hypes, but also times when no one wanted to believe in or invest in them. The phrase "Deep Learning" gave them a fancy new name, which made a new hype possible; this is also the reason why many people wrongly think that Deep Learning is a newly created field.
Other things contributed to the marketing of Deep Learning as well, like, for example, the controversial "humanoid" robot Sophia from Hanson Robotics and several breakthroughs in major fields of Machine Learning that made it into the mass media, and much more.
Neural Networks vs. traditional Algorithms
When you should use Neural Networks or traditional Machine Learning algorithms is a hard question to answer, because it depends heavily on the problem you are trying to solve. This is also due to the "no free lunch theorem", which roughly states that there is no "perfect" Machine Learning algorithm that will perform well on every problem. For every problem, one method is well suited and achieves good results while another fails badly. I personally see this as one of the most interesting parts of Machine Learning. It is also the reason why you need to be proficient with several algorithms, and why getting your hands dirty through practice is the only way to become a good Machine Learning Engineer or Data Scientist. Nevertheless, I will provide you with some guidelines in this post that should help you better understand when you should use which type of algorithm.
The main advantage of Neural Networks lies in their ability to outperform nearly every other Machine Learning algorithm, but this goes along with some disadvantages that we will discuss and focus on in this post. As I already mentioned, whether or not you should use Deep Learning depends mostly on the problem you are trying to solve. For example, in cancer detection, high performance is crucial, because the better the performance, the more people can be treated. But there are also Machine Learning problems where a traditional algorithm delivers a more than satisfying result.
1. Black Box Nature
The probably best-known disadvantage of Neural Networks is their "black box" nature, meaning that you don't know how and why your network came up with a certain output. For example, when you put an image of a cat into a Neural Network and it predicts it to be a car, it is very hard to understand what caused it to come up with this prediction. When you have features that are human-interpretable, it is much easier to understand the cause of the mistake. In comparison, algorithms like Decision Trees are very interpretable. This matters, because in some domains interpretability is quite important.
This is why a lot of banks don't use Neural Networks to predict whether a person is creditworthy: they need to explain to their customers why they don't get a loan. Otherwise, the person may feel wrongly treated by the bank, because he cannot understand why he doesn't get a loan, which could lead him to change his bank. The same thing is true for sites like Quora. If they decided to delete a user's account because of a Machine Learning algorithm, they would need to explain to the user why they did it. I doubt users would be satisfied with an answer such as "that's what the computer said".
Another scenario would be important business decisions driven by Machine Learning. Can you imagine the CEO of a big company making a decision about millions of dollars without understanding why it should be done, just because "the computer" says so?
2. Duration of Development
Although there are libraries like Keras out there, which make the development of Neural Networks fairly simple, you sometimes need more control over the details of the algorithm, for example when you are trying to solve a difficult problem with Machine Learning that no one has ever solved before.
Then you would probably use TensorFlow, which gives you many more possibilities, but which is also more complicated, so that development takes much longer (depending on what you want to build). The question then arises for a company's management whether it is really worth it for their expensive engineers to spend weeks developing something that may be solved much faster with a simpler algorithm.
3. Amount of Data
Neural Networks usually require much more data than traditional Machine Learning algorithms: at least thousands, if not millions, of labeled samples. This isn't an easy problem to deal with, and many Machine Learning problems can be solved well with less data if you use other algorithms.
Although there are some cases where Neural Networks deal well with little data, most of the time they don't. In these cases, a simple algorithm like Naive Bayes, which deals much better with little data, would be the appropriate choice.
4. Computationally Expensive
Usually, Neural Networks are also more computationally expensive than traditional algorithms. State-of-the-art Deep Learning algorithms, which realize successful training of really deep Neural Networks, can take several weeks to train from scratch. Most traditional Machine Learning algorithms take much less time to train, ranging from a few minutes to a few hours or days.
The amount of computational power needed for a Neural Network depends heavily on the size of your data but also on how deep and complex your Network is. For example, a Neural Network with one layer and 50 neurons will be much faster than a Random Forest with 1,000 trees. In comparison, a Neural Network with 50 layers will be much slower than a Random Forest with only 10 trees.
Great! Now you know that Neural Networks are great for some tasks but not as great for others. You learned that huge amounts of data, more computational power, better algorithms and intelligent marketing increased the popularity of Deep Learning and made it into one of the hottest fields right now. On top of that, you have learned that Neural Networks can beat nearly every other Machine Learning algorithm, and the disadvantages that go along with them. The biggest disadvantages are their "black box" nature, the increased duration of development (depending on your problem), the amount of data required, and that they are mostly computationally expensive.
In my opinion, Deep Learning is a little bit over-hyped at the moment, and the expectations exceed what can really be done with it right now. But that doesn't mean it is not useful. I think we live in a Machine Learning renaissance, because it is becoming more and more democratized, which enables more and more people to build useful products with it. There are a lot of problems out there that can be solved with Machine Learning, and I am sure this will happen in the next few years.
One of the major problems is that only a few people understand what can really be done with it and know how to build successful Data Science teams that bring real value to a company. On one hand, we have PhD-level engineers who are geniuses in regard to the theory behind Machine Learning but lack an understanding of the business side. On the other hand, we have CEOs and people in management positions who have no idea what can really be done with Deep Learning and think that it will solve all of the world's problems in the years to come. In my opinion, we need more people who bridge this gap, which will result in more products that are useful for our society.
Using the right evaluation metrics for your classification system is crucial. Otherwise, you could fall into the trap of thinking that your model performs well when in reality it doesn't. In this post, you will learn why it is trickier to evaluate classifiers, why a high classification accuracy is in most cases not as meaningful as it sounds, what the right evaluation metrics are, and when you should use them. You will also discover how you can create a classifier with virtually any precision you want.
Table of Contents:
Why is it important?
Precision and Recall
ROC AUC Curve and ROC AUC Score
Why is it important?
Evaluating a classifier is often much more difficult than evaluating a regression algorithm. A good example is the famous MNIST dataset, which contains images of handwritten digits from 0 to 9. If we wanted to build a classifier that detects 6s, an algorithm could classify every input as non-6 and still get 90% accuracy, because only about 10% of the images within the dataset are 6s. This is a major issue in Machine Learning and the reason why you need to look at several evaluation metrics for your classification system.
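This accuracy trap is easy to reproduce in a few lines (a toy sketch with made-up labels, not the actual MNIST data):

```python
# Toy illustration of the accuracy trap: a "classifier" that always
# predicts "not a 6" still reaches 90% accuracy when only 10% are 6s.
import numpy as np

y_true = (np.arange(1000) % 10 == 6)  # True for the 10% of samples that are 6s
y_pred = np.zeros(1000, dtype=bool)   # degenerate classifier: always "not a 6"

accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.9
```

The 90% accuracy hides that the classifier never detects a single 6, which is why we need the metrics below.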
First, you can take a look at the confusion matrix, which is also known as the error matrix. It is a table that describes the performance of a supervised Machine Learning model on the test data, for which the true values are known. In sklearn's convention, each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class (some texts use the opposite convention). It is called a "confusion matrix" because it makes it easy to spot where your system is confusing two classes.
Below you can see the output of the "confusion_matrix()" function of sklearn, used on the MNIST dataset.
Each row represents an actual class and each column represents a predicted class.
The first row is about non 6 images (the negative class), where 53459 images were correctly classified as non-6s (called true negatives). The remaining 623 images were wrongly classified as 6s (false positives).
The second row represents the actual 6 images: 473 were wrongly classified as non-6s (false negatives), and 5445 were correctly classified as 6s (true positives). Note that a perfect classifier would be right 100% of the time, which means it would have only true positives and true negatives.
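The four counts can be read straight out of the matrix (the values below are copied from the output shown above):

```python
# The confusion matrix from above, in sklearn's layout:
# rows = actual class, columns = predicted class.
import numpy as np

conf = np.array([[53459,  623],   # actual non-6s: TN, FP
                 [  473, 5445]])  # actual 6s:     FN, TP
tn, fp, fn, tp = conf.ravel()
print(tn, fp, fn, tp)  # 53459 623 473 5445
```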
Precision and Recall
A confusion matrix gives you a lot of information about how well your model does, but there's a way to get even more, like computing the classifier's precision. This is basically the accuracy of the positive predictions, and it is typically viewed together with the "recall", which is the ratio of positive instances that were correctly detected.
Fortunately, sklearn provides built-in functions to compute both of them:
Now we have a much better evaluation of our classifier. The precision tells us that when our model predicts an image to be a 6, it is correct about 89% of the time. The recall tells us that it detected 92% of the actual 6s. But there is still a better way!
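Both numbers can be reproduced by hand from the confusion-matrix counts above:

```python
# Precision and recall computed directly from the counts above.
tp, fp, fn = 5445, 623, 473

precision = tp / (tp + fp)  # accuracy of the positive predictions
recall = tp / (tp + fn)     # fraction of actual 6s that were detected

print(round(precision, 2), round(recall, 2))  # 0.9 0.92
```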
You can combine precision and recall into one single metric, called the F-score (also called the F1-score). The F-score is really useful if you want to compare 2 classifiers. It is computed using the harmonic mean of precision and recall, which gives much more weight to low values. As a result, a classifier will only get a high F-score if both recall and precision are high. You can easily compute the F-score with sklearn.
Below you can see that our model gets a 90% F-score:
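The score can be checked by hand as the harmonic mean of the precision and recall computed above, which lands close to the 90% reported:

```python
# F1 score: harmonic mean of precision and recall,
# computed from the confusion-matrix counts above.
tp, fp, fn = 5445, 623, 473
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = 2 * precision * recall / (precision + recall)
print(f1)
```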
But unfortunately, the F-score isn't the holy grail and has its tradeoffs: it favors classifiers that have similar precision and recall. This is a problem, because you sometimes want a high precision and sometimes a high recall. The thing is that increasing precision results in decreasing recall and vice versa. This is called the precision/recall tradeoff, and we will cover it in the next section.
To illustrate this tradeoff a little bit better, I will give you examples of when you want a high precision and when you want a high recall.
You probably want a high precision if you trained a classifier to detect videos that are suitable for children. This means you want a classifier that may reject a lot of videos that would actually be suitable for kids, but never shows you a video that contains adult content, so that it only shows safe ones (i.e. a high precision).
An example where you would need a high recall is a classifier that detects people who are trying to break into a building. It would be fine if the classifier had only 25% precision (which would result in some false alarms), as long as it has 99% recall and alerts you nearly every time someone tries to break in.
To understand this tradeoff even better, we will look at how the SGDClassifier makes its classification decisions on the MNIST dataset. For each image it has to classify, it computes a score based on a decision function, and it classifies the image as one class (when the score is bigger than the threshold) or another (when the score is smaller than the threshold).
The picture below shows digits ordered from the lowest score (left) to the highest score (right). Let's suppose you have a classifier that should detect 5s and the threshold is positioned in the middle of the picture (at the central arrow). Then you would spot 4 true positives (actual 5s) and one false positive (actually a 6) to the right of it. That threshold position would result in an 80% precision (4 out of 5), but out of the six actual 5s in the picture, it would only identify 4, so the recall would be 67% (4 out of 6). If you moved the threshold to the right arrow, you would get a higher precision but a lower recall, and vice versa if you moved it to the left arrow.
The trade-off between precision and recall can be observed using the precision-recall curve, and it lets you spot which threshold is the best.
Another way is to plot the precision and recall against each other:
In the image above you can clearly see that the recall falls off sharply at a precision of around 95%. Therefore you probably want to select a precision/recall tradeoff before that point – maybe at around 85%. With the two plots above, you are now able to choose a threshold that gives you the best precision/recall tradeoff for your current Machine Learning problem. If you want, for example, a precision of 85%, you can look at the first plot and see that you would need a threshold of around -50,000.
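A sketch of how you would pick such a threshold with sklearn (the decision scores below are synthetic stand-ins for the SGDClassifier's real scores, so the resulting threshold value is illustrative only):

```python
# Pick the lowest decision threshold that still reaches 85% precision,
# using synthetic scores instead of a real classifier's decision function.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.RandomState(0)
y_true = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(-1, 1, 500),  # negatives score low
                         rng.normal(1, 1, 500)])  # positives score high

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions[:-1] lines up with thresholds; take the first one >= 0.85.
idx = np.argmax(precisions[:-1] >= 0.85)
print(thresholds[idx], precisions[idx], recalls[idx])
```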
ROC AUC Curve and ROC AUC Score
The ROC curve is another tool to evaluate and compare binary classifiers. It has a lot of similarities with the precision/recall curve, although it is quite different. It plots the true positive rate (also called recall) against the false positive rate (ratio of incorrectly classified negative instances), instead of plotting the precision versus the recall.
Of course, there is a tradeoff here as well: the higher the true positive rate, the more false positives the classifier produces. The red line in the middle represents a purely random classifier, so your classifier's curve should stay as far away from it as possible.
The ROC curve also provides a way to compare two classifiers with each other, by measuring the area under the curve (called the AUC). This is the ROC AUC score. Note that a classifier that is 100% correct would have a ROC AUC of 1, while a completely random classifier would have a score of 0.5. Below you can see the output for the MNIST model:
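As a quick sketch of computing the score with sklearn (again with synthetic scores rather than the real MNIST model):

```python
# ROC AUC with sklearn: 1.0 is perfect, 0.5 is random guessing.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(-1, 1, 500),   # negatives score low
                         rng.normal(1, 1, 500)])   # positives score high

print(roc_auc_score(y_true, y_true))  # 1.0 for a perfect classifier
print(roc_auc_score(y_true, scores))  # between 0.5 and 1.0 for a decent one
```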
Keras is one of the most popular Deep Learning libraries out there at the moment and made a big contribution to the commoditization of artificial intelligence. It is simple to use and it enables you to build powerful Neural Networks in just a few lines of code. In this post, you will discover how you can build a Neural Network with Keras that predicts the sentiment of user reviews by categorizing them into two categories: positive or negative. This is called Sentiment Analysis and we will do it with the famous imdb review dataset. The model we will build can also be applied to other Machine Learning problems with just a few changes.
Note that we will not go into the details of Keras or Deep Learning. This post is intended to provide you with a blueprint of a Keras Neural Network and to make you familiar with its implementation.
Table of Contents:
What is Keras?
What is Sentiment Analysis?
The imdb Dataset
Import Dependencies and get the Data
Exploring the Data
Building and Training the Model
What is Keras?
Keras is an open-source Python library that enables you to easily build Neural Networks. The library is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, and MXNet. TensorFlow and Theano are the most-used numerical platforms in Python for building Deep Learning algorithms, but they can be quite complex and difficult to use. In comparison, Keras provides an easy and convenient way to build Deep Learning models. Its creator, François Chollet, developed it to enable people to build Neural Networks as quickly and easily as possible. He focused on extensibility, modularity, minimalism and Python support. Keras can be used with GPUs and CPUs, and it supports both Python 2 and 3. Keras made a big contribution to the commoditization of Deep Learning and artificial intelligence, since it made powerful, modern Deep Learning algorithms, which previously were not only inaccessible but also unusable, available to everyone.
What is Sentiment Analysis?
With Sentiment Analysis, we want to determine the attitude (i.e. the sentiment) of, for example, a speaker or writer with respect to a document, interaction, or event. It is therefore a natural language processing problem where text needs to be understood to predict the underlying intent. The sentiment is mostly categorized into positive, negative and neutral categories. With the use of Sentiment Analysis, we want to predict, for example, a customer's opinion and attitude about a product based on a review they wrote about it. Because of that, Sentiment Analysis is widely applied to things like reviews, surveys, documents and much more.
The imdb Dataset
The imdb sentiment classification dataset consists of 50,000 movie reviews from imdb users, labeled as either positive (1) or negative (0). The reviews are preprocessed, and each one is encoded as a sequence of word indexes in the form of integers. The words within the reviews are indexed by their overall frequency within the dataset, so, for example, the integer "2" encodes the second most frequent word in the data. The 50,000 reviews are split into 25,000 for training and 25,000 for testing. The dataset was created by researchers at Stanford University and published in a 2011 paper in which they achieved 88.89% accuracy. It was also used in the "Bag of Words Meets Bags of Popcorn" Kaggle competition.
Import Dependencies and get the Data
We start by importing the required dependencies to preprocess our data and to build our model.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from keras.utils import to_categorical
from keras import models
from keras import layers
We continue with downloading the imdb dataset, which is fortunately already built into Keras. Since we don’t want to have a 50/50 train test split, we will immediately merge the data into data and targets after downloading, so that we can do an 80/20 split later on.
from keras.datasets import imdb
(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)
Exploring the Data
Now we can start exploring the dataset:
print("Categories:", np.unique(targets))
print("Number of unique words:", len(np.unique(np.hstack(data))))
Categories: [0 1]
Number of unique words: 9998
length = [len(i) for i in data]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))
Average Review length: 234.75892
Standard Deviation: 173.0
You can see in the output above that the dataset is labeled into two categories, either 0 or 1, which represents the sentiment of the review. The whole dataset contains 9998 unique words and the average review length is 234 words, with a standard deviation of 173 words.
Below you see the first review of the dataset, which is labeled as positive (1). The code that follows uses the get_word_index() function to retrieve the dictionary mapping word indices back to the original words, so that we can read the review; every unknown word is replaced with a "#". Note that the indices are offset by 3, because the indices 0, 1, and 2 are reserved for "padding", "start of sequence", and "unknown".
index = imdb.get_word_index()
reverse_index = dict([(value, key) for (key, value) in index.items()])
decoded = " ".join([reverse_index.get(i - 3, "#") for i in data[0]])
print(decoded)
# this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert # is an amazing actor and now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also # to the two little boy's that played the # of norman and paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Now it is time to prepare our data. We vectorize every review into a 10,000-dimensional vector of zeros and ones: position i is set to one if word index i occurs in the review, and all other positions stay zero. We do this because every input to our neural network needs to have the same size, and we restricted the vocabulary to the 10,000 most frequent words when loading the data. We also transform the targets into floats.
def vectorize(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

data = vectorize(data)
targets = np.array(targets).astype("float32")
Now we split our data into a training and a testing set. The training set will contain 40,000 reviews and the testing set 10,000.
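In code, this split can be sketched as follows. The stand-in arrays at the top just let the snippet run on its own; in the walkthrough, `data` and `targets` are the 50,000 vectorized reviews and their labels from the previous step.

```python
import numpy as np

# Stand-ins so this snippet runs on its own; in the walkthrough, `data`
# and `targets` are the 50,000 vectorized reviews and their labels.
data = np.zeros((50000, 10), dtype="float32")
targets = np.zeros((50000,), dtype="float32")

# 80/20 split: the first 10,000 reviews become the test set,
# the remaining 40,000 the training set
test_x = data[:10000]
test_y = targets[:10000]
train_x = data[10000:]
train_y = targets[10000:]

print(train_x.shape[0], test_x.shape[0])  # 40000 10000
```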
Then we simply add the input, hidden, and output layers. Between them, we use dropout to prevent overfitting; a dropout rate between 20% and 50% is usually a good choice. At every layer, we use "Dense", which means that the units are fully connected. Within the hidden layers, we use the relu function, because it is a good default and yields satisfactory results most of the time. Feel free to experiment with other activation functions. At the output layer, we use the sigmoid function, which maps values to the range between 0 and 1. Note that we set the input shape to 10,000 at the input layer, because our vectorized reviews are 10,000-dimensional. The input layer takes input of size 10,000 and outputs it with a shape of 50.
Lastly, we let Keras print a summary of the model we have just built.
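The layers described above can be sketched as follows, assuming Keras is installed; the layer sizes and dropout rates follow the description in the text:

```python
from keras import models
from keras import layers

model = models.Sequential()
# Input-layer: takes the 10,000-dimensional review vectors, outputs 50 units
model.add(layers.Dense(50, activation="relu", input_shape=(10000,)))
# Hidden-layers with dropout in between to prevent overfitting
model.add(layers.Dropout(0.3))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(50, activation="relu"))
# Output-layer: a single sigmoid unit mapping to a value between 0 and 1
model.add(layers.Dense(1, activation="sigmoid"))

# Print a summary of the model we have just built
model.summary()
```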
Now we need to compile our model, which simply means configuring it for training. We use the "adam" optimizer, the algorithm that changes the weights and biases during training. We also choose binary crossentropy as the loss (because we are dealing with binary classification) and accuracy as our evaluation metric.
model.compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = ["accuracy"]
)
We are now able to train our model. We do this with a batch size of 500 and only two epochs, because I recognized that the model overfits if we train it longer. The batch size defines the number of samples that are propagated through the network at once, and an epoch is one iteration over the entire training data. In general, a larger batch size results in faster training but does not always converge as quickly. A smaller batch size trains more slowly but can converge faster. This is definitely problem-dependent, and you need to try out a few different values. If you tackle a problem for the first time, I would recommend first using a batch size of 32, which is the standard size.
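The training call described above, sketched end-to-end on small random stand-in data so it runs on its own; on the real data you would pass the `train_x`/`test_x` split from earlier, and the stand-in model here is smaller than the one described in the text:

```python
import numpy as np
from keras import models
from keras import layers

# Random stand-ins for the vectorized reviews and labels
train_x = np.random.randint(0, 2, size=(200, 10000)).astype("float32")
train_y = np.random.randint(0, 2, size=(200,)).astype("float32")
test_x = np.random.randint(0, 2, size=(50, 10000)).astype("float32")
test_y = np.random.randint(0, 2, size=(50,)).astype("float32")

# A smaller stand-in model for illustration
model = models.Sequential()
model.add(layers.Dense(50, activation="relu", input_shape=(10000,)))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Two epochs with a batch size of 500, as described above
results = model.fit(
    train_x, train_y,
    epochs=2,
    batch_size=500,
    validation_data=(test_x, test_y),
)
# results.history holds the loss and accuracy per epoch
print(sorted(results.history.keys()))
```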
Awesome! With this simple model, we already beat the accuracy of the 2011 paper that I mentioned at the beginning. Feel free to experiment with the hyperparameters and the number of layers.
You can see the code for the whole model below:
import numpy as np
from keras.utils import to_categorical
from keras import models
from keras import layers
from keras.datasets import imdb

(training_data, training_targets), (testing_data, testing_targets) = imdb.load_data(num_words=10000)
data = np.concatenate((training_data, testing_data), axis=0)
targets = np.concatenate((training_targets, testing_targets), axis=0)

def vectorize(sequences, dimension = 10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results

data = vectorize(data)
targets = np.array(targets).astype("float32")

test_x = data[:10000]
test_y = targets[:10000]
train_x = data[10000:]
train_y = targets[10000:]

model = models.Sequential()
# Input - Layer
model.add(layers.Dense(50, activation = "relu", input_shape=(10000, )))
# Hidden - Layers
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
model.add(layers.Dropout(0.2, noise_shape=None, seed=None))
model.add(layers.Dense(50, activation = "relu"))
# Output - Layer
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()
# compiling the model
model.compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = ["accuracy"]
)
# training the model
results = model.fit(
    train_x, train_y,
    epochs = 2,
    batch_size = 500,
    validation_data = (test_x, test_y)
)
# the history key is "val_accuracy" in newer Keras versions
print("Test-Accuracy:", np.mean(results.history["val_acc"]))
In this post, you learned what sentiment analysis is and why Keras is one of the most widely used deep learning libraries. On top of that, you learned that Keras made a big contribution to the commoditization of deep learning and artificial intelligence. You learned how to build a simple neural network with six layers that can predict the sentiment of movie reviews with about 89% accuracy. You can now use this model to do binary sentiment analysis on other sources of text as well, but you need to vectorize them to a length of 10,000, or change the input shape of the input layer. You can also apply this model to other related machine learning problems with only a few changes.