In this tutorial, I’ll explain how to use the NumPy random seed function, which is also called np.random.seed or numpy.random.seed.
The function itself is extremely easy to use.
However, the reason that we need to use it is a little complicated. To understand why we need to use NumPy random seed, you actually need to know a little bit about pseudo-random numbers.
That being the case, this tutorial will first explain the basics of pseudo-random numbers, and will then move on to the syntax of numpy.random.seed itself.
Contents
The tutorial is divided up into several different sections.
You can click on any of the above links, and it will take you directly to that section.
However, I strongly recommend that you read the whole tutorial.
As I said earlier, numpy.random.seed is very easy to use, but it’s not that easy to understand. Understanding why we use it requires some background. That being the case, it’s much better if you actually read the tutorial.
Ok … let’s get to it.
NumPy random seed is for pseudo-random numbers in Python
So what exactly is NumPy random seed?
NumPy random seed is simply a function that sets the random seed of the NumPy pseudo-random number generator. It provides an essential input that enables NumPy to generate pseudo-random numbers for random processes.
Does that make sense? Probably not.
Unless you have a background in computing and probability, what I just wrote is probably a little confusing.
Honestly, in order to understand “seeding a random number generator” you need to know a little bit about pseudo-random numbers.
That being the case, let me give you a quick introduction to them …
A quick introduction to pseudo-random numbers
Here, I want to give you a very quick overview of pseudo-random numbers and why we need them.
Once you understand pseudo-random numbers, numpy.random.seed will make more sense.
WTF is a pseudo-random number?
At the risk of being a bit of a smart-ass, I think the name “pseudo-random number” is fairly self explanatory, and it gives us some insight into what pseudo-random numbers actually are.
Let’s just break down the name a little.
A pseudo-random number is a number. A number that’s sort-of random. Pseudo-random.
So essentially, a pseudo-random number is a number that’s almost random, but not really random.
It might sound like I’m being a bit sarcastic here, but that’s essentially what they are. Pseudo-random numbers are numbers that appear to be random, but are not actually random.
In the interest of clarity though, let’s see if we can get a definition that’s a little more precise.
The prefix pseudo- is used to distinguish this type of number from a “truly” random number generated by a random physical process such as radioactive decay.
Got that? Pseudo-random numbers are computer generated numbers that appear random, but are actually predetermined.
I think that these definitions help quite a bit, and they are a great starting point for understanding why we need them.
Why we need pseudo-random numbers
I swear to god, I’m going to bring this back to NumPy soon.
But, we still need to understand why pseudo-random numbers are required.
Really. Just bear with me. This will make sense soon.
A problem … computers are deterministic, not random
There’s a fundamental problem when using computers to simulate or work with random processes.
Computers are completely deterministic, not random.
Setting aside some rare exceptions, computers are deterministic by their very design. To quote an article at MIT’s School of Engineering “if you ask the same question you’ll get the same answer every time.”
Another way of saying this is that if you give a computer a certain input, it will precisely follow instructions to produce an output.
… And if you later give a computer the same input, it will produce the same output.
If the input is the same, then the output will be the same.
THAT’S HOW COMPUTERS WORK.
The behavior of computers is deterministic …
Essentially, the behavior of computers is NOT random.
This introduces a problem: how can you use a non-random machine to produce random numbers?
Pseudo-random numbers are generated by algorithms
Computers solve the problem of generating “random” numbers the same way that they solve essentially everything: with an algorithm.
Computer scientists have created a set of algorithms for creating pseudo-random numbers, called “pseudo-random number generators.”
These algorithms can be executed on a computer.
As such, they are completely deterministic. However, the numbers that they produce have properties that approximate the properties of random numbers.
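To make the idea concrete, here is a minimal sketch of one classic pseudo-random number generator, a linear congruential generator. This is only an illustration: the function name lcg and the constants are assumptions chosen for the example, and this is not the algorithm NumPy uses (NumPy's legacy generator is based on the Mersenne Twister).

```python
def lcg(seed, count, a=1664525, c=1013904223, m=2**32):
    """Return `count` pseudo-random floats in [0, 1) from a linear congruential generator.

    Illustrative only: the name and constants are assumptions for this sketch.
    """
    state = seed
    values = []
    for _ in range(count):
        # Each new state is computed from the previous one by plain arithmetic,
        # so the whole sequence is fixed once the seed is chosen.
        state = (a * state + c) % m
        values.append(state / m)
    return values


first_run = lcg(seed=1, count=5)
second_run = lcg(seed=1, count=5)
other_seed = lcg(seed=2, count=5)

print(first_run == second_run)  # True: same seed, same numbers
print(first_run == other_seed)  # False: different seed, different numbers
```

The numbers look scattered, but nothing here is random. The seed is the only input, and it fully determines the output.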
Pseudo-random numbers appear to be random
That is to say, the numbers generated by pseudo-random number generators appear to be random.
Even though the numbers are completely determined by the algorithm, when you examine them, there is typically no discernible pattern.
For example, here we’ll create some pseudo-random numbers with the NumPy randint function:
np.random.seed(1)
np.random.randint(low = 1, high = 10, size = 50)
I can assure you though, that these numbers are not random, and are in fact completely determined by the algorithm. If you run the same code again, you’ll get the exact same numbers.
Pseudo-random numbers can be re-created exactly
Importantly, because pseudo-random number generators are deterministic, they are also repeatable.
What I mean is that if you run the algorithm with the same input, it will produce the same output.
So you can use pseudo-random number generators to create and then re-create the exact same set of pseudo-random numbers.
Let me show you.
Generate pseudo-random integers
Here, we’ll create an array of 5 pseudo-random integers between 0 and 9 using numpy.random.randint.
(And notice that we’re using np.random.seed here)
np.random.seed(0)
np.random.randint(10, size = 5)
This produces the following output:
array([5, 0, 3, 3, 7])
Simple. The algorithm produced an array with the values [5, 0, 3, 3, 7].
Generate pseudo-random integers again
Ok.
Now, let’s run the same code again.
… and notice that we’re using np.random.seed in exactly the same way …
np.random.seed(0)
np.random.randint(10, size = 5)
OUTPUT:
array([5, 0, 3, 3, 7])
Well take a look at that …
The. numbers. are. the. same.
We ran the exact same code, and it produced the exact same output.
I will repeat what I said earlier: pseudo-random number generators produce numbers that look random, but are 100% determined.
Determined how though?
Remember what I wrote earlier: computers and algorithms process inputs into outputs. The outputs of computers depend on the inputs.
So just like any output produced by a computer, pseudo-random numbers are dependent on the input.
THIS is where numpy.random.seed comes in …
The numpy.random.seed function provides the input (i.e., the seed) to the algorithm that generates pseudo-random numbers in NumPy.
How and why we use NumPy random seed
Ok, you got this far.
You’re ready now.
Now you can learn about NumPy random seed.
numpy.random.seed provides an input to the pseudo-random number generator
What I wrote in the previous section is critical.
The “random” numbers generated by NumPy are not exactly random. They are pseudo-random … they approximate random numbers, but are 100% determined by the input and the pseudo-random number algorithm.
The np.random.seed function provides an input for NumPy’s pseudo-random number generator. (It does not affect the separate random module in Python’s standard library.)
That’s all the function does!
It allows you to provide a “seed” value to NumPy’s random number generator.
We use numpy.random.seed in conjunction with other numpy functions
Importantly, numpy.random.seed doesn’t exactly work all on its own.
The numpy.random.seed function works in conjunction with other functions from NumPy.
Specifically, numpy.random.seed works with other functions from the numpy.random namespace.
So for example, you might use numpy.random.seed along with numpy.random.randint. This will enable you to create random integers with NumPy.
In fact, there are several dozen NumPy random functions that enable you to generate random numbers, random samples, and samples from specific probability distributions.
I’ll show you a few examples of some of these functions in the examples section of this tutorial.
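As a rough sketch of how the seed pairs with a few of these functions, the snippet below seeds the generator once and then draws from a uniform distribution, a normal distribution, and a random permutation. The seed value and the sizes are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only for illustration

# Three draws from a uniform distribution on [0, 1)
uniform_draws = np.random.uniform(low=0.0, high=1.0, size=3)

# Three draws from a standard normal distribution
normal_draws = np.random.normal(loc=0.0, scale=1.0, size=3)

# A random ordering of the integers 0 through 4
shuffled = np.random.permutation(5)

print(uniform_draws)
print(normal_draws)
print(shuffled)
```

All three functions read from the same underlying generator, so a single call to np.random.seed at the top is enough to make the whole script repeatable.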
NumPy random seed is deterministic
Remember what I said earlier in this tutorial …. pseudo-random number generators are completely deterministic. They operate by algorithm.
What this means is that if you provide the same seed, you will get the same output.
And if you change the seed, you will get a different output.
The output that you get depends on the input that you give it.
The important thing about using a seed for a pseudo-random number generator is that it makes the code repeatable.
Remember what I said earlier?
… pseudo-random number generators operate by a deterministic process.
If you give a pseudo-random number generator the same input, you’ll get the same output.
This can actually be a good thing!
There are times when you really want your “random” processes to be repeatable.
Code that has well defined, repeatable outputs is good for testing.
Essentially, we use NumPy random seed when we need to generate pseudo-random numbers in a repeatable way.
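Here is a small sketch of what that looks like in a testing context. The helper simulate_dice_rolls is hypothetical (it is not part of NumPy); it just seeds the generator and then rolls a six-sided die a few times.

```python
import numpy as np


def simulate_dice_rolls(n_rolls, seed):
    """Roll a six-sided die n_rolls times, seeding NumPy first (hypothetical helper)."""
    np.random.seed(seed)
    # high is exclusive, so high=7 gives values from 1 to 6
    return np.random.randint(low=1, high=7, size=n_rolls)


rolls_a = simulate_dice_rolls(n_rolls=10, seed=123)
rolls_b = simulate_dice_rolls(n_rolls=10, seed=123)

# Because the seed is the same, a test can rely on the two runs matching exactly
assert np.array_equal(rolls_a, rolls_b)
```

Without the seed, a check like this could pass on one run and fail on the next, which makes a test close to useless.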
NumPy random seed makes your code easier to share
The fact that np.random.seed makes your code repeatable also makes it easier to share.
Take for example the tutorials that I post here at Sharp Sight.
I post detailed tutorials about how to perform various data science tasks, and I show how code works, step by step.
When I do this, it’s important that people who read the tutorials and run the code get the same result. If a student reads the tutorial, and copy-and-pastes the code exactly, I want them to get the exact same result. This just helps them check their work! If they type in the code exactly as I show it in a tutorial, getting the exact same result gives them confidence that they ran the code properly.
Again, in order to get repeatable results when we are using “random” functions in NumPy, we need to use numpy.random.seed.
Ok … now that you understand what NumPy random seed is (and why we use it), let’s take a look at the actual syntax.
The syntax of NumPy random seed
The syntax of NumPy random seed is extremely simple.
There’s essentially only one parameter, and that is the seed value.
So essentially, to use the function, you just call the function by name and then pass in a “seed” value inside the parentheses.
Note that in this syntax explanation, I’m using the abbreviation “np” to refer to NumPy. This is a common convention, but it requires you to import NumPy with the code “import numpy as np.” I’ll explain more about this soon in the examples section.
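In code, the call looks roughly like this. The variable name seed_value and the number 42 are placeholders for illustration; for an integer seed, NumPy expects a value from 0 to 2**32 - 1.

```python
import numpy as np

seed_value = 42  # placeholder; any integer from 0 to 2**32 - 1 works

# Set the seed of NumPy's pseudo-random number generator
np.random.seed(seed_value)
```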
Examples of numpy.random.seed
Let’s take a look at some examples of how and when we use numpy.random.seed.
Before we look at the examples though, you’ll have to run some code.
Run this code first
To get the following examples to run properly, you’ll need to import NumPy with the appropriate “nickname.”
You can do that by executing the following code:
import numpy as np
Running this code will enable us to use the alias np in our syntax to refer to numpy.
This is a common convention in NumPy. When you read NumPy code, it is extremely common to see NumPy referred to as np. If you’re a beginner you might not realize that you need to import NumPy with the code import numpy as np, otherwise the examples won’t work properly!
Now that we’ve imported NumPy properly, let’s start with a simple example. We’ll generate a single random number between 0 and 1 using NumPy random random.
Generate a random number with numpy.random.random
Here, we’re going to use NumPy to generate a random number between zero and one. To do this, we’re going to use the NumPy random random function (AKA, np.random.random).
Ok, here’s the code:
np.random.seed(0)
np.random.random()
OUTPUT:
0.5488135039273248
Note that the output is a float. It’s a decimal number between 0 and 1.
For the record, we can essentially treat this number as a probability. We can think of the np.random.random function as a tool for generating probabilities.
Rerun the code
Now that I’ve shown you how to use np.random.random, let’s just run it again with the same seed.
Here, I just want to show you what happens when you use np.random.seed before running np.random.random.
np.random.seed(0)
np.random.random()
OUTPUT:
0.5488135039273248
Notice that the number is exactly the same as the first time we ran the code.
Essentially, if you execute a NumPy function with the same seed, you’ll get the same result.
Generate a random integer with numpy.random.randint
Next, we’re going to use np.random.seed to set the number generator before using NumPy random randint.
Essentially, we’re going to use NumPy to generate 5 random integers between 0 and 99.
np.random.seed(74)
np.random.randint(low = 0, high = 100, size = 5)
OUTPUT:
array([30, 91, 9, 73, 62])
This is pretty simple.
NumPy random seed sets the seed for the pseudo-random number generator, and then NumPy random randint selects 5 numbers between 0 and 99.
Run the code again
Let’s just run the code so you can see that it reproduces the same output if you have the same seed.
np.random.seed(74)
np.random.randint(low = 0, high = 100, size = 5)
OUTPUT:
array([30, 91, 9, 73, 62])
Once again, as you can see, the code produces the same integers if we use the same seed. As noted previously in the tutorial, NumPy random randint doesn’t exactly produce “random” integers. It produces pseudo-random integers that are completely determined by numpy.random.seed.
Select a random sample from an input array
It’s also common to use the NumPy random seed function when you’re doing random sampling.
Specifically, if you need to generate a reproducible random sample from an input array, you’ll need to use numpy.random.seed.
Let’s take a look.
Here, we’re going to use numpy.random.seed before we use numpy.random.choice. The NumPy random choice function will then create a random sample from a list of elements.
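A minimal sketch of this step is below. The seed of 0 and the sample size of 5 are assumptions made for illustration; the input list is the numbers 1 to 6.

```python
import numpy as np

np.random.seed(0)  # assumed seed value, chosen for illustration

# Draw 5 values from the input list (size=5 is an assumption);
# np.random.choice samples with replacement by default
sample = np.random.choice(a=[1, 2, 3, 4, 5, 6], size=5)

print(sample)
```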
As you can see, we’ve basically generated a random sample from the list of input elements … the numbers 1 to 6.
In the output, you can see that some of the numbers are repeated. This is because np.random.choice is using random sampling with replacement. For more information about how to create random samples, you should read our tutorial about np.random.choice.
Rerun the code
Let’s quickly re-run the code.
I want to re-run the code just so you can see, once again, that the primary reason we use NumPy random seed is to create results that are completely repeatable.
Ok, here is the exact same code that we just ran (with the same seed).
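As a sketch, here the sampling step is run twice with an identical seed and the results are compared directly. The seed of 0 and the sample size of 5 are assumed values for illustration.

```python
import numpy as np

# First run (seed and size are assumed values)
np.random.seed(0)
first_sample = np.random.choice(a=[1, 2, 3, 4, 5, 6], size=5)

# Second run: reset to the same seed, then run the same sampling code
np.random.seed(0)
second_sample = np.random.choice(a=[1, 2, 3, 4, 5, 6], size=5)

print(np.array_equal(first_sample, second_sample))  # True
```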
Once again, we used the same seed, and this produced the same output.
Frequently asked questions about np.random.seed
Now that we’ve taken a look at some examples of using NumPy random seed to set a random seed in Python, I want to address some frequently asked questions.
What does np.random.seed(0) do?
Dude. I just wrote 2000 words explaining what the np.random.seed function does … which basically explains what np.random.seed(0) does.
Ok, ok … I get it. You’re probably in a hurry and just want a quick answer.
I’ll summarize.
We use np.random.seed when we need to generate random numbers or mimic random processes in NumPy.
Computers are generally deterministic, so it’s very difficult to create truly “random” numbers on a computer. Computers get around this by using pseudo-random number generators.
These pseudo-random number generators are algorithms that produce numbers that appear random, but are not really random.
In order to work properly, pseudo-random number generators require a starting input. We call this starting input a “seed.”
The code np.random.seed(0) enables you to provide a seed (i.e., the starting input) for NumPy’s pseudo-random number generator.
NumPy then uses the seed and the pseudo-random number generator in conjunction with other functions from the numpy.random namespace to produce certain types of random outputs.
Ultimately, creating pseudo-random numbers this way leads to repeatable output, which is good for testing and code sharing.
Having said all of that, to really understand numpy.random.seed, you need to have some understanding of pseudo-random number generators.
You can use numpy.random.seed(0), or numpy.random.seed(42), or any other number.
For the most part, the number that you use inside of the function doesn’t really make a difference.
You just need to understand that using different seeds will cause NumPy to produce different pseudo-random numbers. The output of a numpy.random function will depend on the seed that you use.
Here’s a quick example. We’re going to use NumPy random seed in conjunction with NumPy random randint to create a set of integers between 0 and 98 (the upper bound of 99 that we pass to randint is exclusive).
In the first example, we’ll set the seed value to 0.
np.random.seed(0)
np.random.randint(99, size = 5)
Which produces the following output:
array([44, 47, 64, 67, 67])
Basically, np.random.randint generated an array of 5 integers between 0 and 98. Note that if you run this code again with the exact same seed (i.e. 0), you’ll get the same integers from np.random.randint.
Next, let’s run the code with a different seed.
np.random.seed(1)
np.random.randint(99, size = 5)
OUTPUT:
array([37, 12, 72, 9, 75])
Here, the code for np.random.randint is exactly the same … we only changed the seed value. Here, the seed is 1.
With a different seed, NumPy random randint created a different set of integers. Everything else is the same. The code for np.random.randint is the same. But with a different seed, it produces a different output.
Ultimately, I want you to understand that the output of a numpy.random function depends on the value you pass to np.random.seed, but the choice of seed value is sort of arbitrary.
Do I always need to use numpy random seed?
The short answer is, no.
If you use a function from the numpy.random namespace (like np.random.randint, np.random.normal, etc) without using NumPy random seed first, NumPy will still seed its generator in the background. NumPy will generate a seed value from a part of your computer system (like /dev/urandom on a Unix or Linux machine).
So essentially, if you don’t set a seed with numpy.random.seed, NumPy will set one for you.
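You can see this behavior with a short sketch. Calling np.random.seed with no argument asks NumPy to re-seed itself from the operating system, which is roughly the situation you are in when you never set a seed at all. The variable names are just for illustration.

```python
import numpy as np

# No argument: NumPy pulls a fresh seed from the operating system
np.random.seed()
unseeded_a = np.random.random(size=5)

# Re-seed from the operating system again and draw once more
np.random.seed()
unseeded_b = np.random.random(size=5)

# With fresh seeds each time, the two arrays will almost certainly differ
print(np.array_equal(unseeded_a, unseeded_b))
```

This is why code without an explicit seed gives different numbers every time you run it.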
In this tutorial, I’ll show you how to make a simple matplotlib line chart. Essentially, I’ll show you how to use the plt.plot function from pyplot to create a line chart.
Line charts are a little confusing in Python
I’ll be honest. Creating a line chart in Python is a little confusing to beginners.
If you’ve been trying to create a decent line chart in Python and just found yourself confused, don’t worry. Many beginners feel a little confused.
Part of the problem is that the tools for creating data visualizations in Python are not as well designed as some modern tools like ggplot in R. If you’ve come from R, you might find that creating a line chart is actually more challenging in Python.
Another issue is that many of the examples online for how to make a line chart with matplotlib are bad. Many of the examples are either out of date, or more complex than they need to be.
Those things being the case, this blog post will try to clear up some of the confusion and introduce you to some basic syntax to get you started.
The contents of this tutorial
Although this blog post won’t show you everything about data visualization with matplotlib, it will show you some of the essential tools so you can make a basic line chart. It will give you a foundation that you can build on as you continue to learn.
The tutorial has several different sections that will help you understand creating line charts with pyplot.
If you need help with something specific, you can click on one of the links. The links will take you directly to the relevant section within this blog post.
On the other hand, if you’re just getting started with data visualization in Python, it’s probably a good idea to read the entire blog post. Instead of just trying to copy and paste some code, it’s good to read through everything so you know how it all works.
A quick introduction to matplotlib
Before we get started actually creating line charts, let’s talk about matplotlib first.
If you’re just getting started with data science in Python, you’ve probably heard about matplotlib, but you might not know what it is.
What is matplotlib?
Matplotlib is a module for Python that focuses on plotting and data visualization. It’s very flexible and it provides you with tools for creating almost any data visualization you can think of.
On the other hand, it was initially released in 2003, and some of the techniques for creating visualizations feel out of date.
Specifically, the syntax for matplotlib is a little “low level” in some cases, and this can make it difficult to use for many beginners.
However, one thing that can make matplotlib easier to use is the pyplot sub-module.
What is pyplot?
Pyplot is part of matplotlib … it is a sub-module within the overall matplotlib module.
The pyplot sub-module provides a set of “convenience functions” for creating common data visualizations and performing common data visualization tasks. Essentially, pyplot provides a set of relatively simple tools for creating common charts like the bar chart, scatter plot, and line chart.
Pyplot still isn’t perfect (it can still be a little confusing to beginners), but it simplifies the process of creating some data visualizations in Python.
Now that you know a little more about matplotlib and pyplot, let’s examine the syntax to create a line chart.
The syntax of the matplotlib line chart
To create a line chart with pyplot, you typically will use the plt.plot function.
The name of the function itself often confuses beginners, because many of the other functions in pyplot have names that directly relate to the chart that they create. For example, you create a bar chart in pyplot by using the plt.bar function. You create histograms by using the plt.hist function. And you create scatter plots in matplotlib by using the plt.scatter function.
You’d think that to create a line chart, there would be a function called “plt.line()“, right?
No. That’s not how you create a line chart with pyplot.
To create a matplotlib line chart, you need to use the vaguely named plt.plot() function.
That being said, let’s take a look at the syntax.
The plt.plot function has a lot of parameters … a couple dozen in fact.
But here in this tutorial we’re going to simplify things and just focus on a few: x, y, color, and linewidth.
I want to focus on these parameters because they are the ones you will probably use most often. Also, by focusing down on a few, you can make it easier to learn the syntax. If you’re just getting started, you really need to simplify things as much as possible until you learn and memorize the basics. Once you learn the basics, then make things more complex.
Ok. Let me explain the parameters I mentioned, one at a time.
The basic parameters of plt.plot
Here, I’ll explain four important parameters of the plt.plot function: x, y, color, and linewidth.
y
The y parameter allows you to specify the y axis coordinates of the points along the line you want to draw.
Here’s a very simple example. The following line has been created by connecting four points. The y axis coordinates of these points are at 2, 5, 4, and 8.
The plt.plot function basically takes those points and connects them with line segments. That’s what the function does.
We tell plt.plot the position of those points by passing data to the y parameter.
Typically, we will pass data to this parameter in the form of an array or an array-like object. You can use a Python list or similar objects like NumPy arrays.
Keep in mind, the y parameter is required.
I’ll show you exactly how to use this parameter in the examples section of this tutorial.
x
The x parameter is similar to the y parameter.
Essentially, the x parameter enables you to supply the x axis positions of the points on the line.
So let’s take another look at the example we saw in the last section:
Here, the line is made up of segments that connect four points.
The points are at locations 1, 2, 3, and 4 on the x axis.
We tell the plt.plot function these x axis locations by using the x parameter.
Typically, we’ll supply these x axis positions in the form of a Python list. More broadly though, we can supply the x axis positions in the form of any array-like object … a list, a NumPy array, etc.
Keep in mind that the x parameter is optional. That means that although you need to supply values for the y parameter, you do not need to supply values for the x parameter. If you don’t provide any data to the x parameter, matplotlib will assume that the x axis positions are [0, 1, 2, ... n - 1], if you have n points. Basically, the x axis positions will just be 0 to n – 1.
Here in this tutorial, we are mostly going to omit the arguments to the x parameter.
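As a quick sketch of the four-point line described above, the snippet below draws it twice: once with explicit x axis positions, and once with only the y values so that matplotlib fills in the x positions itself. The variable names are just for illustration. Note that plt.plot takes these values positionally, with x first and y second.

```python
import matplotlib.pyplot as plt

x_positions = [1, 2, 3, 4]
y_positions = [2, 5, 4, 8]

# Explicit x and y: passed positionally, x first, then y
explicit_line = plt.plot(x_positions, y_positions)[0]

# Only one argument: matplotlib treats it as y and uses 0 to n - 1 for x
implicit_line = plt.plot(y_positions)[0]

# Inspect the x axis positions matplotlib generated for the second line
print(implicit_line.get_xdata())
```

plt.plot returns a list of line objects, which is why the sketch takes element [0] to get at a single line.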
color
The color parameter does what you probably expect that it does … it changes the color of the line.
There are a few ways to define the color that you want to use and the easiest way is to use a “named” color. Named colors are colors like “red”, “blue”, “yellow”, and so on. Python and matplotlib recognize several dozen “named” colors. They aren’t limited to the simple colors that we commonly talk about, but there are colors like “crimson”, “wheat”, “lavender”, and more. It’s a good idea to become familiar with a few of the named colors.
Having said that, I strongly prefer to use hexadecimal colors in my data visualizations. Hex colors allow for a lot more flexibility and they allow you to customize your plots to a much larger degree. Essentially, with hex colors, you can “mix your own” colors.
On the other hand, although hex colors allow for more flexibility, they are harder to use. You’ll also need to learn about how hexadecimal numbers work in order to really understand hex colors.
Given that hex colors are a little more complicated we’re not really going to cover them here. I’ll explain hex colors in a future blog tutorial.
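Just so you can recognize both forms when you see them, here is a tiny sketch that draws one line with a named color and one with a hex color string. The data and the specific hex value are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

y_positions = [2, 5, 4, 8]  # arbitrary example data

# A named color
named_line = plt.plot(y_positions, color="crimson")[0]

# A hex color string (this particular value is an arbitrary choice)
hex_line = plt.plot(y_positions, color="#2a9d8f")[0]

print(named_line.get_color())
print(hex_line.get_color())
```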
linewidth
The linewidth parameter is also fairly self explanatory. It controls the width of the line that’s plotted.
I’ll show you an example in the examples section below to show you how to use this to increase or decrease the width of the plotted line.
Examples: how to make a line chart plot in matplotlib
Now that we’ve gone over a few of the important parameters of the plt.plot function, let’s look at some concrete examples of how to use the plt.plot function.
Here, I’ll show you a simple example of how to use the function, and I’ll also show you individual examples of how to use the parameters that I explained earlier in this tutorial.
Run this code before you get started
Before you start working with the examples themselves, you need to run some code.
Import modules
First, you need to run some code to import a few Python modules. You need to import the pyplot submodule of matplotlib, and the pandas module, which we’ll use to load the data. You also need to import the seaborn module. We’ll be using that later to do some formatting.
# IMPORT MODULES
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
Notice that we’re importing these modules with different names. For example, we’re importing the pyplot module as plt. We’re importing the seaborn module as sns. We’re essentially giving these modules “nicknames” … these are aliases that we can use to simplify and shorten our code. You’ll see these later as we call the functions from pyplot and seaborn.
Create dataset
After you import the modules, you’ll need to get the data that we’re going to use.
For these examples, we’re going to use stock price data from the company Tesla, Inc. The data is from the IPO in June of 2010 to the fall of 2018.
# GET DATA FROM TXT FILE
tsla_stock_data = pd.read_csv("https://www.sharpsightlabs.com/datasets/TSLA_start-to-2018-10-26_CLEAN.txt")
#--------------------
# EXTRACT CLOSE PRICE
#--------------------
tsla_close_price = tsla_stock_data.close_price
As noted above, most of the parameters that we’re going to work with require you to provide a sequence of values. Here, we’ve imported the data using the read_csv() function from pandas, and then extracted one variable, tsla_close_price. The way that we’ve extracted this data, the tsla_close_price is actually a pandas Series.
Having said that, the plt.plot() function can also operate on Python lists, tuples, and array-like objects.
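Here is a short sketch of that flexibility. The numbers are made up for illustration (they are not real Tesla prices), and the same values are passed to plt.plot as a list, a tuple, and a NumPy array.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for illustration, not real stock prices
prices_list = [17.0, 19.5, 18.2, 21.0]
prices_tuple = tuple(prices_list)
prices_array = np.array(prices_list)

# plt.plot accepts each of these array-like objects
line_from_list = plt.plot(prices_list)[0]
line_from_tuple = plt.plot(prices_tuple)[0]
line_from_array = plt.plot(prices_array)[0]

# All three lines carry the same y values
print(np.allclose(line_from_list.get_ydata(), line_from_array.get_ydata()))  # True
```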
A quick note about learning and practice
In the following examples, we’re going to keep things very simple.
This is a general principle that you should remember when you’re learning a new programming language or skill. Start simple. Break everything down and isolate individual techniques.
Once you’ve broken down the individual techniques, study them and practice them.
Then, after you’ve mastered the basic techniques, you can start to combine those techniques into more complicated structures.
Start simple and then increase the complexity.
With that in mind, let’s start to look at a few very simple examples of how to make a line chart with matplotlib.
How to make a simple line chart with matplotlib
For our first example, we’re going to start very simple. This will be as simple as it gets.
We’re basically going to plot our Tesla stock data with plt.plot.
To do this, we’ll call the plt.plot() function with the tsla_close_price data as the only argument.
#-----------------
# SIMPLE LINE PLOT
#-----------------
plt.plot(tsla_close_price)
And here is the output:
There’s nothing fancy about this, but it’s a decent rough draft, and it’s easy to understand.
Let’s break it down.
We’ve called the plt.plot() function. Inside of the function, we see the data set name tsla_close_price, which is the daily closing price of Tesla stock from June of 2010 to the fall of 2018.
Notice that we didn’t explicitly refer to any of the parameters. You’ll often see this in Python code. It’s very common for Python programmers to leave the names of the parameters out of the syntax.
So which parameter is being used here?
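The code is implicitly using the y parameter. When you supply a single argument to the plt.plot function, the function assumes that the argument you supply should be connected to the y parameter. This is effectively like setting y = tsla_close_price, although in real code you pass the data positionally, because plt.plot takes x and y as positional arguments rather than as keywords named x and y.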
With that in mind, you can understand what this plot shows. The y axis essentially shows the value of the closing price on any given day. Each observation in tsla_close_price is effectively a point on the line, and the plt.plot function just creates a line that connects them.
What about the x axis? We actually didn’t supply any data to the x parameter, so the plt.plot function just generated x axis values from 0 to n – 1 (where n is the total number of observations in the tsla_close_price data).
We can interpret the x axis as the number of days since the IPO. That’s not typically what we’d show … in many cases we’d probably show the date on the x axis. However, I wanted to make this example as simple as possible. Remember my recommendation a few sections ago: when you’re learning syntax, start by studying very simple examples. This example is as simple as it gets.
Change the color of the line
Next, let’s increase the complexity of the chart just a little bit.
Here, we’re going to change the color of the line.
To do this, we’ll use the color parameter.
#------------------
# CHANGE LINE COLOR
#------------------
plt.plot(tsla_close_price, color = 'red')
Which produces the following chart:
This is very simple. We essentially created this with the same code as the previous example, but we added an extra piece of syntax. Essentially, we added the syntax color = 'red', which (surprise) turns the line to a red color.
As you’re playing with this syntax, try out different colors. You can change the color to ‘green’, ‘yellow’, or another of the matplotlib colors. Part of learning data visualization is learning which colors to use. To learn this, you need to try out different aesthetic values, and see what looks good.
Change the width of the line
Now, I’ll show you how to change the width of the line.
To do this, you need to use the linewidth parameter.
This is very straightforward. All we need to do is provide a numeric argument to the linewidth parameter (an integer or decimal number).
By default, the linewidth parameter is typically set to 1.5.
In the charts so far, this has made the line just slightly too thick, so I’m going to reduce it to 1.
#------------------
# CHANGE LINE WIDTH
#------------------
plt.plot(tsla_close_price, linewidth = 1)
And here’s the output:
The difference is subtle, but I think this linewidth looks better for this particular chart.
When you create your own line charts, I recommend playing around with the width of the line. The “right” line width will depend on the chart that you’re making. For some charts you’ll want a thicker line and for others you’ll want a thinner line. As you learn and master data visualization, you’ll simply need to develop your judgement about when to use a thick or thin line.
Having said that, actually setting the width is easy enough. When you’re using pyplot, just use the linewidth parameter.
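The color and linewidth parameters can also be used together in one call. The following is a small sketch of that, again assuming a made-up list called example_prices in place of tsla_close_price and the Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical stand-in for tsla_close_price (made-up values)
example_prices = [23.9, 23.8, 21.9, 19.2, 16.1]

# Set the line color and the line width in a single call
(line,) = plt.plot(example_prices, color = 'red', linewidth = 1)

# Read the settings back from the line object
line_color = line.get_color()
line_width = line.get_linewidth()

print(line_color)
print(line_width)
```

Reading the values back from the line object is just a way to check that the settings were applied; in normal work you would simply look at the chart.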
Improve the formatting of your pyplot line chart
One problem I have with the charts that we’ve made so far is that the formatting is a little ugly.
Unfortunately, this is one of the downsides of standard matplotlib … the default settings create charts that are a little unrefined. The default charts are okay if you’re just doing basic data analysis for personal consumption; they are okay if you aren’t going to show them to anyone important. But if you plan to present your work to anyone important – say important colleagues or a management team – the basic charts aren’t great. You should present charts that have a little more polish.
That being said, in this section, I’ll show you a quick trick for improving the formatting of your Python line chart.
To do this, we’re going to use a simple function from the seaborn module.
Use seaborn formatting to improve your charts
The seaborn module is a data visualization module for Python. I won’t explain seaborn too much here, but at a high level, seaborn works alongside and on top of matplotlib.
We’re going to use a special function from the seaborn package to improve our charts: the seaborn.set() function.
Import seaborn
To use the sns.set() function, you’ll need to import seaborn into your working environment.
The following code will import seaborn with the alias sns.
# import seaborn module
import seaborn as sns
Use seaborn.set() to change default formatting
Once you have seaborn imported, you can use the seaborn.set() function.
To use it, you simply need to call the function by itself.
Because we’ve imported seaborn as sns, we can call the function as sns.set().
#set plot defaults using seaborn formatting
sns.set()
Calling the function this way will change the formatting for your matplotlib charts.
Let’s take a look.
Here, we’re simply going to replot our line chart.
#----------------------------------------
# PLOT LINE CHART WITH SEABORN..
In this tutorial, I’ll show you how to use the loc method to select data from a Pandas dataframe.
If you’re new to Pandas and new to data science in Python, I recommend that you read the whole tutorial. There are some little details that can be easy to miss, so you’ll learn more if you read the whole damn thing.
Again though, I recommend that you slow down and learn step by step. That’s the best way to rapidly master data science.
Ok. Quickly, I’m going to give you an overview of the Pandas module. The specifics about loc[] will follow just afterwards.
A quick refresher on Pandas
To understand the Pandas loc method, you need to know a little bit about Pandas and a little bit about DataFrames.
What is Pandas?
Pandas is a data manipulation toolkit in Python
Pandas is a module for data manipulation in the Python programming language.
At a high level, Pandas exclusively deals with data manipulation (AKA, data wrangling). That means that Pandas focuses on creating, organizing, and cleaning datasets in Python.
However, Pandas is a little more specific.
Pandas focuses on DataFrames. This is important to know, because the loc technique requires you to understand DataFrames and how they operate.
That being the case, let’s quickly review Pandas DataFrames.
This row-and-column format makes a Pandas DataFrame similar to an Excel spreadsheet.
Notice in the example image above, there are multiple rows and multiple columns. Also notice that different columns can contain different data types. A column like ‘continent‘ contains string data (i.e., character data) but a different column like ‘population‘ contains numeric data. Again, different columns can contain different data types.
But, within a column, all of the data must have the same data type. So for example, all of the data in the ‘population‘ column is integer data.
Pandas dataframes have indexes for the rows and columns
Pandas DataFrames have another important feature: the rows and columns have associated index values.
Take a look. Every row has an associated number, starting with 0. Every column also has an associated number.
These numbers that identify specific rows or columns are called indexes.
Keep in mind that all Pandas DataFrames have these integer indexes by default.
Integer indexes are useful because you can use these row numbers and column numbers to select data and generate subsets. In fact, that’s what you can do with the Pandas iloc[] method. Pandas iloc enables you to select data from a DataFrame by numeric index.
But you can also select data in a Pandas DataFrame by label. That’s really important for understanding loc[], so let’s discuss row and column labels in Pandas DataFrames.
Pandas dataframes can also have ‘labels’ for the rows and columns
In addition to having integer index values, DataFrame rows and columns can also have labels.
Unlike the integer indexes, these labels do not exist on the DataFrame by default. You need to define them. (I’ll show you how in a moment.)
When you set them up, the row and column labels look something like this:
Importantly, if you set the labels up right, you can use these labels to subset your data.
And that’s exactly what you can do with the Pandas loc method.
The loc method: how to select data from a dataframe
So now that we’ve discussed some of the preliminary details of DataFrames in Python, let’s really talk about the Pandas loc method.
The Pandas loc method enables you to select data from a Pandas DataFrame by label.
It allows you to “locate” data in a DataFrame.
That’s where we get the name loc[]. We use it to locate data.
It’s slightly different from the iloc[] method, so let me quickly explain that.
How is Pandas loc different from iloc?
This is very straightforward.
The loc method locates data by label.
The iloc method locates data by integer index.
I’m really not going to explain iloc here, so if you want to know more about it, I suggest that you read our Pandas iloc tutorial.
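To make the difference concrete, here is a minimal sketch. The tiny DataFrame, its row labels ('a', 'b', 'c'), and its column name are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame with row labels 'a', 'b', 'c'
toy_df = pd.DataFrame(
    {'population': [10, 20, 30]},
    index = ['a', 'b', 'c']
)

# loc: select by label (row label 'b', column label 'population')
value_by_label = toy_df.loc['b', 'population']

# iloc: select by integer position (second row, first column)
value_by_position = toy_df.iloc[1, 0]

print(value_by_label)
print(value_by_position)
```

Both lines retrieve the same cell. The only difference is how we point to it: by label with loc, or by integer position with iloc.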
The syntax of the Pandas loc method
Now that you have a good understanding of DataFrame structure, DataFrame indexes, and DataFrame labels, let’s get into the details of the loc method.
Here, I want to explain the syntax of Pandas loc.
How does it work?
If you’re familiar with calling methods in Python, this should be very familiar.
Essentially, you’re going to use “dot notation” to call loc[] after specifying a Pandas Dataframe.
So first, you’ll specify a Pandas DataFrame object.
Then, you’ll type a dot (“.“) ….
… followed by the method name, loc[].
Inside of the loc[] method, you need to specify the labels of the rows or columns that you want to retrieve.
It’s important to understand that you can specify a single row or column. Or you can also specify a range of rows or columns. Specifying ranges is called “slicing,” and it’s an important tool for subsetting data in Python. I’ll explain more about slicing later in the examples section of this tutorial.
An important note about the ‘column’ label
There’s one important note about the ‘column’ label.
If you don’t provide a column label, loc will retrieve all columns by default.
Essentially, it’s optional to provide the column label. If you leave it out, loc[] will get all of the columns.
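As a quick sketch of this behavior, consider the following. It assumes a small made-up DataFrame (the names toy_df, 'a', 'b', and the values are hypothetical).

```python
import pandas as pd

# Hypothetical toy DataFrame with two columns and row labels 'a' and 'b'
toy_df = pd.DataFrame(
    {'continent': ['Asia', 'Europe'], 'population': [10, 20]},
    index = ['a', 'b']
)

# Row label only: loc retrieves every column for that row
whole_row = toy_df.loc['b']

# Row label and column label: loc retrieves a single value
single_value = toy_df.loc['b', 'population']

print(whole_row)
print(single_value)
```

When the column label is left out, the result contains one entry per column. When it is supplied, the result is just the one value at that row and column.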
Examples of Pandas loc
Ok. Now that I’ve explained the syntax at a high level, let’s take a look at some concrete examples.
In this examples section, we’re going to focus on simple examples. This is important. When you’re learning, it’s very helpful to work with simple, clear examples. Don’t try to get fancy too early on. Learn the technique with simple examples and then move on to more complex examples later.
Before we actually get into the examples though, we have two things we need to do. We need to import Pandas and we need to create a simple Pandas DataFrame that we can work with.
Import modules
First, we’ll just import Pandas.
We can do this with the following code.
#===============
# IMPORT MODULES
#===============
import pandas as pd
Note that we’re importing Pandas with the alias pd. This makes it possible to refer to Pandas as pd in our code, which simplifies things a little.
Next, we need to create the DataFrame. There are actually three steps to this. We need to first create a Python dictionary of data. Then we need to apply the pd.DataFrame function to the dictionary in order to create a dataframe. Finally, we’ll specify the row and column labels.
Here’s the step where we create the Python dictionary:
Notice that we need to store the output of set_index() back in the DataFrame, country_data_df by using the equal sign. This is because set_index() creates a new object by default; it doesn’t modify the DataFrame in place.
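The three steps can be sketched as follows. This is a reconstruction based on the printed output of country_data_df shown in this tutorial, not necessarily the exact original code; the dictionary name country_data_dict is an assumption.

```python
import pandas as pd

# Step 1: create a dictionary of data
# (country_data_dict is an assumed name; the values match the printed output)
country_data_dict = {
    'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
    'continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe', 'Asia'],
    'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    'population': [322179605, 1403500365, 127748513, 81914672, 65788574, 1324171354]
}

# Step 2: create the DataFrame and fix the column order
country_data_df = pd.DataFrame(
    country_data_dict,
    columns = ['country', 'continent', 'GDP', 'population']
)

# Step 3: use the country column as the row labels
# set_index() returns a new object, so we store the result back
country_data_df = country_data_df.set_index('country')
```

After the last step, country is no longer a regular column. It has become the row index, which is what allows us to select rows by country name.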
Quickly, let’s examine the data with a print statement:
print(country_data_df)
             continent       GDP  population
country
USA      North America  19390604   322179605
China             Asia  12237700  1403500365
Japan             Asia   4872137   127748513
Germany         Europe   3677439    81914672
UK              Europe   2622434    65788574
India             Asia   2597491  1324171354
You can see the row-and-column structure of the data. There are 3 columns: continent, GDP, and population. Notice that the “country” column is set aside off to the left. That’s because the country column has actually become the row index (the labels) of the rows.
Visually, we can represent the data like this:
Essentially, we have a Pandas DataFrame that has row labels and column labels. We’ll be able to use these row and column labels to create subsets.
With that in mind, let’s move on to the examples.
Select a single row with the Pandas loc method
First, I’m going to show you how to select a single row using loc.
Example: select data for USA
Here, we’re going to select all of the data for the row USA.
To do this, we’ll simply call the loc[] method after the dataframe:
country_data_df.loc['USA']
Which produces the following output:
continent     North America
GDP                19390604
population        322179605
Name: USA, dtype: object
This is fairly straightforward, but let me explain.
We’re using the loc[] method to select a single row of data by the row label. The row label for the first row is ‘USA,’ so we’re using the code country_data_df.loc['USA'] to pull back everything associated with that row.
Notice that using loc[] in this way returns the values for all of the columns for that row. It tells us the continent of USA (‘North America‘), the GDP of USA (19390604), and the population of USA (322179605).
The loc method returns all of the data for the row with the label that we specify.
Example: select data for India
Here’s another example.
Here, we’re going to select all of the data for India. In other words, we’re going to select the data for the row with the label India.
Once again, we’ll simply use the name of the row label inside of the loc[] method:
country_data_df.loc['India']
Which produces the following output:
continent           Asia
GDP              2597491
population    1324171354
Name: India, dtype: object
As you can see, the code country_data_df.loc['India'] returns all of the data for the ‘India‘ row.
Now that I’ve shown you one way to select data for a single row, I’m going to show you an alternate syntax.
Select a single row (alternate syntax)
There’s actually another way to select a single row with the loc method.
It’s a little more complicated, but it’s relevant for retrieving “slices” of data, which I’ll show you later in this tutorial.
Here, we’re going to call the loc[] method using dot notation, just like we did before.
Inside of the loc[] method, the first argument will be the label associated with the row we want to return. Here, we’re going to retrieve the data for USA, so the first argument inside of the brackets will be ‘USA.’
After that though, the code will be a little different. After the row label that we want to return, we have a comma, followed by a colon (‘:‘).
The full line of code looks like this:
country_data_df.loc['USA',:]
Which produces the following:
continent     North America
GDP                19390604
population        322179605
Name: USA, dtype: object
Once again, this code has pulled back the row of data associated with the label ‘USA.’
The output of this code is effectively the same as the code country_data_df.loc['USA']. The difference is that we’re using a colon inside of the brackets now (i.e., country_data_df.loc['USA',:]).
Why?
Remember from earlier in this tutorial when I explained the syntax: when we use the Pandas loc method to retrieve data, we can refer to a row label and a column label inside of the brackets.
In the code country_data_df.loc['USA',:], ‘USA‘ is the row label and the colon is functioning as the column label.
But instead of referring to a specific column, the colon basically tells Pandas to retrieve all columns.
The output, though, is still just the row associated with the row label ‘USA‘.
Keep this syntax in mind … it will be relevant when we start working with slices of data.
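If you want to check for yourself that the two syntaxes return the same thing, here is a small sketch. To keep it self-contained, it rebuilds a two-row version of the data (USA and China only), which is an assumption made for brevity.

```python
import pandas as pd

# Two-row version of the country data, rebuilt here so the sketch is self-contained
country_data_df = pd.DataFrame(
    {
        'country': ['USA', 'China'],
        'continent': ['North America', 'Asia'],
        'GDP': [19390604, 12237700],
        'population': [322179605, 1403500365]
    }
).set_index('country')

# Row label only
row_short = country_data_df.loc['USA']

# Row label, plus a colon in the column position (meaning "all columns")
row_with_colon = country_data_df.loc['USA', :]

# Compare the two results
same_result = row_short.equals(row_with_colon)
print(same_result)
```

The colon version is more verbose, but it makes the "row, column" structure of the brackets explicit.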
This tutorial will show you how to add ggplot titles to data visualizations in R.
It will show you step by step how to add titles to your ggplot2 plots. We’ll talk about how to:
add an overall plot title to a ggplot plot
add a subtitle in ggplot
change the x and y axis titles in ggplot
add a plot caption in ggplot
To add titles in ggplot, you need to understand how ggplot2 works
There are several ways to add titles to ggplot2 visualizations, but the primary way to add titles in ggplot2 is by using the labs() function.
Later in this post, I’ll explain the syntax of the labs() function and show you some examples.
But first, I want to give you a quick review of ggplot2.
A quick review of ggplot2
If you’re reading this blog post, you probably know a little bit about ggplot2.
But to understand how to add titles in ggplot2, it helps to really understand how the ggplot2 system works. With that in mind, to make sure you have the background that you need, I’m going to quickly explain how ggplot2 works.
Note: If you want to skip this section and move straight to the section about ggplot2 titles, you can click on this link to skip ahead.
ggplot2 is a package for the R programming language that focuses on data visualization. It gives you a toolkit for creating data visualizations in R.
Keep in mind, ggplot2 is the name of the actual package, but many people use the words ggplot and ggplot2 interchangeably. So, when I’m talking about the package, I sometimes write “ggplot” and sometimes write “ggplot2.” For example, in this post, we’re talking about how to add a “ggplot title” … ggplot is just a nickname for “ggplot2”. Remember, people use the terms interchangeably.
ggplot2 has different functions for different tasks
The ggplot2 visualization package is structured in a highly modular way.
What I mean by this is that the package has many different functions, and each function “does one thing.”
So there’s one function to initiate a plot. There’s one function to add lines to a plot. There is a different function to add titles. Etcetera.
So for example, if you want to create a line chart, you’ll use the ggplot() function to initiate plotting.
Inside of the ggplot() function, there’s the aes() function, which enables you to specify which variables should go on which axes of the chart.
There’s a separate function, geom_line(), which actually draws the lines.
And if you want to add a title to your plot, there’s a separate function for that too.
Essentially, ggplot2 has discrete functions for almost everything you need to do to create a data visualization.
I want to explain that to you, because this is different from many other programming languages. Data visualization in other programming languages (like Python) is not based on discrete functions in this way.
I’m pointing out the modular design of ggplot2 because it’s somewhat relevant to how we add titles to a ggplot2 visualization.
The labs function adds ggplot titles
As I just mentioned, if you want to add a title to your ggplot2 plot, you need to call an additional function.
As it turns out, there are actually a few functions that enable you to add titles to your plot. The ggtitle() function enables you to add an overall plot title. The xlab() function adds an x-axis title and the ylab() function enables you to add a y-axis title.
However, the labs() function can do all of these.
In the rest of this blog post, we’ll be using the labs function to add titles to our ggplot2 plots.
The syntax of the ggplot labs function
Let’s take a look at the syntax of the labs function and how it works.
As I mentioned previously in this tutorial, the ggplot2 system is highly modular.
What this means is that in order to add a title to a ggplot2 plot, you first need to create the plot itself, and then use the labs function after that.
So for example, let’s say that you want to add a title to a line chart.
You’ll first use the ggplot() function, the aes() function, and geom_line() to create a line chart.
After you create your plot with ggplot(), you can add the syntax for the labs function after that:
Notice as well that in this particular example, there is a ‘+‘ after geom_line().
You always need to use a plus symbol (‘+‘) when you add a title using the labs function. This is part of the modular system of ggplot2. You’ll call the ggplot() function, and whenever you call an additional function after ggplot() to modify the plot, you’ll almost always need to use the plus sign.
A detailed explanation of the labs function
Now that you’ve seen where the ggplot labs function fits into the overall ggplot2 system, let’s take a closer look at the internal syntax of the labs function.
When you use the function, you’ll simply call the function by typing labs(), just as I explained above.
But inside of the labs function, there are several parameters that enable you to modify different parts of the plot.
Let’s take a look at each of those parameters, so you know what each one does.
The parameters of the ggplot labs function
There are 5 main parameters that you should know about for the labs function:
title
subtitle
x
y
caption
There are also a few others (like tag), but the 5 listed above are the ones you’ll probably really use.
Importantly, each of those parameters controls a different title (AKA, “label”) of a ggplot visualization, as seen here:
So essentially, you use the parameters to add titles or labels to specific parts of a ggplot visualization.
Let’s quickly examine each parameter.
title
The title parameter adds an overall plot title at the top of the visualization.
subtitle
The subtitle parameter adds a subtitle underneath the plot title.
x
The x parameter adds an x-axis title along the x-axis, at the bottom of the plot.
y
The y parameter adds a y-axis title along the y-axis, along the left hand side of the plot.
caption
The caption parameter adds a small plot caption at the bottom of the plot.
We will typically use this to add a small note about the plot or about the data.
These are the essential parameters of the ggplot labs function.
They’re pretty easy to use, but as always, it’s best to see how they work with real examples.
With that in mind, now let’s take a look at how to use the labs function to add labels and titles to different parts of a ggplot2 chart.
Examples: changing ggplot titles
Here in this examples section, I’ll show you simple, concrete examples of how to use the ggplot labs function to add titles.
Install and load the tidyverse package
But before you can run the code and work with these examples, you’ll need to install and load the tidyverse package.
Keep in mind that we’ll primarily be working with the ggplot2 package. However, the tidyverse package actually includes ggplot2. When you install and load tidyverse, ggplot2 will automatically be installed and loaded as well. We need to install the tidyverse because we’ll need some other tools from the tidyverse package to get our dataset.
So if you haven’t already done so, install the tidyverse package.
You can install the tidyverse package in RStudio by going to the “Tools” menu, and selecting Tools >> Install Packages. This will bring up a message box that will enable you to install the package:
After you’ve installed the tidyverse package, you can load it with the library() function as follows:
library(tidyverse)
Just type that into your program and you should be ready to go.
Get dataset
Next, you’re going to need to get the data that we’ll be working with.
In the following examples, we’ll be working with some data of Tesla’s stock price.
The dataset is contained in a .csv file and is located at a specific webpage.
You’ll need to get that data and import it into R as an R data frame.
To do this, you’ll need to use the read_csv() function from the readr package. (Note that the readr package is one of the packages from the tidyverse package.)
The code to import the data into a dataframe is as follows:
In this tutorial, I’ll show you how to make a matplotlib scatter plot.
The scatter plot is a relatively simple tool, but it’s also essential for doing data analysis and data science.
In fact, if you want to do data science in Python, you really need to know how to create a scatter plot in matplotlib. You should know how to do this with your eyes closed.
This tutorial will show you how to make a matplotlib scatter plot, and it will show you how to modify your scatter plots too.
Overall, the tutorial is designed to be read top to bottom, particularly if you’re new to Python and want the details of how to make a scatter plot in Python. Ideally, it’s best if you read the whole tutorial.
Having said that, if you just need quick help with something, you can click on one of the following links. These links will bring you to the appropriate section in the tutorial.
Again though, if you’re a relative beginner and you have the time, I recommend that you read the full tutorial. Everything will make more sense that way.
Ok. Before I show you how to make a scatter plot with matplotlib, let me quickly explain what matplotlib is.
A quick introduction to matplotlib
Matplotlib is a data visualization module for the Python programming language. It provides Python users with a toolkit for creating data visualizations.
Some of those data visualizations can be extremely complex. You can use matplotlib to create complex visualizations, because the syntax is very detailed. This makes the syntax very adaptable for different visualization problems.
On the other hand, the complex syntax of matplotlib can make it more complicated to quickly create simple data visualizations.
This is where pyplot comes in.
What is pyplot
When you start working with matplotlib, you might read about pyplot.
What is pyplot?
To put it simply, pyplot is part of matplotlib. Pyplot is a sub-module of the larger matplotlib module.
Specifically, pyplot provides a set of functions for creating simple visualizations. For example, pyplot has simple functions for creating simple plots like histograms, bar charts, and scatter plots.
Ultimately, the tools from pyplot give you a simpler interface into matplotlib. It makes visualization easier for some relatively standard plot types.
As I mentioned, one of those plots that you can create with pyplot is the scatter plot.
Let’s take a look at the syntax.
The syntax of the matplotlib scatter plot
Creating a scatter plot with matplotlib is relatively easy.
To do this, we’re going to use the pyplot function plt.scatter().
For the most part, the syntax is relatively easy to understand. Let’s take a look.
First of all, notice the name of the function. Here, we’re calling the function as plt.scatter(). Keep in mind that we’re using the syntax plt to refer to pyplot. Essentially, this code assumes that you’ve imported pyplot with the code import matplotlib.pyplot as plt. For more information on that, see the examples below.
To create a scatter plot with matplotlib though, you obviously can’t just call the function. You need to use the parameters of the function to tell it exactly what to plot, and how to plot it.
With that in mind, let’s take a look at the parameters of the plt.scatter function.
The parameters of plt.scatter
Pyplot’s plt.scatter function has a variety of parameters that you can manipulate … nearly a dozen.
The large number of parameters can make using the function a little complicated though.
So in the interest of simplicity, we’re only going to discuss five of them: x and y, c, s, and alpha.
Let’s talk about each of them.
x and y
The x and y parameters of plt.scatter are very similar, so we’ll talk about them together.
Essentially, they are the x and y axis positions of the points you want to plot.
The data that you pass to each of these should be in an “array like” format. In Python, structures with “array like” formats include things like lists, tuples, and NumPy arrays.
Commonly, you’ll find that people pass data to these parameters in the form of a Python list. For example, you might set x = [1,2,3,4,5].
In this tutorial though, we’ll work with NumPy arrays. You’ll see this later in the examples section, but essentially, we’ll pass values to the x and y parameters in the form of two NumPy arrays.
c
The c parameter controls the color of the points.
There are several ways to manipulate this parameter.
First, you can set the c parameter to a “named color.” Named colors are colors like “red,” “green,” “blue,” and so on. Matplotlib has a large number of named colors, so if you want something specific, take a look at the options and use one in your plot.
You can also set the c parameter using a hexadecimal color. For example, you can set c = "#CC0000" to set the color of the points to a sort of “fire engine red” color. Using hex colors is great, because they can give you very fine-grained control over the colors in your visualization. On the other hand, hexadecimal colors can be a little bit complicated for beginners. That being the case, we’re not going to really cover hex colors in this tutorial.
It’s also possible to create a color mapping for your points, such that the color of the points varies according to some variable. Unfortunately, this is somewhat complicated for a beginner. So in the interest of simplicity, I won’t explain it here. If you’re really interested in complex visualization with more visually appealing colors, I strongly recommend using R’s ggplot2 system instead.
s
The s parameter controls the size of the points.
The default value is controlled by the lines.markersize setting in matplotlib’s rcParams. Specifically, the default size is rcParams['lines.markersize'] ** 2.
We’re not going to work extensively with the s parameter, but I’ll show you a simple example of how it works in the examples below.
alpha
Finally, the alpha parameter controls the opacity of the points.
This must be a value between 0 and 1 (inclusive), where 1 is fully opaque and 0 is fully transparent.
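Here is a minimal sketch of the alpha parameter in use. The x and y values are made up for illustration, and the Agg backend is an assumption so that the code can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# Made-up data, purely for illustration
example_x = [1, 2, 3, 4, 5]
example_y = [2, 4, 6, 8, 10]

# alpha = 0.5 makes the points half transparent
points = plt.scatter(example_x, example_y, c = 'red', alpha = 0.5)

# Read the setting back from the object that plt.scatter returns
points_alpha = points.get_alpha()
print(points_alpha)
```

Lower alpha values are often useful when many points overlap, because the overlapping regions then appear darker than isolated points.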
Examples: how to make a scatter plot in matplotlib
Now that you understand the syntax and the parameters of the plt.scatter function, let’s work through some examples.
Run this code before you get started
One last thing though before you try to run the examples.
… you’ll need to run some code to get these examples to work properly.
Import modules
First, you’ll need to import a few modules into your working environment. The following code will import matplotlib, numpy and pyplot.
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
Create dataset
Also, you need to create some data.
We’re essentially going to create two vectors of data.
We’ll create the first, x_var, by using the np.arange function. This data, x_var, essentially contains the integer values from 0 to 49.
The second variable, y_var, contains the values of x_var with a little random noise added in with the np.random.normal function.
You’ll see what the data looks like in a minute. The whole point of this tutorial is that we’re going to plot it! But essentially, when we plot them together, they will look like highly correlated linear data.
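The description above can be sketched roughly like this. The variable names x_var and y_var come from the tutorial, but the seed value and the amount of noise (the scale argument) are assumptions, so your exact numbers may differ from the charts shown here.

```python
import numpy as np

# Assumed seed, so the "random" noise is reproducible
np.random.seed(42)

# x_var: the integers from 0 to 49
x_var = np.arange(0, 50)

# y_var: x_var plus normally distributed noise
# (scale = 5 is an assumed amount of noise)
y_var = x_var + np.random.normal(size = 50, loc = 0, scale = 5)

# Check how strongly the two variables are correlated
correlation = np.corrcoef(x_var, y_var)[0, 1]
print(correlation)
```

Because the noise is small relative to the range of x_var, the correlation between the two variables is high, which is why the points will fall roughly along a line.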
Ok, now that we have our data, let’s plot it.
How to make a simple scatter plot with matplotlib
We’ll start off by making a very simple scatter plot.
To do this, we’re going to call plt.scatter() and set x = x_var and y = y_var.
# PLOT A SIMPLE SCATTERPLOT
plt.scatter(x = x_var, y = y_var)
And here is the output:
Let me explain a few things about the code and the output.
First, notice the code. We mapped x_var to the x axis and we mapped y_var to the y axis.
You can see that this directly translates into how the points are plotted. For any given point in the scatter plot, the x axis value comes from the x_var variable, and the y axis value comes from the y_var variable. Said differently, the locations of the points are contained in the variables x_var and y_var.
I also want to note that you don’t need to explicitly type the parameters x and y. For example, you could remove x = and y = from the code, and it would still work. Like this:
plt.scatter(x_var, y_var)
This code works the same as plt.scatter(x = x_var, y = y_var). They are operationally identical. If you remove x = and y = from the code, Python still knows that you are passing x_var and y_var to the x and y parameter. It essentially knows that the first variable should be mapped to x and the second should be mapped to y. This is known as defining argument values by position. It’s very common to see that in code, so I want you to understand it.
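If you’d like to verify that the two calls really place the points in the same locations, here is a small sketch. It uses made-up data (example_x and example_y are hypothetical names) and the Agg backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Made-up data, purely for illustration
example_x = np.array([0, 1, 2, 3])
example_y = np.array([1.5, 0.7, 2.9, 3.2])

# Arguments passed by name
by_name = plt.scatter(x = example_x, y = example_y)

# Arguments passed by position
by_position = plt.scatter(example_x, example_y)

# Compare the (x, y) locations of the plotted points
same_locations = np.array_equal(
    np.asarray(by_name.get_offsets()),
    np.asarray(by_position.get_offsets())
)
print(same_locations)
```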
A basic matplotlib scatter plot is a little “ugly”
At this point, I need to point out that a default matplotlib scatter plot is a little plain looking. It’s a little unrefined.
An unrefined chart is fine if you’re doing exploratory data analysis for personal consumption. But if you need to create a chart and show it to anyone important – like a management team in a business – this chart is unrefined. It lacks polish.
Like it or not, that lack of polish will reflect a little poorly on you. You can deny it all you want, but it can be very useful to learn how to polish your charts and make them look more professional. I’ll show you how in an example further down in this tutorial.
Change the color of the points
Although there is a lot we would need to do to make the basic scatter plot look better, changing the color of the points is a simple way to improve the aesthetics of the chart.
Let me show you how.
As noted earlier in this tutorial, you can modify the color of the points by manipulating the c parameter.
Multiple ways to set the color of points
There are actually several different ways to modify the c parameter to change the color of the points.
The two primary ways to do this are to set the parameter to a “named color” or to set the parameter to a “hex color.”
Here in this tutorial, I’ll show you how to set the color of the points to a “named color.” Hex colors are a little more complicated, so I’m not going to explain them here.
We can change the color of the points in our scatter plot by setting the c parameter to a “named color.”
What are named colors? This is very simple. Named colors are colors like “red,” “green,” and “blue.” Matplotlib has a pretty long list of named colors. I recommend that you become familiar with a few of them, so you have a few that you can use regularly in your plots.
When you know what color you want to use for your points, provide that color as the argument to the c parameter.
For example, if you want to set the color of the points to “red” you can use the code c = 'red' inside of plt.scatter.
Here’s the code to do that:
plt.scatter(x_var, y_var, c = 'red')
The code produces the following output:
As you can see, this code has changed the color of the points to red.
This chart still lacks polish, but by using the c parameter, we now have a little more control over the aesthetics of our scatter plot.
Change the size of the points
You can also change the size of the points.
You can do that by using the s parameter.
Changing the size is very similar to changing the color. Just provide a value.
plt.scatter(x_var, y_var, s = 120)
And here’s the output:
As you can see, the size of the points is larger than the size of the points in our simple scatter plot.
The value that you give to the s parameter will be the point size in points**2.
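To make the points**2 unit a little more concrete, here is a rough sketch. The data values are hypothetical, and I’m again assuming the "Agg" backend. Because s is an area, the width of a marker is roughly the square root of s.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, only used to have something to plot
x_var = np.array([1.0, 2.0, 3.0])
y_var = np.array([3.0, 1.0, 2.0])

fig, ax = plt.subplots()
points = ax.scatter(x_var, y_var, s = 120)

# s is an area in points**2, so the marker width is about sqrt(s) points
marker_area = float(points.get_sizes()[0])
marker_width = marker_area ** 0.5
print(marker_area, round(marker_width, 2))
```

So doubling s does not double the width of the points. It doubles their area.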
Make your matplotlib scatter plot look more “professional”
As I mentioned earlier, the default formatting for pyplot plots is a little unrefined.
Again, that’s not a big deal if you’re just exploring data on your laptop and don’t intend to show it to anyone important. The default matplotlib formatting is OK for rough drafts.
But I definitely think you should “polish” your charts if you need to show them to anyone important. For example, if you work in a business environment and you need to present an analysis to a high-level management team, you’ll want your charts to be polished and aesthetically pleasing. The appearance of your visualizations matters. Don’t ignore it.
That being the case, let me show you a quick way to improve the look of your pyplot scatter plots.
We’re going to use a function from the seaborn module to change some of our plot formatting.
To use seaborn, we’ll need to import the seaborn module. You can do that with the following code.
# import seaborn module
import seaborn as sns
Now that seaborn is imported, we’re going to use the seaborn.set() function to re-set the plot defaults:
#set plot defaults using seaborn formatting
sns.set()
After running sns.set(), you can re-plot your data, and you’ll notice that it looks quite a bit better.
#plot scatter plot with matplotlib.pyplot
plt.scatter(x = x_var, y = y_var)
Here’s the plot:
As you can see, the chart looks different. More professional, in my opinion.
The background color has been changed. There are gridlines now. The default color for the points is actually slightly different. The changes here are actually pretty minor, but I think they make a big difference in making the chart look better.
Run this to remove seaborn formatting
One quick note about using seaborn formatting.
If you run the seaborn.set() function above, you may find that all of your pyplot charts have that formatting.
How do you turn it off?
You can remove the seaborn formatting by using the seaborn.reset_orig() function.
# REMOVE SEABORN FORMATTING
sns.reset_orig()
A matplotlib scatter plot using multiple parameters
Let’s do one more example.
Here, we’re going to use several of the parameters and techniques from prior examples together in a single example. The output will be a little more polished, and it will give you a sense of how to create a scatter plot with pyplot while controlling multiple parameters at the same time.
Here’s the code:
# FINALIZED EXAMPLE
import seaborn as sns
sns.set()
plt.scatter(x_var, y_var, s = 120, c = 'red')
And here is the output:
Not bad.
It’s not perfect, and we could probably do a few things to improve it, but a plot like this will be “good enough” in many circumstances.
If you want to learn data science in Python, learn matplotlib
Having said that, if you really want to get the most out of your data visualizations in Python, you need to learn a lot more about matplotlib and pyplot. We’ve really just covered the basics here.
Moreover, if you’re serious about learning data science in Python, you really need to know matplotlib. Data visualization is an important part of data science, and if you’re doing data visualization in Python, matplotlib is often the tool of choice.
For more Python data science tutorials, sign up for our email list
If you’re interested in data science in Python, sign up for our email list now.
Every week, we publish data science tutorials here at the Sharp Sight blog.
By signing up, you’ll get our tutorials delivered directly to your inbox.
You’ll get free tutorials on:
Matplotlib
NumPy
Pandas
Base Python
Scikit learn
Machine learning
Deep learning
… and more.
Want to learn data science in Python? Sign up now.
This tutorial will show you how to use the NumPy max function, which you’ll see in Python code as np.max.
At a high level, I want to explain the function and show you how it works. That being the case, there are two primary sections in this tutorial: the syntax of NumPy max, and examples of how to use NumPy max.
If you’re still getting started with NumPy, I recommend that you read the whole tutorial, start to finish. Having said that, if you just want to get a quick answer to a question, you can skip ahead to the appropriate section with one of the following links:
First, let’s talk about NumPy and the NumPy max function.
A quick introduction to the NumPy max function
It’s probably clear to you that the NumPy max function is a function in the NumPy module.
But if you’re a true beginner, you might not really know what NumPy is. So before we talk about the np.max function specifically, let’s quickly talk about NumPy.
If you’re interested in data science in Python, NumPy is very important. This is because a lot of data science work is simply data manipulation. Whether you’re doing deep learning or data analysis, a huge amount of work in data science is just cleaning data, prepping data, and exploring it to make sure that it’s okay to use.
Again, because of the importance of data manipulation, NumPy is very important for data science in Python.
Numpy is a toolkit for working with numeric data
Specifically though, NumPy provides a set of tools for working with numeric data.
Python has other toolkits for working with non-numeric data and data of mixed type (like the Pandas module). But if you have any sort of numeric data that you need to clean, modify, reshape, or analyze, NumPy is probably the toolkit that you need.
Essentially, NumPy gives you a toolkit for creating arrays of numeric data, and performing calculations on that numeric data.
One of the computations you can perform is calculating the maximum value of a NumPy array. That’s where the np.max function comes in.
NumPy max computes the maximum of the values in a NumPy array
The numpy.max() function computes the maximum value of the numeric values contained in a NumPy array. It can also compute the maximum value of the rows, columns, or other axes. We’ll talk about that in the examples section.
Syntactically, you’ll often see the NumPy max function in code as np.max. You’ll see it written like this when the programmer has imported the NumPy module with the alias np.
Additionally, just to clarify, you should know that np.max and the NumPy amax function, AKA np.amax, are the same function. One name is simply an alias of the other, so aside from the name, they behave identically.
A high level example of how np.max works
Later in this tutorial, I’ll show you concrete examples of how to use the np.max function, but right here I want to give you a rough idea of what it does.
For example, assume that you have a 1-dimensional NumPy array with five values:
We can use the NumPy max function to compute the maximum value:
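Here is a minimal sketch of that idea. The five values in example_array are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical 1-dimensional array with five values
example_array = np.array([3, 9, 1, 7, 5])

# np.max scans the values and returns the largest one
largest_value = np.max(example_array)
print(largest_value)
```

The function looks at every value in the array and hands back the single largest one.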
Although this example shows you how the np.max() function operates on a 1-dimensional NumPy array, it operates in a similar way on 2-dimensional arrays and multi-dimensional arrays. Again, I’ll show you full examples of these in the examples section of this tutorial.
Before we look at the code examples though, let’s take a quick look at the syntax and parameters of np.max.
The syntax of numpy max
The syntax of the np.max function is fairly straightforward, although a few of the parameters of the function can be a little confusing.
Here, we’ll talk about the syntactical structure of the function, and I’ll also explain the important parameters.
A quick note
One quick note before we start reviewing the syntax.
Syntactically, the proper name of the function is numpy.max().
Having said that, you’ll often see the function in code as np.max().
Why?
Commonly, at the start of a program that uses the NumPy module, programmers will import the NumPy module as np. You will literally see a line of code in the program that reads import numpy as np. Effectively, this imports the NumPy module with the alias np. This enables the programmer to refer to NumPy as np in the code, which enables them to refer to the numpy.max function as np.max.
Having said that, let’s take a closer look at the syntax.
An explanation of the syntax
At a high level, the syntax of np.max is pretty straightforward.
There’s the name of the function – np.max() – and inside of the function, there are several parameters that enable us to control the exact behavior of the function.
Let’s take a closer look at the parameters of np.max, because the parameters are what really give you fine-grained control of the function.
The parameters of np.max
The numpy.max function has four primary parameters:
a
axis
out
keepdims
Let’s talk about each of these parameters individually.
a (required)
The a parameter enables you to specify the data that the np.max function will operate on. Essentially, it specifies the input array to the function.
In many cases, this input array will be a proper NumPy array. Having said that, numpy.max (and most of the other NumPy functions) will operate on any “array like sequence” of data. That means that the argument to the a parameter can be a Python list, a Python tuple, or one of several other Python sequences.
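As a quick sketch of what “array like” means in practice, the code below passes a list, a tuple, and a NumPy array to np.max. The values are hypothetical.

```python
import numpy as np

# The same hypothetical values in three different containers
max_from_list = np.max([2, 11, 6])
max_from_tuple = np.max((2, 11, 6))
max_from_array = np.max(np.array([2, 11, 6]))

print(max_from_list, max_from_tuple, max_from_array)
```

All three calls give the same answer, because NumPy converts the list and the tuple to arrays behind the scenes.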
Keep in mind that you need to provide something to this parameter. It is required.
axis (optional)
The axis parameter enables you to specify the axis on which you will calculate the maximum values.
Said more simply, the axis parameter enables you to calculate the row maxima and column maxima.
I’ll explain how to do that with more detail in the examples section below, but let me quickly explain how the axis parameter works.
Axes are like directions along the NumPy array. In a 2-dimensional array, axis 0 is the axis that points down the rows and axis 1 is the axis that points horizontally across the columns.
The axis parameter specifies the axis to compute the maxima
So how does this relate to the axis parameter?
When we use the axis parameter in the numpy.max function, we’re specifying the axis along which to find the maxima.
This effectively lets us compute the column maxima and row maxima.
Let me show you what I mean.
Remember that axis 0 is the axis that points downwards, down the rows.
When we call np.max on an array with axis = 0, we’re effectively telling NumPy to compute the maximum values in that direction … the axis 0 direction.
Effectively, when we set axis = 0, we’re specifying that we want to compute the column maxima.
Similarly, remember that in a 2-dimensional array, axis 1 points horizontally. Therefore, when we use NumPy max with axis = 1, we’re telling NumPy to compute the maxima horizontally, in the axis 1 direction.
This effectively computes the row maxima.
I’ll show you concrete code examples of how to do this, later in the examples section.
Keep in mind that the axis parameter is optional. If you don’t specify an axis, NumPy max will find the maximum value in the whole NumPy array.
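To preview how the three cases differ, here is a small sketch. The array small_array is hypothetical and is not the one used in the examples section.

```python
import numpy as np

# Hypothetical 2-d array with 2 rows and 3 columns
small_array = np.array([[1, 9, 4],
                        [7, 2, 8]])

# axis = 0 works downward, giving one maximum per column
column_maxima = np.max(small_array, axis = 0)   # [7 9 8]

# axis = 1 works horizontally, giving one maximum per row
row_maxima = np.max(small_array, axis = 1)      # [9 8]

# No axis: the maximum of the whole array
overall_max = np.max(small_array)               # 9

print(column_maxima, row_maxima, overall_max)
```

Notice that the array has 3 columns, so axis = 0 returns 3 values, and it has 2 rows, so axis = 1 returns 2 values.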
out (optional)
The out parameter allows you to specify a special output array where you can store the output of np.max.
It’s not common to use this parameter (especially if you’re a beginner) so we aren’t going to discuss this in the tutorial.
out is an optional parameter.
keepdims (optional)
The keepdims parameter is a little confusing, so it will take a little effort to understand.
Ultimately, the keepdims parameter keeps the number of dimensions of the output the same as the number of dimensions of the input.
To understand why this might be necessary, let’s take a look at how the numpy.max function typically works.
When you use np.max on a typical NumPy array, the function reduces the number of dimensions. It summarizes the data.
For example, let’s say that you have a 1-dimensional NumPy array. You use NumPy max on the array.
When you use np.max on a 1-d array, the output will be a single number. A scalar value … not a 1-d array.
Essentially, functions like NumPy max (as well as numpy.median, numpy.mean, etc) summarize the data, and in summarizing the data, these functions produce outputs that have a reduced number of dimensions.
Sometimes though, you want the output to have the same number of dimensions. There are times when if the input is a 1-d array, you want the output to be a 1-d array (even if the output array has a single value in it).
You can do this with the keepdims parameter.
By default, keepdims is set to False. So by default (as discussed above), the output will have fewer dimensions than the input, because np.max summarizes the data.
But if you set keepdims = True, the output will have the same number of dimensions as the input. The axes that were summarized are kept, but each one has a length of 1.
This is a little abstract without a concrete example, so I’ll show you an example of this behavior later in the examples section.
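In the meantime, here is a minimal sketch of the difference. Both arrays are hypothetical, and the thing to watch is the shape of each result.

```python
import numpy as np

# Hypothetical 1-d input
array_1d = np.array([3, 9, 1, 7, 5])

reduced = np.max(array_1d)                 # a scalar: 0 dimensions
kept = np.max(array_1d, keepdims = True)   # a 1-d array holding one value

print(np.ndim(reduced), np.ndim(kept))

# Hypothetical 2-d input with 2 rows and 3 columns
array_2d = np.array([[1, 9, 4],
                     [7, 2, 8]])

col_reduced = np.max(array_2d, axis = 0)                # shape (3,)
col_kept = np.max(array_2d, axis = 0, keepdims = True)  # shape (1, 3)

print(col_reduced.shape, col_kept.shape)
```

The values are the same either way. Only the number of dimensions of the output changes.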
And actually, now that we’ve reviewed the parameters, this is a good spot to start looking at the examples of NumPy max.
Examples: how to use the numpy max function
In this section, I’m going to show you concrete examples of how to use the NumPy max function.
I’ll show you several variations of how to find the maximum value of an array. I’ll show you how to find the maximum value of a 1-d array, how to find the max value of a 2-d array, and how to work with several of the important parameters of numpy.max.
Run this code first
Before we get started, there are some preliminary things you need to do to get set up properly.
First, you need to have NumPy installed properly on your computer.
Import numpy
Second, you need to have NumPy imported into your working environment.
You can import NumPy with the following code:
import numpy as np
Notice that we’ve imported NumPy as np. That means that we will refer to NumPy in our code with the alias np.
Ok, now that that’s finished, let’s look at some examples.
Compute the max of a 1-dimensional array
We’ll start simple.
Here, we’re going to compute the maximum value of a 1-d NumPy array.
To do this, we’ll first just create a 1-dimensional array that contains some random integers. To create this array, we’ll use the numpy.random.randint() function. Keep in mind that you need to use the np.random.seed() function so your NumPy array contains the same integers as the integers in this example.
This syntax will create a 1-d array called np_array_1d.
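If you can’t reproduce the random integers exactly, here is a fallback sketch. Instead of generating the array randomly, it types in the same five values that appear in the output below, so this is an assumption-free stand-in rather than the random-number code itself.

```python
import numpy as np

# Build the array directly from the values shown in this example,
# rather than generating them randomly
np_array_1d = np.array([4, 44, 64, 84, 8])

print(np_array_1d)
print(np.max(np_array_1d))
```

Either way, you end up with a 1-d array of five integers that you can pass to np.max.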
We can print out np_array_1d using the print() function.
print(np_array_1d)
And here’s the output:
[ 4 44 64 84  8]
Visually, we can identify the maximum value, which is 84.
But let’s do that with some code.
Here, we’ll calculate the maximum value of our NumPy array by using the np.max() function.
np.max(np_array_1d)
Which produces the following output:
84
This is an extremely simple example, but it illustrates the technique. Obviously, when the array is only 5 items long, you can visually inspect the array and find the max value. But this technique will work if you have an array with thousands of values (or more!).
Compute the maximum of a 2-d array
Next, let’s compute the maximum of a 2-d array.
To do this, obviously we need a 2-d array to work with, so we’ll first create a 2-dimensional NumPy array.
To create our 2-d array, we’re going to use the np.random.choice() function. Essentially, this function is going to draw a random sample from the integers between 0 and 8, without replacement. After np.random.choice() is executed, we’re using the reshape() method to reshape the integers into a 2-dimensional array with 3 rows and 3 columns.
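The steps described above can be sketched roughly like this. The seed value of 0 is my own assumption, so the arrangement of the integers that you get will probably differ from the array printed below.

```python
import numpy as np

# Hypothetical seed: the arrangement will likely differ from the tutorial's array
np.random.seed(0)

# Draw all 9 integers from 0 to 8 without replacement, then reshape to 3 x 3
np_array_2d = np.random.choice(9, size = 9, replace = False).reshape((3, 3))

print(np_array_2d)
```

Because we sample without replacement, every integer from 0 to 8 appears exactly once, no matter which seed you use.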
Let’s take a look by printing out the array, np_array_2d.
print(np_array_2d)
[[8 2 6]
[7 1 0]
[4 3 5]]
As you can see, this is a 2-d array with 3 rows and 3 columns. It contains the integers from 0 to 8, arranged randomly in the array.
Now, let’s compute the max value of the array:
np.max(np_array_2d)
Which produces the following output:
8
Again, this is a very simple example, but you can use this with a much larger 2-d array and it will operate in the same way. Once you learn how to use this technique, try it with larger arrays!
Next, let’s do something more complicated.
… in the next examples, we’ll compute the column maxima and the row maxima.
Compute the maximum value of the columns of a 2-d array
First up: we’ll compute the maximum values of the columns of an array.
To do this, we need to use the axis parameter. Specifically, we need to set axis = 0 inside of the numpy.max function.
Let’s quickly review why.
The axis parameter specifies which axis you want to summarize
Remember that NumPy arrays have axes, and that the axes are like directions along the array. In a 2-d array, axis 0 is the axis that points downwards, and axis 1 is the axis that points horizontally.
We can use these axes to define the direction along which to use np.max.
So let’s say that we want to compute the maximum values of the columns. This is equivalent to computing the maxima downward.
Essentially, to compute the column maxima, we need to compute the maxima in the axis-0 direction.
Compute max with axis = 0
Let me show you how.
Here, we’re going to re-create our 2-d NumPy array. This is the same as the 2-d NumPy array that we created in a previous example, so if you already ran that code, you don’t need to run it again.
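As a sketch of this step, the code below types in the 2-d array that was printed in the earlier example (rather than regenerating it randomly) and then computes the column maxima.

```python
import numpy as np

# The 2-d array printed earlier, entered by hand
np_array_2d = np.array([[8, 2, 6],
                        [7, 1, 0],
                        [4, 3, 5]])

# axis = 0 computes the maxima downward, one per column
column_maxima = np.max(np_array_2d, axis = 0)
print(column_maxima)
```

The first column contains 8, 7, and 4, so its maximum is 8. The second column gives 3, and the third gives 6.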
This tutorial will explain how to make a matplotlib histogram.
If you’re interested in data science and data visualization in Python, then read on. This post will explain how to make a histogram in Python using matplotlib.
Clicking on any of the above links will take you to the relevant section in the tutorial.
Having said that, if you’re a relative beginner, I recommend that you read the full tutorial.
Ok, let’s get started with a brief introduction to matplotlib.
A quick introduction to matplotlib
If you’re new to Python – and specifically data science in Python – you might be a little confused about matplotlib.
Here’s a very brief introduction to matplotlib. If you only want the material that’s specifically about matplotlib histograms, skip ahead to the section on the syntax of the matplotlib histogram.
What is matplotlib?
Matplotlib is a module for data visualization in the Python programming language.
If you’re interested in data science or data visualization in Python, matplotlib is very important. It will enable you to create very simple data visualizations like histograms and scatterplots in Python, but it will also enable you to create much more complicated data visualizations. For example, using matplotlib, you can create 3-dimensional plots of your data.
Data visualization is extremely important for data analysis and the broader data science workflow. So even if you’re not interested in data visualization per se, you really do need to master it if you want to be a good data scientist.
That means, if you’re doing data science in Python, you should learn matplotlib.
What is pyplot
Related to matplotlib is pyplot.
You’ll often see pyplot mentioned and used in the context of matplotlib. Beginners often get confused about the difference between matplotlib and pyplot, because it’s often unclear how they are related.
In this tutorial, we’ll be using the plt.hist() function from pyplot. Just remember though that a pyplot histogram is effectively a matplotlib histogram, because pyplot is a sub-module of matplotlib.
Now that I’ve explained what matplotlib and pyplot are, let’s take a look at the syntax of the plt.hist() function.
The syntax of the matplotlib histogram
From this point forward, we’re going to be dealing with the pyplot hist() function, which makes a histogram.
The syntax is fairly straightforward in the simplest case. On the other hand, the hist() function has a variety of parameters that you can use to modify the behavior of the function. Really. There are a lot of parameters.
In the interest of simplicity, we’re only going to work with a few of those parameters.
If you really need to control how the function works, and need to use the other parameters, I suggest you consult the documentation for the function.
The parameters of plt.hist
There are 3 primary parameters that we’re going to cover in this tutorial: x, bins, and color.
x
The x parameter is essentially the input values that you’re going to plot. Said differently, it is the data that you want to plot on the x-axis of your histogram.
This parameter will accept an “array or sequence of arrays.”
Essentially, this means that the numeric data that you want to plot in your histogram should be contained in an array-like object, such as a Python list or a NumPy array.
For our purposes later in the tutorial, we’re going to provide our data in the form of a NumPy array.
bins
The bins parameter controls the number of bins in your histogram. In other words, it controls the number of bars in the histogram; remember that a histogram is a collection of bars that represent the tally of the data for that part of the x-axis range.
More often than not, you’ll provide an integer value to the bins parameter. If you provide an integer value, the value will set the number of bins. For example, if you set bins = 30, the histogram will have 30 bars.
You can also provide a string or a Python sequence to the bins parameter to get some additional control over the histogram bins. Having said that, using the bins parameter that way can be a little more complicated, and I don’t recommend it to beginners.
Also, keep in mind that the bins parameter is optional, which means that you don’t need to provide a value.
If you don’t provide a value, matplotlib will use a default value. It takes that default from matplotlib.rcParams, the dictionary-like object that holds matplotlib’s settings (the relevant key is hist.bins). Assuming that you haven’t changed that setting, the bins parameter will default to 10 bins.
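If you’re curious, you can check that default yourself. This is a minimal sketch that assumes the "Agg" backend and uses a hypothetical seed and sample, purely so there is something to plot.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# The default number of bins lives under the "hist.bins" key
default_bins = plt.rcParams["hist.bins"]

# Hypothetical data, only used to draw a histogram
np.random.seed(0)
sample = np.random.normal(size = 200)

fig, ax = plt.subplots()
counts, bin_edges, bars = ax.hist(sample)

# With no bins argument, the number of bars follows the default
print(default_bins, len(counts))
```

If you haven’t customized your matplotlib settings, both numbers should be 10.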
color
As you might guess, the color parameter controls the color of the histogram. In other words, it controls the color of the histogram bars.
This parameter is optional, so if you don’t explicitly provide a color value, it will default to a default value (which is typically a sort of inoffensive blue color).
If you decide to manually set the color, you can set it to a “named” color, like “red,” or “green,” or “blue.” Python and matplotlib have a variety of named colors that you can specify, so take a look at the color options if you manipulate the color parameter this way.
You can also provide hexadecimal colors to the color parameter. This is actually my favorite way to specify colors in data visualizations, because it gives you tight control over the aesthetics of the chart. On the other hand, using hex colors is more complicated, because you need to understand how hex colors work. Hex colors are beyond the scope of this blog post, so I won’t explain them here.
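Without going into how hex colors work, here is a small sketch showing that named colors and hex colors are two spellings of the same thing. It uses the to_hex and to_rgb helpers from matplotlib.colors, and the hex value '#CC0000' is just an illustrative shade of red.

```python
from matplotlib import colors

# A named color converted to its hex spelling
red_as_hex = colors.to_hex("red")

# A hex color converted to red, green, and blue fractions between 0 and 1
hex_as_rgb = colors.to_rgb("#CC0000")

print(red_as_hex, hex_as_rgb)
```

You can pass either form to the color parameter.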
Examples: how to make a histogram in matplotlib
Ok, now that I’ve explained the syntax and the parameters at a high level, let’s take a look at some examples of how to make a histogram with matplotlib.
Most of the examples that follow are simple. If you’re just getting started with matplotlib or Python, first just try running the examples exactly as they are. Once you understand them, try modifying the code little by little just to play around and build your intuition. For example, change the color parameter from “red” to something else. Basically, run the code and then play around a little.
Run this code before you get started
One more thing before we get started with the examples.
Before you run the examples, make sure to run the following code:
Import modules
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
This code will import matplotlib, pyplot, and NumPy.
We’re going to be using matplotlib and pyplot in our examples, so you’ll need them.
Create dataset
Also, run this code to create the dataset that we’re going to visualize.
# CREATE NORMALLY DISTRIBUTED DATA
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)
This will create a dataset called norm_data, using the NumPy random normal function. This data is essentially normally distributed data that has a mean of 0 and a standard deviation of 1. How to use NumPy random normal is beyond the scope of this post, so if you want to understand how the code works, consult our tutorial about np.random.normal.
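One thing to keep in mind: the code above doesn’t set a random seed, so your norm_data values will differ from mine. If you want repeatable data, you can set a seed first. Here is a sketch, where the seed value of 42 is an arbitrary assumption.

```python
import numpy as np

# Hypothetical seed: any integer works, as long as you reuse it
np.random.seed(42)
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)

# Resetting the same seed reproduces exactly the same values
np.random.seed(42)
norm_data_again = np.random.normal(size = 1000, loc = 0, scale = 1)

print(np.array_equal(norm_data, norm_data_again))

# The sample mean and standard deviation should be close to 0 and 1
print(norm_data.mean(), norm_data.std())
```

The sample mean and standard deviation won’t be exactly 0 and 1, because this is a random sample, but they should be close.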
Ok, on to the actual examples.
How to make a simple histogram with matplotlib
Let’s start simple.
Here, we’ll use matplotlib to make a simple histogram.
# MAKE A HISTOGRAM OF THE DATA WITH MATPLOTLIB
plt.hist(norm_data)
And here is the output:
This is about as simple as it gets, but let me quickly explain it.
We’re calling plt.hist() and using it to plot norm_data.
norm_data contains normally distributed data, and you can see that in the visualization.
Aesthetically, the histogram is very simple. Because we didn’t use the color parameter or bins parameter, the visualization has defaulted to the default values. There are 10 bins (my current default) and the color has defaulted to blue. The plot is also relatively unformatted.
I will be honest. I think the default histogram is a little on the ugly side. At least, it’s rather plain. That’s OK if you’re just doing data exploration for yourself, but if you need to present your work to other people, you might need to format your chart to make it look more pleasing.
Change the color of the bars
Let’s talk about how to change the color of the bars, which is one way to make your chart more visually appealing.
As noted above, we can change the color of the histogram bars using the color parameter.
As you saw in the previous example, the bar colors will default to a sort of generic “blue” color.
Here, we’re going to manually set it to “red.”
plt.hist(norm_data, color = 'red')
The code produces the following output:
As you can see, the bars are now red.
The chart is still a little visually boring, but this at least shows you how you can change the color. As you become more skilled in data visualization, you can use the color parameter to make your histograms more visually appealing.
Change the number of bins
Now, let’s modify the number of bins.
Changing the number of bars can be important if your data are a little uneven. You can increase the number of bins to get a more fine-grained view of the data. Or, you can decrease the number of bins to smooth out abnormalities in your data.
Because this tutorial is really about how to create a Python histogram, I’m not going to talk a lot about how histograms are applied. However, I do want you to see how you can modify the bins parameter. That will give you more control over the visualization when you begin to apply the technique.
Here’s the code:
plt.hist(norm_data, bins = 50)
And here’s the output:
So what have we done here?
We increased the number of bins by setting bins = 50. As I noted above, the bins parameter generally defaults to 10 bins. Here, by increasing the number of bins to 50, we’ve generated a more fine-grained view of the data. This can help us see minor fluctuations in the data that are invisible when we use a smaller number of bins.
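You can also confirm the number of bins in code, because plt.hist returns the bin counts and the bin edges. This is a sketch under a few assumptions: the "Agg" backend, a hypothetical seed of 0, and data regenerated inside the example so that it is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical seed, so the example is repeatable
np.random.seed(0)
norm_data = np.random.normal(size = 1000, loc = 0, scale = 1)

fig, ax = plt.subplots()

# hist returns the count in each bin, the bin edges, and the bar objects
counts, bin_edges, bars = ax.hist(norm_data, bins = 50)

# 50 bins means 50 counts and 51 edges; the counts add up to the sample size
print(len(counts), len(bin_edges), int(counts.sum()))
```

With 50 bins there are 51 edges, because each bar needs a left edge and a right edge.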
Make your matplotlib histogram look more “professional”
Now that we’ve covered some of the essential parameters of the plt.hist function, I want to show you a quick way to improve the appearance of your plot.
We’re going to use the seaborn module to change the default formatting of the plot.
To do this, we will first import seaborn.
# import seaborn module
import seaborn as sns
Next, we’ll use the seaborn.set() function to modify the default settings of the chart. As you’ll see in a moment, this will change the default values for the background color, gridlines, and a few other things. Ultimately, it will just make your histogram look better.
#set plot defaults using seaborn formatting
sns.set()
Finally, let’s replot the data using plt.hist.
#plot histogram with matplotlib.pyplot
plt.hist(norm_data)
As you can see, the chart looks different. More professional, in my opinion.
The bar colors are slightly different, and the background has been changed. The changes are actually fairly minor, but I think they make a big difference in making the chart look better.
Run this to remove seaborn formatting
One quick note.
If you run the above code and use the sns.set() function to set the plot defaults with seaborn, you might run into an issue.
… you might find that all of your matplotlib charts have the new seaborn formatting.
How do you make that go away?
You can remove the seaborn formatting defaults by running the following code.
# REMOVE SEABORN FORMATTING
sns.reset_orig()
When you run this code, it will return the plot formatting to the matplotlib defaults.
A histogram example using multiple parameters
Ok, let’s do one more example.
Here, I want to show you how to put the pieces together.
We’re going to modify several parameters at once to create a histogram:
# FINALIZED EXAMPLE
import seaborn as sns
sns.set()
plt.hist(norm_data, bins = 50, color = '#CC0000')
And here is the output:
What have we done here?
We used plt.hist() to plot a histogram of norm_data.
Using the bins parameter, we increased the number of bins to 50 bins.
We used the color parameter to change the color of the bars to the hex color '#CC0000', which is a shade of red.
Finally, we used the sns.set() function to change the plot defaults. This modified the background color and the gridlines.
Overall, I think this is a fairly professional looking chart, created with a small amount of code.
There’s definitely more that we could do to improve this chart (with titles, etc), but for a rough draft, it’s pretty good.
If you want to learn data science in Python, learn matplotlib
In this tutorial, we’re really just scratching the surface.
There’s a lot more that you can do with matplotlib, beyond just making a histogram.
To really get the most out of it, and to gain a solid understanding of data visualization in Python, you need to study matplotlib.
For more Python data science tutorials, sign up for our email list
With that in mind, if you’re interested in learning (and mastering) data visualization and data science in Python, you should sign up for our email list right now.
Here at the Sharp Sight blog, we regularly post tutorials about a variety of data science topics … in particular, about matplotlib.
If you sign up for our email list, our Python data science tutorials will be delivered to your inbox.
You’ll get free tutorials on:
Matplotlib
NumPy
Pandas
Base Python
Scikit learn
Machine learning
Deep learning
… and more.
Want to learn data science in Python? Sign up now.
In this tutorial, you’ll learn how to create a matplotlib bar chart.
Specifically, you’ll learn how to use the plt.bar function from pyplot to create bar charts in Python.
Bar charts in Python are a little challenging
I’ll be honest … creating bar charts in Python is harder than it should be.
People who are just getting started with data visualization in Python sometimes get frustrated. I suspect that this is particularly true if you’ve used other modern data visualization toolkits like ggplot2 in R.
But if you’re doing data science or statistics in Python, you’ll need to create bar charts.
The contents of this tutorial
To try to make bar charts easier to understand, this tutorial will explain bar charts in matplotlib, step by step.
The tutorial has several different sections. Note that you can click on these links and they will take you to the appropriate section.
If you need help with something specific, you can click on one of the links.
However, if you’re just getting started with matplotlib, I recommend that you read the entire tutorial. Things will make more sense that way.
Ok. First, let’s briefly talk about matplotlib.
A quick introduction to matplotlib
If you’re new to data visualization in Python, you might not be familiar with matplotlib.
Matplotlib is a module in the Python programming language for data visualization and plotting.
For the most part, it is the most common data visualization tool in Python. If you’re doing data science or scientific computing in Python, you are very likely to see it.
However, even though matplotlib is extremely common, it has a few problems.
The big problem is the syntax. Matplotlib’s syntax is fairly low-level. The low-level nature of matplotlib can make it harder to accomplish simple tasks. If you’re only using matplotlib, you might need to use a lot of code to create simple charts.
There’s a solution to this though.
To simplify matplotlib, you can use pyplot.
What is pyplot?
Pyplot is a sub-module within matplotlib.
Essentially, pyplot provides a group of relatively simple functions for performing common data visualization tasks.
For example, there are simple functions for creating common charts like the scatter plot, the bar chart, the histogram, and others.
If you’re new to matplotlib and pyplot, I recommend that you check out some of our related tutorials:
In this tutorial though, we’re going to focus on creating bar charts with pyplot and matplotlib.
With that in mind, let’s examine the syntax.
The syntax of the matplotlib bar chart
The syntax to create a bar chart with pyplot isn’t that bad, but it has a few “gotchas” that can confuse beginners.
Let’s take a high-level look at the syntax (we’ll look at the details later).
To create a bar chart with pyplot, we use the plt.bar() function.
Inside of the plt.bar function are several parameters.
In the picture above, I’ve shown four: x, height, width, and color. The plt.bar function has more parameters than these four, but these four are the most important for creating basic bar charts, so we will focus on them.
Let’s talk a little more specifically about these parameters.
The parameters of plt.bar
Here, I’ll explain four important parameters of the plt.bar function: x, height, width, and color.
x
The x parameter specifies the position of the bars along the x axis.
So if your bars are at positions 0, 1, 2, and 3 along the x axis, those are the values that you would need to pass to the x parameter.
You need to provide these values in the form of a “sequence” of scalar values. That means that your values (e.g., 0, 1, 2, 3) will need to be contained inside of a Python sequence, like a list or a tuple.
In this tutorial, I’m assuming that you understand what a Python sequence is. If you don’t, do some preliminary reading on Python sequences first, and then come back when you understand them.
height
The height parameter controls the height of the bars.
Similar to the x parameter, you need to provide a sequence of values to the height parameter …. one value for each bar.
So if there are four bars, you’ll need to pass a sequence of four values. If there are five bars, you need to provide a sequence of five values. Etc.
width
The width parameter controls the width of the bars.
You can provide a single value, in which case all of the bars will have the same width.
Or, you can provide a sequence of values to manually set the width of different bars.
By default, the width parameter is set to 0.8.
color
The color parameter controls the interior color of the bars.
You can set the value to a named color (like “red”, “blue”, “green”, etc) or you can set the color to a hexadecimal color.
Although I strongly prefer hex colors (because they give you a lot of control over the aesthetics of your visualizations), hex colors are a little more complicated for beginners. Having said that, this tutorial will only explain how to use named colors (see the examples below).
Examples: how to make a bar chart plot in matplotlib
Ok … now that you know more about the parameters of the plt.bar function, let’s work through some examples of how to make a bar chart with matplotlib.
I’m going to show you individual examples of how to manipulate each of the important parameters discussed above.
Run this code before you get started
Before you work with the examples, you’ll need to run some code.
You need to run code to import some Python modules. You’ll also need to run code to create some simple data that we will plot.
Import modules
Here is the code to import the proper modules.
We’ll be working with matplotlib, numpy, and pyplot, so this code will import them.
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
Note that we’ve imported numpy with the nickname np, and we’ve imported pyplot with the nickname plt. These are fairly standard in most Python code. We can use these nicknames as abbreviations of the modules … this just makes it easier to type the code.
Create dataset
Next, you need to create some data that we can plot in the bar chart.
We’re going to create three sequences of data: bar_heights, bar_labels, and bar_x_positions.
As noted above, most of the parameters that we’re going to work with require you to provide a sequence of values. Here, all of these sequences have been constructed as Python lists. We could also use tuples or another type of Python sequence. For example, we could use the NumPy arange function to create a NumPy array for bar_heights or bar_x_positions. As long as the structure is a “sequence” it will work.
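Here’s a minimal sketch of what that data could look like. The heights and x positions match the values that the examples in this tutorial describe (bars at 0, 1, 2, and 3 with heights 1, 4, 9, and 16). The label strings are placeholders that I’ve assumed for illustration, so swap in whatever categories you like.

```python
import numpy as np

# Heights and positions follow the values discussed in this tutorial.
bar_heights = [1, 4, 9, 16]
bar_x_positions = [0, 1, 2, 3]

# Placeholder category names (an assumption, not taken from the tutorial).
bar_labels = ['alpha', 'beta', 'gamma', 'delta']

# The same x positions built as a NumPy array with np.arange.
bar_x_positions_array = np.arange(4)
```

All three lists have the same length, one entry per bar.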
Ok, now that we have our data, let’s start working with some bar chart examples.
How to make a simple bar chart with matplotlib
Let’s start with a simple example.
Here, we’re just going to make a simple bar chart with pyplot using the plt.bar function. We won’t do any formatting … this will just produce a bar chart with default formatting.
To do this, we’re going to call the plt.bar() function and we will set bar_x_positions to the x parameter and bar_heights to the height parameter.
# PLOT A SIMPLE BAR CHART
plt.bar(bar_x_positions, bar_heights)
And here is the output:
This is fairly simple, but there are a few details that I need to explain.
First, notice the position of each of the bars. The bars are at locations 0, 1, 2, and 3 along the x axis. This corresponds to the values stored in bar_x_positions and passed to the x parameter.
Second, notice the height of the bars. The heights are 1, 4, 9, and 16. As should be obvious by now, these bar heights correspond to the values contained in the variable bar_heights, which has been passed to the height parameter.
Finally, notice that we’re passing the values bar_x_positions and bar_heights by position. When we do it this way, Python knows that the first argument (bar_x_positions) corresponds to the x parameter and the second argument (bar_heights) corresponds to the height parameter. There’s a bit of a quirk here: in some older versions of matplotlib, making the parameter names explicit by typing plt.bar(x = bar_x_positions, height = bar_heights) would actually produce an error, whereas recent versions accept the keyword form. Passing the arguments by position works either way, so in these examples we put the correct variables in the correct positions inside of plt.bar() and leave out the parameter names.
Change the color of the bars
Next, we’ll change the color of the bars.
This is a very simple modification, but it’s the sort of thing that can make your plot look better, if you do it right.
There are a couple of different ways to change the color of the bars. You can change the bars to a “named” color, like ‘red’, ‘green’, or ‘blue’. Or, you can change the color to a hexadecimal color. Hex colors are a little more complicated, so I’m not going to show you how to use them here. Having said that, hex colors give you more control, so eventually you should become familiar with them.
Ok. Here, we’re going to make a simple change. We’re going to change the color of the bars to ‘red.’
To do this, we can just provide a color value to the color parameter:
plt.bar(bar_x_positions, bar_heights, color = 'red')
The code produces the following output:
Admittedly, this chart doesn’t look that much better than the default, but it gives you a simple example of how to change the bar colors. This code is easy to learn and easy to practice (you should always start with relatively simple examples).
As you become more skilled with data visualization, you will be able to select other colors that look better for a particular data visualization task.
The point here is that you can change the color of the bars with the color parameter, and it’s relatively easy.
Change the width of the bars
Now, I’ll show you how to change the width of the bars.
To do this, you can use the width parameter.
plt.bar(bar_x_positions, bar_heights, width = .5)
And here’s the output:
Here, we’ve set the bar widths to .5. In this case, I think that the default (.8) is better. However, there may be situations where the bars are spaced out at larger intervals. In those cases, you’ll need to make your bars wider. My recommendation is that you make the space between the bars about 20% of the width of the bars.
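The width parameter also accepts a sequence, one width per bar. Here’s a small sketch of that. The specific widths are arbitrary values that I picked for illustration.

```python
import matplotlib.pyplot as plt

bar_x_positions = [0, 1, 2, 3]
bar_heights = [1, 4, 9, 16]

# One width per bar (illustrative values, not from the tutorial).
bar_widths = [0.2, 0.4, 0.6, 0.8]

# plt.bar returns a container holding one rectangle per bar.
bars = plt.bar(bar_x_positions, bar_heights, width=bar_widths)
```

In a script, you would typically follow this with plt.show() to display the chart.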
How to add labels to your bars
You might have noticed in the prior examples that there is a bit of a problem with the x-axis of our bar charts: they don’t have labels.
Let’s take a look by re-creating the simple bar chart from earlier in the tutorial:
# RE-CREATE THE SIMPLE BAR CHART (NO X AXIS LABELS YET)
plt.bar(bar_x_positions, bar_heights)
It produces the following bar chart:
Again, just take a look at the bar labels on the x axis. By default, they are just the x-axis positions of the bars. They are not the categories.
In most cases, this will not be okay.
In almost all cases, when you create a bar chart, the bars need to have labels. Typically, each bar is labeled with an appropriate category.
How do we do that?
When you use the plt.bar function from pyplot, you need to set those bar labels manually. As you’ve probably noticed, they are not included when you build a basic bar chart like the one we created earlier with the code plt.bar(bar_x_positions, bar_heights).
Here, I’ll show you how.
Add bar labels
To add labels to your bars, you need to use the plt.xticks function.
Specifically, you need to call plt.xticks(), and provide two arguments: you need to provide the x axis positions of your bars as well as the labels that correspond to those bars.
So in this example, we will call the function as follows: plt.xticks(bar_x_positions, bar_labels). The bar_x_positions variable contains the position of each bar, and the bar_labels variable contains the labels of each bar. (Remember that we defined both variables earlier in this tutorial.)
# ADD X AXIS LABELS
plt.bar(bar_x_positions, bar_heights)
plt.xticks(bar_x_positions, bar_labels)
And here is the result:
Notice that each bar now has a categorical label.
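As an alternative, plt.bar has a tick_label parameter that sets the labels in the same call. This is just a sketch of that option, and the label strings are placeholders that I’ve assumed.

```python
import matplotlib.pyplot as plt

bar_x_positions = [0, 1, 2, 3]
bar_heights = [1, 4, 9, 16]
bar_labels = ['alpha', 'beta', 'gamma', 'delta']  # placeholder labels

# tick_label places one label under each bar, so no separate
# plt.xticks call is needed.
bars = plt.bar(bar_x_positions, bar_heights, tick_label=bar_labels)
```

Both approaches should give you the same labeled chart. The plt.xticks version is handy when you want to adjust the ticks separately from the bars.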
Improve the formatting of your pyplot bar chart
Ok, now I’ll show you a quick trick that will improve the appearance of your Python bar charts.
One of the major issues with standard matplotlib bar charts is that they don’t look all that great. The standard formatting from matplotlib is – to put it bluntly – ugly.
To be clear, the basic formatting is fine if you’re just doing some data exploration at your workstation. The basic formatting is okay if you’re creating charts for personal consumption.
But if you need to show your charts to anyone important, then the default formatting probably isn’t good enough. The default formatted charts look basic. They lack polish. They are a little unprofessional. You might not understand this, but you need to realize that the appearance of your charts matters when you present them to anyone important.
That being the case, you need to learn to format your charts properly.
The full details of how to format your charts is beyond the scope of this post, but here I’ll show you a quick way to dramatically improve the appearance of your pyplot charts.
Use seaborn formatting to improve your charts
We’re going to use a special function from the seaborn package to improve our charts.
Install and import seaborn
To use this function, you’ll need to have seaborn installed (for example, by running pip install seaborn). Once it’s installed, you can import it with the following code:
# import seaborn module
import seaborn as sns
Use seaborn.set() to change default formatting
Once you have seaborn imported, you can use the seaborn.set() function to set new plot defaults for your matplotlib charts. Because we imported seaborn as sns, we can call the function with sns.set().
#set plot defaults using seaborn formatting
sns.set()
This essentially changes many of the plot defaults like the background color, gridlines, and a few other things.
Let’s replot our bar chart so you can see what I mean.
#plot bar chart
plt.bar(bar_x_positions, bar_heights)
This tutorial will explain how to use the Pandas iloc method to select data from a Pandas DataFrame.
Working with data in Pandas is not terribly hard, but it can be a little confusing to beginners. The syntax is a little foreign, and ultimately you need to practice a lot to really make it stick.
To make it easier, this tutorial will explain the syntax of the iloc method to help make it crystal clear.
Additionally, this tutorial will show you some simple examples that you can run on your own.
This is critical. When you’re learning new syntax, it’s best to learn and master the tool with simple examples first. Learning is much easier when the examples are simple and clear.
Having said that, I recommend that you read the whole tutorial. It will provide a refresher on some of the preliminary things you need to know (like the basics of Pandas DataFrames). Everything will be more cohesive if you read the entire tutorial.
But, if you found this from a Google search, and/or you’re in a hurry, you can click on one of the following links and it will take you directly to the appropriate section:
Matplotlib focuses on data visualization. Commonly, when you’re doing data science or analytics, you need to visualize your data. This is true even if you’re working on an advanced project. You need to perform data visualization to explore your data and understand your data. Matplotlib provides a data visualization toolkit so you can visualize your data. You can use matplotlib for simple tasks like creating scatterplots in Python, histograms of single variables, line charts that plot two variables, etc.
And then there’s Pandas.
Pandas is a data manipulation toolkit in Python
Pandas also focuses on a specific part of the data science workflow in Python.
… it focuses on data manipulation with DataFrames.
Again, in this tutorial, I’ll show you how to use a specific tool, the iloc method, to retrieve data from a Pandas DataFrame.
Before I show you that though, let’s quickly review the basics of Pandas dataframes.
Pandas DataFrames basics
To understand the iloc method in Pandas, you need to understand Pandas DataFrames.
Also, the columns can contain different data types (although all of the data within a column must have the same data type).
Essentially, these features make Pandas DataFrames sort of like Excel spreadsheets.
Pandas dataframes have indexes for the rows and columns
Importantly, each row and each column in a Pandas DataFrame has a number. An index.
This structure, a row-and-column structure with numeric indexes, means that you can work with data by the row number and the column number.
That’s exactly what we can do with the Pandas iloc method.
The iloc method: how to select data from a dataframe
The iloc method enables you to “locate” a row or column by its “integer index.”
We use the numeric, integer index values to locate rows, columns, and observations.
integer locate.
iloc.
Get it?
The syntax of the Pandas iloc isn’t that hard to understand, especially once you use it a few times. Let’s take a look at the syntax.
The syntax of the Pandas iloc method
The syntax of iloc is straightforward.
You call the method by using “dot notation.” You should be familiar with this if you’re using Python, but I’ll quickly explain.
To use the iloc in Pandas, you need to have a Pandas DataFrame. To access iloc, you’ll type in the name of the dataframe and then a “dot.” Then type in “iloc“.
Immediately after the iloc method, you’ll type a set of brackets.
Inside of the brackets, you’ll use integer index values to specify the rows and columns that you want to retrieve. The order of the indexes inside the brackets obviously matters. The first index number will be the row or rows that you want to retrieve. Then the second index is the column or columns that you want to retrieve. Importantly, the column index is optional.
If you don’t provide a column index, iloc will retrieve all columns by default.
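To make the bracket syntax concrete, here’s a minimal sketch using a tiny, made-up DataFrame (the names df, name, and value are only for illustration).

```python
import pandas as pd

# A small hypothetical DataFrame, used only to illustrate the syntax.
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [10, 20, 30]})

# Row index only: the column index is omitted, so all columns come back.
row_only = df.iloc[1]

# Row index first, then column index: a single cell.
row_and_col = df.iloc[1, 1]
```

Here, row_only holds the whole second row, while row_and_col holds only the value in the second row and second column.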
The syntax is simple, but it takes practice
As I mentioned, the syntax of iloc isn’t that complicated.
It’s fairly simple, but it still takes practice.
Even though it’s simple, it’s actually easy to forget some of the details or confuse some of the details.
For example, it’s actually easy to forget which index value comes first inside of the brackets. Does the row index come first, or the column index? It’s easy to forget this.
It’s also easy to confuse the iloc[] method with the loc[] method. This other data retrieval method, loc[], is extremely similar to iloc[], and the similarity can confuse people. The loc[] method works differently though (we explain the loc method in a separate tutorial).
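As a quick illustration of the difference, here’s a sketch with a made-up DataFrame whose row labels are letters instead of numbers. iloc works with integer positions, while loc works with the index labels.

```python
import pandas as pd

# Hypothetical data with string row labels, for illustration only.
df = pd.DataFrame({'value': [10, 20, 30]}, index=['x', 'y', 'z'])

# iloc: select by integer position (the first row).
by_position = df.iloc[0]

# loc: select by index label (the row labeled 'x').
by_label = df.loc['x']
```

In this case both lines retrieve the same row, but they get there in different ways.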
Although the iloc method can be a little challenging to learn in the beginning, it’s possible to learn and master this technique fast. Here at Sharp Sight, our premium data science courses will teach you to memorize syntax, so you can permanently remember all of those important little details.
This tutorial won’t give you all of the specifics about how to memorize the syntax of iloc. But, I can tell you that it just takes practice and repetition to remember the little details. You need to work with simple examples, and practice those examples over time until you can remember how everything works.
Examples of Pandas iloc
Speaking of examples, let’s start working with some real data.
Like I said, you need to learn these techniques and practice with simple examples.
Here, in the following examples, we’ll cover the following topics:
There are two steps to this. First, we need to create a dictionary of lists that contain the data. Essentially, in this structure, the “key” will be the name of the column, and the associated list will contain the values of that column. You’ll see how this works in a minute.
Now that we have our dictionary, country_data_dict, we’re going to create a DataFrame from this data. To do this, we’ll apply the pd.DataFrame() function to the country_data_dict dictionary. Notice that we’re also using the columns parameter to specify the order of the columns.
Now we have a DataFrame of data, country_data_df, which contains country level economic and population data.
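Here’s a sketch of how those two steps could look. The rows for the USA, China, Japan, and India use the values that appear in the outputs in this tutorial. The GDP and population figures for Germany and the UK don’t appear in the text here, so the numbers below are stand-ins that I’ve assumed for illustration.

```python
import pandas as pd

# Step 1: a dictionary of lists. Each key is a column name.
country_data_dict = {
    'country': ['USA', 'China', 'Japan', 'Germany', 'UK', 'India'],
    'continent': ['Americas', 'Asia', 'Asia', 'Europe', 'Europe', 'Asia'],
    # The Germany and UK figures are assumed stand-in values.
    'GDP': [19390604, 12237700, 4872137, 3677439, 2622434, 2597491],
    'population': [322179605, 1403500365, 127748513,
                   81914672, 65788574, 1324171354],
}

# Step 2: build the DataFrame, using columns to fix the column order.
country_data_df = pd.DataFrame(
    country_data_dict,
    columns=['country', 'continent', 'GDP', 'population'],
)
```

The result has six rows (index 0 through 5) and four columns (index 0 through 3).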
Select a single row with iloc
First, I’ll show you how to select single rows with iloc.
For example, let’s just select the first row of data. To do this, we’ll call the iloc method using dot notation, and then we’ll use the integer index value inside of the brackets.
country_data_df.iloc[0]
Which produces the following output:
country USA
continent Americas
GDP 19390604
population 322179605
Name: 0, dtype: object
Essentially, the code pulls back the first row of data, and all of the columns.
Notice that the “first” row has the numeric index of 0. If you’ve used Python for a little while, this should make sense. When we use indexes with Python objects – including lists, arrays, NumPy arrays, and other sequences – the numeric indexes start with 0. The first value of the index is 0. This is very consistent in Python.
Here’s another example.
We can pull back the sixth row of data by using index value 5. Remember, because the index values start at 0, the numeric index value will be one less than the row of data you want to retrieve.
Let’s pull back the row of data at index value 5:
country_data_df.iloc[5]
Which produces the following output:
country India
continent Asia
GDP 2597491
population 1324171354
Name: 5, dtype: object
Again, this is essentially the data for row index 5, which contains the data for India. Here, you can see the data for all of the columns.
Select a single row (alternate syntax)
There’s actually a different way to select a single row using iloc.
This is important, actually, because the syntax is more consistent with the syntax that we’re going to use to select columns, and to retrieve “slices” of data.
Here, we’re still going to select a single row. But, we’re going to use some syntax that explicitly tells Pandas that we want to retrieve all columns.
country_data_df.iloc[0, :]
Which produces the following:
country USA
continent Americas
GDP 19390604
population 322179605
Name: 0, dtype: object
Notice that this is the same output that’s produced by the code country_data_df.iloc[0].
What’s going on here?
Notice that in this new syntax, we still have an integer index for the rows. That’s in the first position just inside of the brackets.
But now we also have a ‘:‘ symbol in the second position inside of the brackets.
The colon character (‘:‘) essentially tells Pandas that we want to retrieve all columns.
Remember from the syntax explanation above that we can use two integer index values inside of iloc[]. The first is the row index and the second is the column index.
When we want to retrieve all columns, we can use the ‘:‘ character.
You’ll understand this more later. It’s relevant for when we retrieve ‘slices’ of data.
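If you want to check for yourself that the two forms return the same thing, here’s a small sketch. It uses a cut-down, two-row DataFrame that I made up for the purpose.

```python
import pandas as pd

# A small stand-in DataFrame for illustration.
df = pd.DataFrame({
    'country': ['USA', 'China'],
    'continent': ['Americas', 'Asia'],
})

implicit = df.iloc[0]      # row 0, columns left unspecified
explicit = df.iloc[0, :]   # row 0, all columns requested with ':'

# Series.equals checks that both results hold the same data.
same = implicit.equals(explicit)
```

Here, same should come back as True.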
Select columns with iloc
Similarly, you can select a single column of data using a special syntax that uses the ‘:‘ character.
Let’s say that we want to retrieve the first column of data, which is the column at index position 0.
To do this, we will use an integer index value in the second position inside of the brackets when we use iloc[]. Remember that the integer index in the second position specifies the column that we want to retrieve.
What about the rows?
When we want to retrieve a single column and all rows we need to use a special syntax using the ‘:‘ character.
You’ll use the ‘:‘ character in the first position inside of the brackets when we use iloc[]. This indicates that we want to retrieve all of the rows. Remember, the first index position inside of iloc[] specifies the rows, and when we use the ‘:‘ character, we’re telling Pandas to retrieve all of the rows.
Let me show you an example of this in action.
Code to retrieve the data for a single column
In this example, we’re going to retrieve a single column.
The code is simple. We have our DataFrame that we created above: country_data_df.
We’re going to use dot notation after the DataFrame to call the iloc[] method.
Inside of the brackets, we’ll have the ‘:‘ character, which indicates that we want to get all rows. We also have 0 in the second position inside the brackets, which indicates that we want to retrieve the column with index 0 (the first column in the DataFrame).
Let me show you the code:
country_data_df.iloc[:,0]
And here is the output.
0 USA
1 China
2 Japan
3 Germany
4 UK
5 India
Name: country, dtype: object
Notice that the code retrieved a single column of data – the ‘country‘ column – which is the first column in our DataFrame, country_data_df.
It’s pretty straightforward. Using the syntax explained above, iloc retrieved a single column of data from the DataFrame.
Select a specific cell using iloc
Now, let’s move on to something a little more complicated.
Here, we’re going to select the data in a specific cell in the DataFrame.
You’ll just use iloc[] and specify an integer index value for the data in the row and column you want to retrieve.
So if we want to select the data in row 2 and column 0 (i.e., row index 2 and column index 0) we’ll use the following code:
country_data_df.iloc[2,0]
Which produces the following output:
'Japan'
Again. This is pretty straightforward.
Using the first index position, we specified that we want the data from row 2, and we used the second index position to specify that we want to retrieve the information in column 0.
The data that fits both criteria is Japan, in cell (2, 0).
Notice that the Pandas DataFrame essentially works like an Excel spreadsheet. You can just specify the row and column of the data that you want to pull back.
Retrieve “slices” of data
Now that I’ve explained how to select specific rows and columns using iloc[], let’s talk about slices.
When we “slice” our data, we take multiple rows or multiple columns.
There’s a special syntax to do this, which is related to some of the examples above.
Essentially, we can use the colon (‘:‘) character inside of iloc[] to specify a start row and a stop row.
Keep in mind that the row number specified by the stop index value is not included.
It’s always best to illustrate an abstract concept with a concrete example, so let’s take a look at an example of how to use iloc to retrieve a slice of rows.
Example: retrieve a slice of rows using iloc
Here, we’re going to retrieve a subset of rows.
This is pretty straightforward.
We’re going to specify our DataFrame, country_data_df, and then call the iloc[] method using dot notation.
Then, inside of the iloc method, we’ll specify the start row and stop row indexes, separated by a colon.
Here’s the exact code:
country_data_df.iloc[0:3]
And here are the rows that it retrieves:
  country continent       GDP  population
0     USA  Americas  19390604   322179605
1   China      Asia  12237700  1403500365
2   Japan      Asia   4872137   127748513
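The same start:stop pattern can be used in the column position as well. Here’s a sketch with a three-row stand-in DataFrame (built only for this illustration) that slices columns, and then rows and columns together.

```python
import pandas as pd

# A small stand-in DataFrame for illustration.
df = pd.DataFrame({
    'country': ['USA', 'China', 'Japan'],
    'continent': ['Americas', 'Asia', 'Asia'],
    'GDP': [19390604, 12237700, 4872137],
})

# All rows, columns at index 0 and 1 (the stop value 2 is excluded).
first_two_columns = df.iloc[:, 0:2]

# Rows at index 1 and 2, columns at index 0 and 1.
block = df.iloc[1:3, 0:2]
```

Just like with rows, the column at the stop index is not included.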
If you’re a relative beginner with NumPy, I recommend that you read the full tutorial.
But if you only need help with a specific aspect of the NumPy median function, then you can click on one of the links below. The following links will take you to the appropriate section of the tutorial:
If you’re a real beginner, you may not be 100% familiar with NumPy. So before I explain the np.median function, let me explain what NumPy is.
What exactly is NumPy?
NumPy is a data manipulation module for Python
NumPy is a data manipulation module for the Python programming language.
At a high level, NumPy enables you to work with numeric data in Python. A little more specifically, it enables you to work with large arrays of numeric data.
NumPy also has a set of tools for performing computations on arrays of numeric data. You can do things like combine arrays of numeric data, split arrays into multiple arrays, or reshape arrays into arrays with a new number of rows and columns.
NumPy also has a set of functions for performing calculations on numeric data. The NumPy median function is one of these functions.
Now that you have a broad understanding of what NumPy is, let’s take a look at what the NumPy median function is.
NumPy median computes the median of the values in a NumPy array
The NumPy median function computes the median of the values in a NumPy array. Note that the NumPy median function will also operate on “array-like objects” like Python lists.
Let’s take a look at a simple visual illustration of the function.
Imagine we have a 1-dimensional NumPy array with five values:
We can use the NumPy median function to compute the median value:
It’s pretty straightforward, although the np.median function can get a little more complicated. It can operate on 2-dimensional or multi-dimensional array objects. It can also calculate the median value of each row or column. You’ll see some examples of these operations in the examples section.
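In code, the simple 1-dimensional case looks roughly like this. The five values are ones that I chose for illustration, and the second call shows the function working on a plain Python list.

```python
import numpy as np

# A 1-dimensional array with five values (chosen for illustration).
np_array_1d = np.array([5, 2, 9, 1, 7])

# Sorted, the values are 1, 2, 5, 7, 9, so the middle value is 5.
median_value = np.median(np_array_1d)

# np.median also accepts array-like objects such as lists.
# With an even number of values, it averages the two middle ones.
median_of_list = np.median([1, 2, 3, 4])
```

Here, median_value is 5.0 and median_of_list is 2.5.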
Ok. Now let’s take a closer look at the syntax of the NumPy median function.
The syntax of numpy median
A quick note
One quick note. This explanation of the syntax and all of the examples in this tutorial assume that you’ve imported the NumPy module with the code import numpy as np.
This is a common convention among NumPy users. When you write and run a NumPy/Python program, it’s common to import NumPy as np. This enables you to refer to NumPy with the “nickname” np, which makes the code a little simpler to write and read.
I just wanted to point this out to you to make sure you understand.
An explanation of the syntax
Ok. Let’s take a look at the syntax.
Assuming that you’ve imported NumPy as np, you call the function by the name np.median(). In some programs, you might also see the function called as numpy.median(), if the coder imported NumPy as numpy. Both are relatively common, and it really depends on how the NumPy module has been imported.
Inside of the median() function, there are several parameters that you can use to control the behavior of the function more precisely. Let’s talk about those.
The parameters of numpy median
The np.median function has four parameters that we will discuss:
a
axis
out
keepdims
There’s actually a fifth parameter called overwrite_input. The overwrite_input parameter is not going to be very useful for you if you’re a beginner, so for the sake of simplicity, we’re not going to discuss it in this tutorial.
Ok, let’s quickly review what each parameter does:
a (required)
The a parameter specifies the data that you want to operate on. It’s the data on which you will compute the median.
Typically, this will be a numpy array. However, the np.median function can also operate on “array-like objects” such as Python lists. For the sake of simplicity, this tutorial will work with NumPy arrays, but remember that many (if not all) of the examples would work the same way if you used an array-like object instead.
Note that this parameter is required. You need to provide something to the a parameter, otherwise the np.median function won’t work.
axis (optional)
The axis parameter controls the axis along which the function will compute the median.
More simply, the axis parameter enables you to compute median values along the rows of an array, or the median values along the columns of an array (instead of computing the median of all of the values).
Using the axis parameter confuses many people.
Later in this tutorial, I’ll show you an example of how to use the axis parameter; hopefully that will make it more clear.
But quickly, let me explain how this works.
NumPy arrays have axes. It’s best to think of axes as directions along the array.
So if you have a 2-dimensional array, there are two axes: axis 0 is the direction down the rows and axis 1 is the direction across the columns. (Keep in mind that higher-dimensional arrays have additional axes.)
When we use NumPy functions like np.median, we can often specify an axis along which to perform the computation.
So when we set axis = 0, the NumPy median function computes the median values downward along axis 0. This effectively computes the column medians.
Similarly, when we set axis = 1, the NumPy median function computes the median values horizontally across axis 1. This effectively computes the row medians.
Hopefully these images illustrate the concept and help you understand.
But if you’re still confused, I’ll show you examples of how to use the axis parameter later in the examples section.
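As a quick preview, here’s a small sketch of the axis parameter on a 2-dimensional array. The array values are arbitrary numbers that I picked.

```python
import numpy as np

# A 2 x 3 array (two rows, three columns), values chosen for illustration.
np_array_2d = np.array([[1, 5, 3],
                        [7, 2, 9]])

# No axis: the median of all six values.
overall_median = np.median(np_array_2d)

# axis = 0: compute downward, giving one median per column.
column_medians = np.median(np_array_2d, axis=0)

# axis = 1: compute across, giving one median per row.
row_medians = np.median(np_array_2d, axis=1)
```

With this array, column_medians has three values (one per column) and row_medians has two values (one per row).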
out (optional)
The out parameter enables you to specify a different output array where you can put the result.
So if you want to store the result of np.median in a different array, you can use the out parameter to do that.
This is an optional parameter.
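A small sketch of how out might be used, assuming you have already created an array of the right shape to hold the result. The name result_buffer is hypothetical.

```python
import numpy as np

data = np.array([[0, 20, 40],
                 [60, 80, 100]])

# Pre-allocated float array with one slot per column;
# its shape has to match the shape of the result
result_buffer = np.empty(3, dtype=float)

# The column medians are written into result_buffer
np.median(data, axis=0, out=result_buffer)

print(result_buffer)  # [30. 50. 70.]
```

In most everyday code you can simply assign the return value of np.median to a variable instead, so you may rarely need this parameter.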
keepdims (optional)
The keepdims parameter enables you to make the output have the same number of dimensions as the input.
This is a little confusing to many people, so let me explain.
Remember that the np.median function (and other similar functions like np.sum and np.mean) summarize your data in some way. They are computing summary statistics.
When you summarize the data in this way, you are effectively collapsing the number of dimensions of the data. For example, if you have a 1-dimensional NumPy array, and you compute the median, you are collapsing the data from a 1-dimensional structure down to a 0-dimensional structure (a single scalar number).
Or similarly, if you compute the column medians of a 2-d array, you’re collapsing the data from 2 dimensions down to 1 dimension.
Essentially, the output of the NumPy median function has a reduced number of dimensions.
What if you don’t want that? What if you want the output to have the same number of dimensions as the input?
You can force NumPy median to keep the number of dimensions the same by using the keepdims parameter. We can set keepdims = True to give the output the same number of dimensions as the input.
I understand that this might be a little abstract, so I’ll show you an example in the examples section.
Note: the keepdims parameter is optional. By default it is set to keepdims = False, meaning that the output of np.median will not necessarily have the same number of dimensions as the input.
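To make this a little less abstract, here is a short sketch that compares the output with and without keepdims. The array is just an illustrative 2 by 3 example.

```python
import numpy as np

data = np.array([[0, 20, 40],
                 [60, 80, 100]])

# Default behavior: axis 0 is collapsed, so the result is 1-dimensional
reduced = np.median(data, axis=0)

# keepdims=True: the collapsed axis is kept with length 1
kept = np.median(data, axis=0, keepdims=True)

print(reduced.shape)  # (3,)
print(kept.shape)     # (1, 3)
```

Notice that the output with keepdims = True has the same number of dimensions as the input (two), but not the same shape: the summarized axis now has a length of 1.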
Examples: how to use the numpy median function
Ok. Let’s work through some examples. In the last section I explained the syntax, which is probably helpful. But to really understand the code, you need to play with some examples.
Run this code first
Before you get started with the examples though, you’ll need to run some code.
Import numpy
You need to import NumPy. Run this code to properly import NumPy.
import numpy as np
By running this code, you’ll be able to refer to NumPy as np when you call the NumPy functions.
Compute the median of a 1-dimensional array
Ok.
This first example is very simple. We’re going to compute the median value of a 1-dimensional array of values.
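A minimal sketch of the array creation, assuming the values run from 0 to 100 in steps of 20 and the array is named np_array_1d (the name used in the np.median call in this example):

```python
import numpy as np

# Assumed values: six numbers from 0 to 100 in increments of 20
np_array_1d = np.array([0, 20, 40, 60, 80, 100])

print(np_array_1d)  # [  0  20  40  60  80 100]
```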
This is pretty straightforward. Using the np.array function, we’ve created an array with six values from 0 to 100, in increments of 20.
Now, we’ll calculate the median of these values.
np.median(np_array_1d)
Which gives us the following output:
50.0
This is fairly straightforward, but I’ll quickly explain.
Here, the NumPy median function takes the NumPy array and computes the median.
The median of these six values is 50, so the function outputs 50.0 as the result.
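If you are wondering where the 50 comes from: with an even number of values, the median is the average of the two middle values once the data is sorted. The following is a rough sketch of that arithmetic done by hand; it is only an illustration, not how NumPy implements the function internally.

```python
import numpy as np

values = np.array([0, 20, 40, 60, 80, 100])

# Sort the values and locate the two middle positions
sorted_values = np.sort(values)
n = sorted_values.size
middle_low = sorted_values[n // 2 - 1]   # 40
middle_high = sorted_values[n // 2]      # 60

# With an even count, the median is the mean of the two middle values
manual_median = (middle_low + middle_high) / 2

print(manual_median)      # 50.0
print(np.median(values))  # 50.0
```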
Compute the median of a 2-d array
Next, let’s work through a slightly more complicated example.
Here, we’re going to calculate the median of a 2-dimensional NumPy array.
First, we’ll need to create the array. To do this, we’re going to use the NumPy array function to create a NumPy array from a list of numbers. After that, we’re going to use the reshape method to reshape the data from a 1-dimensional array to a 2-dimensional array that has 2 rows and 3 columns.
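A sketch of that step, assuming the same six values as the previous example and the name np_array_2d (the name used in the print call in this example):

```python
import numpy as np

# Create a 1-dimensional array, then reshape it to 2 rows and 3 columns
np_array_2d = np.array([0, 20, 40, 60, 80, 100]).reshape((2, 3))
```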
And we can examine the array by using the print() function.
print(np_array_2d)
[[  0  20  40]
 [ 60  80 100]]
As you can see, this dataset has six values arranged in a 2 by 3 NumPy array.
Now, we’ll compute the median of these values.
np.median(np_array_2d)
Which produces the following output:
50.0
This example is very similar to the previous example. The only difference is that in this example, the values are arranged into a 2-dimensional array instead of a 1-dimensional array.
Ultimately though, the result is the same.
If we use the np.median function on a 2-dimensional NumPy array, by default, it will just compute the median of all of the values in the array. Here in this example, we only have six values in the array, but we could also have a larger number of values … the function would work the same.
Moreover, the NumPy median function would also work this way for higher dimensional arrays. For example, if we had a 3-dimensional NumPy array, we could use the median() function to compute the median of all of the values.
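As a brief sketch of the 3-dimensional case, here is an illustrative array built with np.arange. The shape and values are assumptions chosen just for this example.

```python
import numpy as np

# Illustrative 3-d array holding the integers 0 through 23
np_array_3d = np.arange(24).reshape((2, 3, 4))

# With no axis argument, all 24 values are summarized into one median
median_all = np.median(np_array_3d)

print(median_all)  # 11.5
```

The two middle values of 0 through 23 are 11 and 12, so the median is 11.5, regardless of how the values are arranged into dimensions.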
However, with 2-d arrays (and multi-dimensional arrays) we can use the axis parameter to compute the median along rows, columns, or other axes.
Let’s take a look.
Compute the median value of the columns of a 2-d array
First, I’m going to show you how to compute the median of the columns of a 2-dimensional NumPy array.
To do this, we need to use the axis parameter. Remember from earlier in the tutorial that NumPy axes are like directions along the rows and columns of a NumPy array.
Remember: axis 0 is the direction that points down along the rows, and axis 1 is the direction that points horizontally across the columns (in a 2-d array).
The axis parameter specifies which axis you want to summarize
So how exactly does the axis parameter control the behavior of np.median?
This is important: when you use the axis parameter, the axis parameter controls which axis gets summarized.
Said differently, it controls which axis gets collapsed.
So if you set axis = 0 inside of np.median, you’re effectively telling NumPy to compute the medians downward. The medians will be computed down along axis 0. Essentially, it will collapse axis 0 and compute the medians down that axis.
In other words, it will compute the column medians.
This confuses many people, because they think that by setting axis = 0, it will compute the row medians. That’s not how it works.
Again, it helps to think of NumPy axes as directions. The axis parameter specifies the direction along which the medians will be computed.
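One way to see the “collapsing” is to look at the shape of the result. In this sketch (the 4 by 3 array is an assumption for illustration), the axis you name in the axis parameter is the one that disappears from the output shape.

```python
import numpy as np

# Illustrative array with 4 rows and 3 columns
example_array = np.arange(12).reshape((4, 3))

# axis=0 collapses the 4 rows, leaving one value per column
collapsed_axis_0 = np.median(example_array, axis=0)

# axis=1 collapses the 3 columns, leaving one value per row
collapsed_axis_1 = np.median(example_array, axis=1)

print(example_array.shape)     # (4, 3)
print(collapsed_axis_0.shape)  # (3,)
print(collapsed_axis_1.shape)  # (4,)
```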
Compute a median with axis = 0
Let me show you.
Here, we’re going to compute the column medians by setting axis = 0.
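A sketch of that call, assuming np_array_2d is the 2 by 3 array from the previous example (it is recreated here so the code runs on its own):

```python
import numpy as np

# Same 2x3 array as in the previous example
np_array_2d = np.array([0, 20, 40, 60, 80, 100]).reshape((2, 3))

# Summarize axis 0, which gives one median per column
column_medians = np.median(np_array_2d, axis=0)

print(column_medians)  # [30. 50. 70.]
```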
NumPy calculated the medians along axis 0. This effectively computes the column medians: 30 for the first column, 50 for the second, and 70 for the third.
Again, this might seem counterintuitive, so remember what I said previously. The axis parameter controls which axis gets summarized. By setting axis = 0, we told NumPy median to summarize axis 0.
Compute the row medians with axis = 1
Now, let’s compute the row medians.
This example is almost identical to the previous..