Jon Schwabish maintains an excellent site of resources for newbies and professionals alike. One section of note is Remakes, where Schwabish takes an existing chart and offers critique by making a new version. But Schwabish might be known best for The PolicyViz Podcast, a series of interviews with practitioners in the field.
It almost goes without saying, but tables are a way to
Tables of regression results are a common type of table researchers create. But they can be frustrating to create well—point estimates, standard errors, asterisks, variable names, column headers, and notes means there are a lot of potential objects to corral. The “outreg2” command in Stata, for example, enables you to output your regression results to Excel files, and will get you pretty close; and I’ve been told (but haven’t tried) the broom package in R. In any case, there are likely times when your table results don’t look great because you’re moving from one tool to another.
Take this table, for example. It’s just a random example you could likely find in any academic journal article or working paper. Start with the first column of numbers and notice how the decimals are not lined up for many of them— -0.8094, for example, is shifted slightly to the right. It’s even a little worse when there are asterisks denoting statistical significance. Look at the third column—the 0.0016*** at the top and 0.00054*** about halfway down are not lined up with the rest of the numbers.
Aligning numbers along the decimal/comma makes it easier for your reader to find large and small values. Overall, it also seems to make tables easier to navigate. In which table is it easier to find the largest and smallest values?
For creating regression tables in Excel, what’s the best way to align all of the various table elements together? One challenge is that if you have point estimates with asterisks denoting statistical significance, Excel treats those values as Text and not as Numbers. Thus, the standard formatting approach where you would simply format the cells as numbers is not going to work. I think I’ve come up with a workaround using Custom Number Formats that will enable you to format tables of regression results using formulas so that you can easily apply them to all of your tables.
If you have a
different method, please share it in the comments section below—I’m curious to
see what approaches others are taking!
A Quick Primer on
Custom Number Formats
Custom number formats allow you to control how numbers are displayed in Excel. The primary benefit of custom number formats is that the formats stay the same while the numbers can change. I won’t go into a full treatment of how they work here but if you’re interested I highly recommend Dave Bruns’ blog post over at ExcelJet.
The basic structure of custom number formats may look complex, but they have a specific syntax. Semicolons are used to parse the formats into four sections:
You don’t need to specify all four values, and formats can handle different colors, special symbols, and other formatting.
Using Number Formats
Let’s say you want to format two numbers—5,000 and -5,000—and
you want to align them following three rules:
they should both have a comma after the 5,
the negative number is placed within parentheses,
both numbers are aligned along the comma.
In other words, you want “5,000” and “(5,000)” aligned along
We can use Excel’s built-in Custom Format menu to apply this preferred format. Select the two
cells, right-click, and select Format
Cells. At the bottom of that menu is the Custom area, in which we can select this default built-in format: #,##0_);(#,##0).
This format is doing something pretty simple:
Positive numbers (the first argument) are
formatted with a comma and any trailing zeros; the “_)” term adds a space the
width of a closed parentheses after the number. (They underscore says ignore
the next object and insert a space the width of that object, here a
Negative numbers (the second argument) are similarly
formatted with a comma and a trailing zero and the number is surrounded by
Another way to apply these formats is to use Excel’s TEXT
function. The TEXT function has two arguments:
Value: the number to be formatted
Format_text: the number format to be applied
In this case, we put the number formats in their own cells and then refer to them in the TEXT formula: =TEXT(A1,B1). As you can see in cells C1 and C2, this gives you the same result as before, but you don’t need to click on anything.
Now that we (kind of) understand Excel’s custom number
formatting menu and how to use the TEXT function, we can put them together to format
a table of regression coefficients.
The first step is to move the numbers from the original
table (above) into Excel. A simple copy and paste yields a table that is even
more unaligned than the original.
One potentially easy way to resolve the formatting issue is to put the asterisks in their own columns. If we do that and merge the column headers across the pairs of columns, the digits look lined up, but the asterisks look a little far from the numbers. This isn’t the worst table in the world and at least the numbers are aligned along the commas. This has been the approach I’ve been using for a while.
Now let’s try using the TEXT function to create a table in
which the asterisks are close to the numbers and everything is nicely aligned.
Using the same approach as before, I’m going to first
determine how many spaces I need after the number so that the numbers are
aligned along the decimal. That determination is made based on the statistical
significance (one, two, or three asterisks). To help make that assessment, I
use the table above in which the numbers are in one cell and the asterisks are
by themselves in a neighboring cell.
You can see in this little lookup table, I list the various
formats I’m going to use in the main table. If the number is statistically
significant, I’ll include asterisks in the format; if not, I’ll include some
number of spaces after the number to get the spacing right.
I can now embed the TEXT formula within an IF formula to reference each format. In the second
estimate in Model 1 [“Percent Black” in cell AL4: 0.2628**], for example, the
All this says is if the point estimate has one star, place
the point estimate in cell J3 (the original number) and use the number format
in cell AL18; if it has two stars, use the point estimate in cell J3 and use
the number format in cell AL17; and so on. As you can see, the final format is
a number with a leading zero, four decimals, and two asterisks.
The last part is to format the column titles. At the top of
the table, you can see the differences between aligning the numbers to the
left, right, or center. Personally, I’m partial to customizing the placement of
the title—as long as I’m doing the numbers, I might as well do the text too. I
again use the TEXT formula to apply the format specified at the bottom of the
Paul Kiel and Hannah Fresques at ProPublica wrote this really interesting article a few weeks ago about the geographic distribution of Internal Revenue Service (IRS) audit rates. They found that states with the lowest audit rates “tend to be home to middle income, largely white populations.” Using data tabulated by former IRS senior economist Kim M. Bloomquist, they created two county-level maps of the IRS audit rates, one that shows counties are audited at a higher rate than the national average and another that shows counties that are audited at a lower rate.
The maps remind me of this great piece from the Washington Post that uses a pair of tile grid maps to examine the relationship between state-level suicide rates and gun ownership rates. As much as I liked those maps and these ProPublica maps, it’s really difficult to see the relationship between audit rates and, say, incomes across the country. I thought a scatterplot might be another visual that would help show this relationship in a clearer way.
Paul and Kim very nicely sent me their audit rate data—which they have now made publicly available—and I grabbed the poverty rate for each county from the US Census Bureau. Using Tableau, I created a scatterplot where the circles are all equally-sized and a bubble plot where the bubbles are sized according to population (data again from Census). I don’t love the bubble plot because the small counties become really small (even if you use logs). In both cases, you can see the positive relationship between the poverty rate and audit rate, which is harder to see in the maps.
Because part of the ProPublica article mentions the two major US political parties, I also layered on the political leanings of each county by adding colors–red for counties that voted for Donald Trump and blue for those that voted for Hillary Clinton in the 2016 US presidential election (voting data from Tom McGovern’s Github page).
As you can see, all but three of the counties below a 6% audit rate voted for Trump. Many of the counties in the top-right part of the space are blue, but they tend to be fairly equally distributed elsewhere in the space. I also added some labels to highlight some of the obvious outliers and also added a couple of annotations to help explain what’s going on in the graph.
I don’t view this scatterplot as a replacement for the two maps in the original ProPublica story, but
perhaps as an additional perspective that can help readers more clearly see the
correlation between the audit rate and this measure of poverty. That being said,
perhaps many readers would have a hard time reading this scatterplot and it would
just clutter the story. In either case, it was a fun exercise to play around
with the data.
This post was originally published on the Urban Institute’s blog, Urban Wire on March 20, 2019.
Major League Baseball’s (MLB) Opening Day is less than two weeks away. For fans like me, it’s a glorious day of blue skies, green grass, and white baseballs with red stitching. For players like Mike Trout and Bryce Harper, it’s time to start driving in runs and helping win games.
Trout is reportedly close to signing the largest contract in sports history, a $430 million, 12-year extension with the Los Angeles Angels. Trout’s contract comes just days after Harper’s own record-breaking $330 million, 13-year contract with the Philadelphia Phillies.
These contracts are the most recent in a long and dramatic increase in MLB salaries. According to the MLB Players Association (PDF), the average ballplayer earned more than $4 million in 2017, more than 34 times their average salary in 1967, adjusted for inflation. In 2017, the average MLB player earned more than 47 times America’s average household income of $86,220.
Average MLB salaries grew by more than 3,200 percent over the last half century, while average household incomes (as reported by the US Census Bureau) grew by 68 percent. The increase in MLB salaries has started to slow only over the past decade or so—which mirrors the growth (or lack thereof) in average household incomes over that same period.
The growth in MLB average salaries can be tied to various factors. The introduction of free agency in 1976 appears to be a big turning point (the introduction of free agency is an interesting story itself). Player strikes, salary arbitration, and revenue growth from new ballparks and TV have also played an important role over the past few decades.
Growth in salaries escalated after the players’ strike in 1994 but then slowed after the luxury tax was introduced in 2003. (The intention of the luxury tax is to prevent wealthy teams from outspending their less lucrative competitors and has led some to argue it has turned into a “de facto salary cap.”) Between 2003 and 2017, the average MLB salary grew by 29 percent, still faster than the 9 percent growth in average household incomes.
Comparing MLB players’ salaries with average Americans’ is just one way to grasp the growing divide between the wealthiest Americans and the lower and middle classes, which Urban Institute researchers have studied and explained.
Wealth inequality is greater than income inequality
A recent study by the Federal Reserve Bank of Cleveland suggests that income gaps are the primary driver behind the racial wealth gap. The study found that the average wealth of households with a head identifying as black was $140,000 in 2016, while the average for white-headed households was $901,000, nearly six and a half times greater.
Wealth inequality matters to everyone and affects generations of families. If you own a home, that wealth can be passed on to your children, but if you are unable to purchase a home because of systematic discrimination, for example, that affects not just you and your family but also the generations that follow.
Urban researchers have also examined the short- and long-term effects of debt on wealth and inequality. Having debt in collections can lower households’ credit scores, which can have a deep, long-lasting effect on their finances and make it harder and more expensive to access credit.
We’ll likely see many proposals to address income and wealth inequality as we head deeper into the presidential election season (though, probably not as we get deeper into the baseball season).
The debate over how best to combat increasing inequality—from training a skilled workforce to applying new technologies—will continue. As the conversation progresses, Urban researchers will provide evidence and insight to help guide the way.
The primary workflow of many organizations goes through the Microsoft Office system. They use PowerPoint for presentations, Word for documents and reports, and Excel for data analysis and visualization. For some of these organizations, applying their branding styles and guidelines to their slides, reports, and graphs is typically a manual and extraordinarily tedious process. That’s I’ve teamed up with Jon Peltier from PeltierTech to create a new Excel add-in tool that will automatically apply your organization’s styles and guidelines to your visualizations.
Before we describe how the tool works, let’s first define
the problem we are trying to solve.
For the most part, the Microsoft suite is a manual set of
tools. Yes, you can use some coding to try to streamline certain processes, but
for the most part, it is a manual toolset that requires the user to physically
create something and move objects around within and between tools. You can
create certain templates and themes, but they tend to be limited in scope and
Let’s say you have a fairly limited data visualization style
guide your organization follows with just a few guidelines:
A specific color palette
A specific font
Title in that specific font, left-aligned with
the y-axis, bolded and at an 16pt size.
A logo in the bottom-right part of every chart.
Using the existing Theme
menus, you can define the color palette and font so that every chart will
use them. But what about the next three items on your list? The default
location for a chart title in Excel is centered over the chart in a 14pt size.
You could create each chart and save it as a Template, but these templates are not very flexible.
That’s where this new add-in comes in. What we have created
is a new tab on the Excel ribbon that is integrated with the existing Excel
charting menu and also applies your organization’s branded styles. Not only
will your organization’s colors and fonts be applied correctly, but we can add
a placeholder for a title in the exact location and size you want; logo in the
right spot; a placeholder for Sources and Notes; different sizes for different
purposes; and exporting options to save your charts in different formats such
as pictures and for PowerPoint and Word.
We have also created options that will allow you to create charts
that are not in the standard Excel menu, including a Slope Chart and Dot Plot.
We can also add other chart types, such as cycle plots and small multiples.
All of this is built in the Visual Basic for Applications
(VBA) programming language, which means it can be installed manually or pushed
from a central IT department. And we can customize any of the features—want the
title bold, 18pt, and right-aligned? Fine, we can set that up for you. Want
your logo in the bottom-left corner? Yep, we can do that too.
So instead of having people in your organization manually apply different styles, adding and moving around text boxes, copying and pasting between Office tools, we believe this add-in will streamline that entire process. It will make Excel charts look better, apply the styles more consistently, and, importantly, save people time (and frustration).
At the moment, we are planning to sell customized versions
to individuals, groups, and organizations. We can also help create a specific
data visualization style guide, like the ones collected here, and then import those
specific styles into the add-in. If you’re interested in learning more, please contact us and use the “Excel
Add-In” in the drop-down menu.
In the meantime, here are a few videos to help you better understand what the add-in offers:
Introduction to the Excel Chart Styles Add-In - YouTube
I started creating Excel versions of the graphs in the Financial Times’ Visual Vocabulary a couple of weeks ago and have successfully completed all but 13. The FT’s graphic library was based on the Graphic Continuum work Severino Ribecca and I have been working on for a few years now. Neither claims to have the “answer” for the “right” graph you should use or to even include all of the possible graphs, but they both serve to help you explore other possibilities for your data visualization work.
Today, I’m publishing the Visual Vocabulary graphs across eight separate Excel files in a single .zip file you can purchase and download for $10 at my PolicyViz Shop. Proceeds from your purchase primarily go to help support the PolicyViz Podcast, including audio editing and transcription services.
In the package, you will find eight separate Excel files, each one corresponding to a different category of the Visual Vocabulary. I have not included specific step-by-step instructions in those files to help you construct the charts, but I have added notes here and there to help you recreate them. I have also included some text boxes with VBA code I use to simplify the creation or styling process–VBA is not necessary to create any of these charts, but it can make some of the more tedious tasks faster.
Creating these charts in Excel adds to the growing body of Visual Vocabulary graphs created in other tools. Andy Kriebel has created most of the graphs in Tableau (and nicely sent me his workbook so I could use some of his datasets), Pratap Vardhan created a version using Vega, and Jason Thomas built a PowerBI version.
Like other data and data visualization tools, Excel has both its pros and cons. My goal in creating this resources is to demonstrate that it is possible to create different kinds of charts not in the standard Excel menu and by providing you with these templates, you will hopefully be able to create your own versions with your own data.
If you’re interested in learning more about creating data visualizations in Excel, check out my upcoming Data Visualization in Excel workshop on May 23, 2019 in Washington, DC. You can register on my Eventbrite page.
A couple of weeks ago, I started making the graphs in the Financial Times’ Visual Vocabulary in Microsoft Excel. Based on the Graphic Continuum work I’ve done with Severino Ribecca, the FT’s Visual Vocabulary contains about 70 graphs with a definition of each. I had particular fun figuring out how to create the Arc (or what I call the Arc-Dot) chart in Excel. In this post, I show you how to create the chart in Excel using the distribution of Members in the 116th US House of Representatives.
The Arc-Dot Chart
The Arc-Dot chart lays out dots in a semi-circle to show the
distribution of politicians in different political parties. The version I
create here shows the 435 Members of the 116th U.S. House of
Representatives, of which 235 are Democrats, 197 are Republican, and 3 are
The basic philosophy is pretty straightforward: Use a scatterplot to position points on an x- and y-dimension around a half-circle. Thus, we need to figure out how many rings we want (I’ll use 11 here) and how many dots on each, the total of which sums to 435.
Here’s a quick screenshot of the entire worksheet. I’ll go through each section so you can see how this is built. You can download my Excel file at the end of the post.
The Lookup Table
I set up a little lookup table to the side of my main data
in columns R-Z. The first four columns are where I can play around with setting
up the number of points in each ring, so there isn’t a consistent formula in
these columns. I set the radius of each ring in column X, which I’ll use in my
main data table in a moment. In this version, I add 0.1 to each ring, which I
thought looked good—between that distance and the size of the circles, you’ll
need to decide what looks best.
The most important part of this lookup table is the Scalar series in column Y. The formula
in these cells is =180/(S3-1), which is 180-degrees divided by the number of
dots in each ring minus one. We’ll use these scalars to push the x- and y-positions
of each dot to the right.
The Main Data Table
Let’s now go to the main data table, in columns A-P, and
define the first few:
Column A: The observation number. This counts up
by one to 435 observations.
Column B: Ring Number. Cell B3: =VLOOKUP(A3,$U$2:$V$13,2,1).
I use an approximate match VLOOKUP formula to assign the Ring Number to each
Column C: Radius. Cell C3: =VLOOKUP(B3,$R$2:$Y$13,7,0).
Again, an approximate match VLOOKUP to assign the radius for each ring.
Column D: Scalar. Cell D3: =VLOOKUP(B3,$R$2:$Y$13,8,0).
We built this scalar earlier in the lookup table, so I just use an exact match
Column E. Dummy. Cell E4: =IF(OR(A4=$U$3,A4=$U$4,A4=$U$5,A4=$U$6,A4=$U$7,A4=$U$8,A4=$U$9,A4=$U$10,A4=$U$11,A4=$U$12,A4=$U$12),0,1).
I want to place a zero in the first occurrence of each ring and a one
thereafter, so this formula (there is probably a better/more concise way to do
this) compares the observation number in column A to the Cumulative values in
column U. (Note: I hard-coded cell E3 here).
Column F. Number. Cell F4: =IF(E4=0,0,D4+F3). The
points in each ring will start at zero (the first observation in each ring) and
then stack up using the Scalar value calculated earlier. Here then, the first
observation in each ring is zero and each subsequent cell adds the scalar
Now we start moving things into polar coordinates to build
Column G. Radians. Cell G4: =(F3/180)*PI(). The
basic circle to convert to radians.
Column H. X-position. Cell H4: =C3*COS(G3).
Multiply the radius of each ring by the cosine of the radians.
Column I. Y-position. Cell I4: =C3*SIN(G3).
Multiply the radius of each ring by the sine of the radians.
As you can see in the screenshot, the x-position
of the first observation of each ring is equal to the radius and the y-position
You could simply plot the series in columns H and I, and
then color each of the 435 circles by hand, but that seems tedious. Instead,
let’s use some formulas to create three separate series for each group
(Democrats: xDEM, yDEM; Republicans: xREP, yREP; Vacancies: xVAC, yVAC) and
plot them simultaneously.
We need a couple more lookup tables to do this. I did this
semi-consistently in columns AB-AF by setting the number of Democrats in each
ring by dividing the number of points in each row in half. I decided that I
wanted the vacancies in the first, fourth, and eighth row, so set those ones
manually. The number of Republicans is then just the residual.
I create a cumulative count of the points in each ring in columns
AH-AL. I’ll use these to assign the x- and y-positions for each of the three
Column AH. Repeats each ring number three times.
Column AI. Repeats each party/group three times.
Column AJ. Cell AJ3: =HLOOKUP(AI3,$AD$2:$AF$13,AH3+1).
Uses an HLOOKUP formula and uses the “Splits” lookup table to assign the number
of dots in each ring for each party.
Column AK. Cell AK: =AK3+AJ3. This cumulatively
adds the position values (cell AK3 is hard-coded to zero).
Column AL: Repeats an initial for each group;
I’ll use this in a VLOOKUP in a moment and the initial is used just for
Now that we have these two lookup tables, let’s go back to
our main data table.
Column J. Cell J3: =VLOOKUP(A3,$AK$3:$AL$36,2,1).
Use a VLOOKUP formula to look up whether each observation (in column A) will be
a Democrat, Republican, or Vacant (as initials).
Columns K-P. Cell K3: =IF(J3=”D”,H3,NA()).
For the rest, we use a simple IF formula to check the initial of each party and
if it matches the condition, then we assign the x- and y-position; if not, it
is assigned an “#N/A”, which Excel will then ignore when the chart is built.
Build the Chart
Phew! Still with me? I know, this is a lot, but we’re almost
there. We’ll now create a scatterplot with three series (columns K-P). I set
the size of each circle marker to 12, color and fill as is usual. This is now
easier to do because I can select each series as opposed to doing each point
one by one.
I could add the middle label (“435”) with a text box, but I
prefer to use data, so I have a final scatterplot series with an x-position of
0 and a y-position of 0.36 (manually set where I want it). The marker is hidden
and a data label is added with the name of the series and sized as I want it.
I admit, this is not the easiest chart to make in the world,
but now that it’s done, it’s relatively easy to automate, though adding more
groups is kind of a hassle. You may not need all of the formulas as I’ve laid them out here if you want to do
some of this manually, but I find that setting the whole thing up first like
this makes life easier down the road.
If you’d like to make your own Arc-Dot chart, you can download my Excel file. I’ll publish the entire Visual Vocabulary in Excel next week.
My affinity for tile grid maps is well documented. I think there a nice alternative to the standard choropleth map and even though they introduce their own geographic distortions (such as South Carolina placed to the east of North Carolina), they are generally easy to create and easy to read.
Given my creation of a US tile grid map, European tile grid map, and the ill-conceived global tile grid map, it was time to bring it a little closer to home. I’ve now made a tile grid map of the 25 zip codes in Washington, DC. (Hat tip to Ann Emery for reminding me that I had created this some time ago and had failed to post it.)
As with all tile grid maps, there are a few issues with this
I haven’t quite nailed down the curvature on the
western edge along the Potomac.
Some zip codes vary greatly in size, so they are
hard to align. For example, in this map, zip codes 20005, 20006, and 20036 (“A”,
“B”, and “C”) are basically on top of each other. I can’t really do that in the
I think the biggest issue is that I didn’t get
the three large zip codes in Anacostia (20032, 2002, 20019) to sit on their own
along the south-eastern edge of the District.
How to Make It
If you’ve read any of my previous tutorials on the tile grid
map, you’ll notice that I create them using Excel’s Conditional Formatting menu. You can’t tilt the cells in an Excel
spreadsheet, however, so I took a different approach and instead will use a
scatterplot with diamond shapes. (Of course, it is possible to create the map
in the spreadsheet, take a screenshot, and then rotate the image.)
Step 1. Data
Preparation. I have each zip codes and their populations in columns B and
C. I create a little look-up table to place them into four separate bins. I use
a VLOOKUP formula in column D to pull out the Group numbers from my look-up
The values in that look-up table could be quartiles or
quintiles or anything really.
Step 2. Data Series.
To make this whole thing a little bit easier, I’m going to plot four separate
series (to coincide with the four separate groups as defined in the look-up
table). By using separate series, the colors in the chart will update
automatically without having to make manual changes to the colors each time. I
use a pretty simple formula in columns E-L; in the first cell of column E, for
example, I use this: =IF($D3=1,$M3,NA()).
This checks the zip code’s group and, if true, places the
“xFull” and “yFull” values in those cells; if false, it inserts an #N/A, which
Excel will ultimately ignore when we create the plot. The “xFull” and “yFull”
values come from playing around with the scatterplot layout. I simply created a
6×6 grid and started messing around to find where each zip code would go.
Step 3. Create the
Chart. Create a scatterplot with the four series, select the points, and
change the Marker Type to a diamond.
Then, increase the size of the diamond so the edges touch each other (in my
version, to 54). Select Marker Colors for
the points and change the Marker Border to
white. Because I’ve used four different series, I need to set the colors four
times, but because this is all built with formulas, when the data or cutoffs
change, I won’t need to do anything to the colors.
Step 4. Add Labels.
If you’re in Excel 2016 (PC version), you can use the custom label option to
label the scatterplots with a custom range. On a Mac or in previous versions,
you need to do this by hand by adding data labels and then manually editing
them to the zip code number.
A quick aside: I’ve also tried creating a tile grid map for Virginia counties, but some of the smaller counties and districts are really troublesome to line up. I may save that for another day. (A square state like Colorado or Wyoming, or a smaller state, would probably be easier.)
So that’s it, really. A scatterplot with large diamonds. Here’s the Excel file if you want to play around. If you have a better layout, please send it along!
I’ve been making my way through Sandra Rendgen’s new book, The Minard System: The Complete Statistical Graphics of Charles-Joseph Minard. It’s a wonderful journey through Minard’s work and in particular, his development of the flow map. While Minard is generally known for his flow maps, especially his Napoleon map, Rendgen argues that engineer Henry Drury Harness actually created the first flow map in 1837 (relating to railroad traffic in Ireland).
As I read Rendgen’s book, it became clear to me that I’m not exactly sure who invented a lot of the graphs we use today. William Playfair is credited with inventing a lot of our most common graph types like the bar, pie, and line chart, but who invented the Word Cloud or the Bump chart?
Some are easier—the Venn Diagram (John Venn), Sankey Diagram (Matthew Henry Phineas Riall Sankey), and the Gantt chart (Henry Gantt)—are named after specific people. Of course, it’s also possible that John Venn didn’t create the first Venn diagram, but he got credited with it.
In an effort to build a catalog of data visualization inventors, I’ve created this Google Sheet with a list of graphs (based on my Graphic Continuum project) and their inventors. I’m inviting you to help me fill in this list with more graphs, inventor names, and sources. I’ve filled in a few, but there’s a ways to go (and for some, I’ve just listed a random website or Wikipedia).
I’m sure others know more much more about the history of data visualization than I, so I’m optimistic we’ll have a nice catalog here. I’ve also added a sheet for you to leave your name and email address so I can send you updates when it gets filled in.
Spoiler alert! This post is going to ruin the end of the movie Avengers: Infinity War. The movie was released in April of this year and grossed more than $2 billion worldwide, so if you haven’t seen it, you probably don’t care about having the ending ruined. But, in case you’re the one person who reads this blog, likes superhero movies, hasn’t seen the movie, and doesn’t want the ending ruined, here’s a gif of the bad guy, Thanos. You can go ahead and close your browser now.
Okay, now that it’s just the rest of us, let me tell you about the ending of Infinity War. The Avengers fight Thanos, the intergalactic bad guy whose goal is to collect all six Infinity Stones and put them in this cool-looking metal glove. With the power from the Stones, Thanos plans on saving the universe by eradicating half of all life.
What would eradicating half of all life in the universe look like? I don’t know how many people live in the entire universe, but I can get estimates of the number of people who live on Earth. So, just for kicks, I simulated what the world’s population would like if you randomly killed half of the roughly 7 billion people on the planet.
The simulation was relatively easy. I downloaded population estimates for each country in 2017 from the World Development Indicators database at the World Bank. I assigned a uniformly-distributed random number (bounded between 0 and 1) to each country and multiplied it by each country’s 2017 population. I then calculated the gap between this new global population and the actual 2017 population. I then uniformly redistribute the gap between to hit my target of half the 2017 population. In other words, if I initially kill 100 million people too many, I uniformly put them back in countries across the world, making sure that the population in no country goes negative or above its 2017 population. Once the framework is set up, I ran 200 simulations in the same way.
Notice that I’m not stratifying by country type, region, or demographics. I don’t know Thanos’ training in statistics, but I, like others, have wondered what his sampling framework looks like. Stratifying by location or demographics could ensure that no category (in my case, country) completely disappears. But, I digress. Anyways…
Visualizing the Results
The challenge then comes in visualizing the results. My instinct is to show a choropleth map and then animate through the 200 simulations. Populous countries like China and India become problematic because even in simulations where they lose lots of people, they are still very large countries. There are nearly 1.4 billion people living in China—even if half disappear, it is still the second-most populous country in the world (behind only India), still more than double the number of people in the United States. So most maps would likely show China and India with dark colors and little color variation across the rest of the world.
The other way to deal with these large countries is to take the natural log of everything. I generally shy away from using logs mainly because I’m not sure many people will understand them. But in this case, it might be fine (also because this is just a fun exercise, not a real viz).
The other thing I can do is to combine the map with another visualization type—I thought histograms work nicely. The histogram gives you a good view of the distribution of countries at different population levels and how they change between the different simulations. (I also considered a morphing cartogram, but I didn’t feel like learning how to do it in D3 or some other programming language.)
All that being said, I created dashboard in Tableau with a histogram paired with a choropleth map. I clicked play and recorded the results in the videos below. The first uses population measured in levels with the histogram above and map below. The second measures everything in logs and I include a third graph in the top-right of the number of countries with no population. You can see more variation in the map with logs. In both cases, I highlight China and India.
Avengers Infinity War Data Visualization - World (Levels) - YouTube
Avengers Infinity War Data Visualization - World (Logs) - YouTube
Because I live in the US and have US population data sitting on my computer, I did the same thing for the US. While California, Texas, Florida, and New York are all large states, they are not magnitudes larger than other states in the country. Thus, using population levels seems to work.
Avengers Infinity War Data Visualization - US - YouTube
You can see all of the US simulations in this smoothed histogram. The dark blue line represents the actual 2017 population, and you can see how the population gets pushed to the left side.
I’m not sure what the upcoming Captain Marvel movie will bring, but it promises to give us some hope that the Avengers will be able to save all of us in the Avengers 4 movie coming next year. Maybe there will be another visualization to make.
These are all great books, ones that have helped shape my thinking about how to best communicate data and create visualizations. And I’m going to give them away. I’ll also throw in a copy of my own book, so that you can give it to someone you think can use it.
-Post your review to Amazon and let me know you posted it, either by or on Twitter.
-The lottery will be held on New Years’ Day.
-This contest is without replacement—you can only win once. -If there are fewer than 40 reviews, no one wins. In other words, if only four people post reviews, they don’t each win a set of books. It’s got to be (at least) 40 reviews. So write your review and encourage your friends and colleagues who’ve used the book to write a review.
I hear lots of great stories about how my book has helped people do a better job communicating their work. So if my book as helped you elevate your skills, I’m asking you to please share the love by sharing those success stories.
That’s it. 40+ reviews. Random drawing. Lots of great books. For free. New Year’s Day.