The Bloomberg team has come up with a few goodies lately. I was captivated by the following graphic about the ebb and flow of U.S. presidential candidates across recent campaigns. Link to the full presentation here.
The highlight is at the bottom of the page. This is an excerpt of the chart:
From top to bottom are the sequential presidential races. The far right vertical axis is the finish line. Going right to left is the time before the finish line. In 2008, for example, there are candidates who entered the race much earlier than typical.
This chart presents an aggregate view of the data. We get a sense of when most of the candidates enter the race, when most of them are knocked out, and also a glimpse of outliers. The general pattern across multiple elections is also clear. The design is a stacked area chart with the baseline in the middle, rather than the bottom, of the chart.
Sure, the chart can disappoint those readers who want details and precise numbers. It's not immediately apparent how many candidates were in the race at the height of 2008, nor who the candidates were.
The designer added a nice touch. By clicking on any of the stacks, it transforms into a bar chart, showing the extent of each candidate's participation in the race.
I wish this was a way to collapse the bar chart back to the stack. You can reload the page to start afresh.
This elegant design touch makes the user experience playful. It's also an elegant way to present what is essentially a panel of plots. Imagine the more traditional presentation of placing the stack and the bar chart side by side.
This design does not escape the trade-off between entertainment value and data integrity. Looking at the 2004 campaign, one should expect to see the blue stack halve in size around day 100 when Kerry became the last man standing. That moment is not marked in the stack. The stack can be interpreted as a smoothed version of the count of active candidates.
I suppose some may complain the stack misrepresents the data somewhat. I find it an attractive way of presenting the big-picture message to an audience that mostly spend less than a minute looking at the graphic.
Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.
U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.
Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to take a more nuanced analysis of group differences.
The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry is shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.
This is the histogram for arts, entertainment and recreation.
The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation that pays the average female about 40% more than the average male.
This is the histogram for the public sector.
It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.
The second part of the visualization look at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, that frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)
On the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.
The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.
For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.
To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.
That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.
In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.
The net sum of the bar lengths is a plausible measure of gender inequality.
The industries are sorted from the ones employing fewer women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart - the small multiples of pie charts, the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.
In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.
The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.
Earlier this month, the bombs in Sri Lanka led to some data graphics in the media, educating us on the religious tensions within the island nation. I like this effort by Reuters using small multiples to show which religions are represented in which districts of Sri Lanka (lifted from their twitter feed):
The key to reading this map is the top legend. From there, you'll notice that many of the color blocks, especially for Muslims and Catholics are well short of 50 percent. The absence of the darkest tints of green and blue conveys important information. Looking at the blue map by itself misleads - Catholics are in the minority in every district except one. In this setup, readers are expected to compare between maps, and between map and legend.
The overall distribution at the bottom of the chart is a nice piece of context.
The above design isolates each religion in its own chart, and displays the spatial spheres of influence. I played around with using different ways of paneling the small multiples.
In the following graphic, the panels represent the level of dominance within each district. The first panel shows the districts in which the top religion is practiced by at least 70 percent of the population (if religions were evenly distributed across all districts, we expect 70 percent of each to be Buddhists.) The second panel shows the religions that account for 40 to 70 percent of the district's residents. By this definition, no district can appear on both the left and middle maps. This division is effective at showing districts with one dominant religion, and those that are "mixed".
In the middle panel, the displayed religion represents the top religion in a mixed district. The last panel shows the second religion in each mixed district, and these religions typically take up between 25 and 40 percent of the residents.
The chart shows that other than Buddhists, Hinduism is the only religion that dominates specific districts, concentrated at the northern end of the island. The districts along the east and west coasts and the "neck" are mixed with the top religion accounting for 40 to 70 percent of the residents. By assimilating the second and the third panels, the reader sees the top and the second religions in each of these mixed districts.
This example shows why in the Trifecta Checkup, the Visual is a separate corner from the Question and the Data. Both maps utilize the same visual design, in terms of forms and colors and so on, but they deliver different expereinces to readers by answering different questions, and cutting the data differently.
Today we return to the basics. In a twitter exchange with Dean E., I found the following pie chart in an Atlantic article about who's buying San Francisco real estate:
The pie chart is great at one thing, showing how workers in the software industry accounted for half of the real estate purchases. (Dean and I both want to see more details of the analysis as we have many questions about the underlying data. In this post, I ignore these questions.)
After that, if we want to learn anything else from the pie chart, we have to read the data labels. This calls for one of my key recommendations: make your charts sufficient. The principle of self-sufficiency is that the visual elements of the data graphic should by themselves say something about the data. The test of self-sufficiency is executed by removing the data printed on the chart so that one can assess how much work the visual elements are performing. If the visual elements require data labels to work, then the data graphic is effectively a lookup table.
This is the same pie chart, minus the data:
Almost all pie charts with a large number of slices are packed with data labels. Think of the labeling as a corrective action to fix the shortcoming of the form.
Let's look at all the efforts made to overcome the lack of self-sufficiency.
Here is a zoom-in on the left side of the chart:
Data labels are necessary to help readers perceive the sizes of the slices. But as the slices are getting smaller, the labels are getting too dense, so the guiding lines are being stretched.
Eventually, the designer gave up on labeling every slice. You can see that some slices are missing labels:
The designer also had to give up on sequencing the slices by the data. For example, hardware with a value of 2.4% should be placed between Education and Law. It is shifted to the top left side to make the labeling easier.
Fitting all the data labels to the slices becomes the singular task at hand.
I enjoyed the New York Times's data viz showing how actively the Democratic candidates were criss-crossing the nation in the month of March (link).
It is a great example of layering the presentation, starting with an eye-catching map at the most aggregate level. The designers looped through the same dataset three times.
This compact display packs quite a lot. We can easily identify which were the most popular states; and which candidate visited which states the most.
I noticed how they handled the legend. There is no explicit legend. The candidate names are spread around the map. The size legend is also missing, replaced by a short sentence explaining that size encodes the number of cities visited within the state. For a chart like this, having a precise size legend isn't that useful.
The next section presents the same data in a small-multiples layout. The heads are replaced by dots.
This allows more precise comparison of one candidate to another, and one location to another.
This display has one shortcoming. If you compare the left two maps above, those for Amy Klobuchar and Beto O'Rourke, it looks like they have visited roughly similar number of cities when in fact Beto went to 42 compared to 25. Reducing the size of the dots might work.
Then, in the third visualization of the same data, the time dimension is emphasized. Lines are used to animate the daily movements of the candidates, one by one.
I have a longer article on the sister blog about the research design of a study claiming 420 "cannabis" Day caused more road accident fatalities (link). The blog also has a discussion of the graphics used to present the analysis, which I'm excerpting here for dataviz fans.
The original chart looks like this:
The question being asked is whether April 20 is a special day when viewed against the backdrop of every day of the year. The answer is pretty clear. From this chart, the reader can see:
that April 20 is part of the background "noise". It's not standing out from the pack;
that there are other days like July 4, Labor Day, Christmas, etc. that stand out more than April 20
It doesn't even matter what the vertical axis is measuring. The visual elements did their job.
If you look closely, you can even assess the "magnitude" of the evidence, not just the "direction." While April 20 isn't special, it nonetheless is somewhat noteworthy. The vertical line associated with April 20 sits on the positive side of the range of possibilities, and appears to sit above most other days.
The chart form shown above is better at conveying the direction of the evidence than its strength. If the strength of the evidence is required, we use a different chart form.
I produced the following histogram, using the same data:
The histogram is produced by first locating the midpoints# of the vertical lines into buckets, and then counting the number of days that fall into each bucket. (# Strictly speaking, I use the point estimates.)
The midpoints# are estimates of the fatal crash ratio, which is defined as the excess crash fatalities reported on the "analysis day" relative to the "reference days," which are situated one week before and one week after the analysis day. So April 20 is compared to April 13 and 27. Therefore, a ratio of 1 indicates no excess fatalities on the analysis day. And the further the ratio is above 1, the more special is the analysis day.
If we were to pick a random day from the histogram above, we will likely land somewhere in the middle, which is to say, a day of the year in which no excess car crashes fatalities could be confirmed in the data.
As shown above, the ratio for April 20 (about 1.12) is located on the right tail, and at roughly the 94th percentile, meaning that there were 6 percent of analysis days in which the ratios would have been more extreme.
This is in line with our reading above, that April 20 is noteworthy but not extraordinary.
In a previous post, we learned that top U.S. colleges have become even more selective over the last 15 years, driven by a doubling of the number of applicants while class sizes have nudged up by just 10 to 20 percent.
The top 25 most selective colleges are included in the first group. Between 2002 and 2017, their average rate of admission dropped from about 20% to about 10%, almost entirely explained by applicants per student doubling from 10 to almost 20. A similar upward movement in selectivity is found in the first four groups of colleges, which on average accept at least half of their applicants.
Most high school graduates however are not enrolling in colleges in the first four groups. Actually, the majority of college enrollment belongs to the bottom two groups of colleges. These groups also attracted twice as many applicants in 2017 relative to 2002 but the selectivity did not change. They accepted 75% to 80% of applicants in 2002, as they did in 2017.
In this post, we look at a different view of the same data. The following charts focus on the growth rates, indexed to 2002.
To my surprise, the number of college-age Americans grew by about 10% initially but by 2017 has dropped back to the level of 2002. Meanwhile, the number of applications to the colleges continues to climb across all eight groups of colleges.
The jump in applications made selectivity surge at the most selective colleges but at the less selective colleges, where the vast majority of students enroll, admission rate stayed put because they gave out many more offers as applications mounted. As the Pew headline asserted, "the rich gets richer."
Enrollment has not kept up. Class sizes expanded about 10 to 30 percent in those 15 years, lagging way behind applications and admissions.
How do we explain the incremental applications?
Applicants increasing the number of schools they apply to
The untapped market: applicants who in the past would not have applied to college
Non-U.S. applicants: this is part of the untapped market, but much larger
My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at?
It's packing too much.
I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.
The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates.
The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.
Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.
Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.
The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.
As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?
Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. For the top four most selective groups of colleges, they have been able to progressively attract more applicants. Since class size did not expand appreciably, more applicants result in ever-lower admit rate. Lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses admit rate.
The team at 538 did a post-mortem of their in-season forecasts of NBA playoffs, using Bumps charts. These charts have a long history and can be traced back to Cambridge rowing. I featured them in these posts from a long time ago (link 1, link 2).
Here is the Bumps chart for the NBA West Conference showing all 15 teams, and their ranking by the 538 model throughout the season.
The highlighted team is the Kings. It's a story of ascent especially in the second half of the season. It's also a story of close but no cigar. It knocked at the door for the last five weeks but failed to grab the last spot. The beauty of the Bumps chart is how easy it is to see this story.
Now, if you'd focus on the dotted line labeled "Makes playoffs," and note that beyond the half-way point (1/31), there are no further crossings. This means that the 538 model by that point has selected the eight playoff teams accurately.
Now what about NBA East?
This chart highlights the two top teams. This conference is pretty easy to predict at the top.
What is interesting is the spaghetti around the playoff line. The playoff race was heart-stopping and it wasn't until the last couple of weeks that the teams were settled.
Also worthy of attention are the bottom-dwellers. Note that the chart is disconnected in the last four rows (ranks 12 to 15). These four teams did not ever leave the cellar, and the model figured out the final rankings around February.
Using a similar analysis, you can see that the model found the top 5 teams by mid December in this Conference, as there are no further crossings beyond that point.
*** Go check out the FiveThirtyEight article for their interpretation of these charts.
While you're there, read the article about when to leave the stadium if you'd like to leave a baseball game early, work that came out of my collaboration with Pravin and Sriram.