MeasuringU is a quantitative research firm focused on “the statistical analysis of human behavior and quantifying the user experience.” The MeasuringU blog centers on usability, customer experience, and statistics, making it a valuable customer-analytics blog.
UX research pulls many terms, methods, and conventions from other fields.
Selecting a method is an important first choice in measuring the user experience. But an important next step is understanding the variables you’ll have to deal with when designing a study or drawing conclusions.
Variables are things that change; they can be controlled and measured.
It sounds simple enough, but there are actually different types of variables, and understanding the differences is one of the keys to properly interpreting your findings (and avoiding common mistakes). In this article, I’ll cover five types of variables: dependent, independent, latent, observed, and extraneous/nuisance variables.
Dependent (Outcome) Variables
The dependent variable (often called the outcome variable) is what we hope changes when we change an interface—either through a new design or from fixing problems. Dependent variables are more easily recognized as commonly used UX metrics. For example, in usability tests with tasks, common task-level (dependent variable) metrics include completion rate, task time, and perceived ease (using the SEQ).
There’s more to the user experience than task-performance metrics. Other common dependent variables include people’s broader attitudes. Attitudes towards the brand, ease of use, trust, and appearance are also examples of dependent variables.
These attitudes are operationalized (turned into something you can measure) using questionnaires with rating scales, such as the SUS, SUPR-Q, and UMUX-Lite. Attitudes, when measured properly, can be good predictors of behavior, may explain behavior, and are often much easier to measure than behavior. For example, one of the key drivers of dating website usage and referral behavior is users’ trust in the quality of the people (and few scams) on the dating platforms.
Effective UX research links more traditional task and product metrics to higher level business metrics, such as retention rate, revenue, number of calls to customer support, and referral rates. For example, increasing trust from users on a dating website can be linked to increased time and usage on the platform and the corresponding revenue that it brings.
It’s a common mistake to want a dependent variable to tell you what to fix. After all, if you’re going through the trouble of collecting UX measures, don’t you want the payoff of getting some diagnostic information? You will get diagnostic information, but it usually won’t come from the dependent variables directly.
These metrics DEPEND on the independent variables. Instead of looking to the dependent variables to tell you what to do, look to the independent variables to understand what caused the change and what you can do to hopefully improve the dependent variable (or, if a change was made, whether it led to an improvement). I’ll cover selecting dependent variables in more detail in a future article.
Independent Variables
The independent variables are what you manipulate or change. This manipulation can be specific design changes within an interface or app, such as a redesigned form or new navigation labels. But it can also be alternative versions of the same design (version A or B), a current version and competing version, or different tasks you administer in the same study. Using the language of experimental design, we’d say independent variables have levels, one for each variation.
For example, if we’re comparing two versions of an interface, we’d say the independent variable has two levels (A or B—hence an A/B test). The two levels could also be a current website and competitor website. If we’re testing one website with five tasks, the independent variable would be the tasks with five levels.
It can help to work through examples of how the dependent and independent variables work together.
Example 1: Sixty people who were recently researching flights online were randomly assigned to find the cheapest flight on the Delta Airlines website or the American Airlines website (30 participants on each website). The completion rate was 40% on American and 83% on Delta. All other things being equal, we can deduce that the difference in completion rates is from our manipulation of the independent variable, the website.
Dependent Variable: Completion Rate
Independent Variable: Websites, Two Levels (Delta and American)
Example 2: The same 30 participants who looked for a flight on the Delta website in Example 1 were then asked to select a seat on the Delta website. The completion rate was 50% (half were able to find the seat selector and actually select a seat). We can deduce that the task of finding a seat was harder than finding the cheapest flight on the same website.
Dependent Variable: Completion Rate
Independent Variable: Task, Two Levels (Find Cheapest Flight, Select a Seat)
The dependent variable (completion rate) of 50% can be meaningful by itself because we understand percentages (100% is perfect and 0% is horrible), but the completion rate becomes more meaningful when it’s compared using the independent variable. Because we’ve manipulated the independent variable by having different websites and different tasks, we know that 50% actually isn’t that great: on the same website, the same users had an 83% completion rate for another task. But how do we know what to fix? The completion rate doesn’t tell us that. From observing people attempting the task, we saw that the root cause of failure was an awkward seat selector (clicking the legend opens a new tab; see Figure 1).
Figure 1: The seat-selector legend on the Delta website led to increased task failure.
So why not just dispense with the dependent variables altogether and watch people? There are at least two good reasons you should use dependent variables: 1) The relatively low completion rate (especially compared to the earlier task and competitors) tells us something wasn’t right, and we should look further for problems. 2) If we now attempt to fix the seat-selector legend with a new design (a new independent variable with two levels of old design and new design), we need the completion rate to tell us whether the change had an impact. If the completion rate remains at 50%, we should look for other reasons.
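Differences in completion rates between independent groups, like the two airline websites in Example 1, can also be checked statistically with a two-proportion test. Below is a minimal, stdlib-only Python sketch; the raw counts (25 of 30 on Delta, 12 of 30 on American) are assumptions chosen to match the reported 83% and 40%:

```python
import math

def completion_rate_z(pass1, n1, pass2, n2):
    """Two-proportion z-test comparing completion rates between
    two levels of an independent variable (e.g., two websites)."""
    p1, p2 = pass1 / n1, pass2 / n2
    pooled = (pass1 + pass2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Delta: 25 of 30 completed (~83%); American: 12 of 30 (40%)
z, p = completion_rate_z(25, 30, 12, 30)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these counts the difference is well beyond what chance would produce, supporting the conclusion that the website (the independent variable) drove the change in completion rate.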
Example 3: Fifty recent users of each of the eHarmony and PlentyOfFish dating websites rated their trust of the dating platforms using the 8-item SUPR-Q that includes two questions on trust. The trust scores for eHarmony were about average at the 54th percentile, but those for PlentyOfFish were at the 5th percentile. Because SUPR-Q scores are already being compared to 150+ other websites, we can deduce that not only do current users trust eHarmony more than PlentyOfFish, but they have really poor trust in PlentyOfFish.
Dependent Variable: Trust Scores
Independent Variable: Website, Two Levels (eHarmony, PlentyOfFish)
The dependent variable is made doubly meaningful by using a standardized measure (the SUPR-Q trust score) and through the independent variable showing PlentyOfFish has a trust problem. But how do we know what to fix? In this case, we asked participants to tell us their concerns with the website, and respondents clearly identified the high number of fraudulent profiles.
Latent vs. Observed Variables
When we measure trust, as in the dating website example, or when we measure attitudes toward social media platforms, we can’t see people’s trust. There isn’t a trust X-ray machine that allows us to understand whether people trust what a company says it will (or won’t) do, or whether they trust the people on the platform. Trust, like usability or satisfaction, can’t be observed directly. You can’t see these variables; they’re hidden—or using a fancier term, they’re latent.
To measure latent variables, we have to use something we can see to make inferences about something we can’t see. To measure trust, usability, and other constructs such as satisfaction and loyalty, we typically use rating scales that are the observed variables (observed in that we can observe the responses, not the attitudes). These observed variables are more of the effects or shadows of hidden variables. We use the intercorrelations between the items and advanced statistical techniques, such as factor analysis, latent class analysis (LCA), structural equation modeling (SEM), and Rasch analysis, to be sure what we can observe is a reasonable proxy for what we can’t observe.
When I developed the SUPR-Q’s trust factor (https://digitalcommons.du.edu/etd/1166/), we found two questions that asked about trust and credibility clustered together and differentiated well between websites that had other indicators of high trust or low trust.
Figure 2: Diagram showing how observed questions map to latent constructs of usability, trust, loyalty, and visual appearance
Extraneous (Nuisance) Variables
In the airline website and dating website examples above, we have good evidence that the manipulation of the independent variables (the website or task) caused the changes in the completion rate or trust scores. But other things might be causing, or at least interfering with, those changes. For example, even though people were randomly assigned to the different airline websites, it could be that some participants had more experience on American or on Delta, and this prior experience affected their completion rates.
This unintended effect is often called an extraneous or nuisance variable. Extraneous variables aren’t things we’re manipulating but they may distort our dependent variables. Prior experience may also be affecting the trust scores on the dating websites. Even though we only used existing users of the websites, the experience could be higher for eHarmony than PlentyOfFish. Prior experience is a common extraneous variable in UX research.
Market and macro-economic changes can have big impacts on business metrics. For example, we recently helped a financial services company understand how changes in the stock market (a nuisance variable) affected their Net Promoter Score.
Another common extraneous variable is a sequence effect. When you administer multiple tasks to participants in a study, the prior task experience may affect (positively or negatively) performance on subsequent tasks. This may have affected performance on the seat-selector task, as it was administered after finding the flight; although, if anything, it would more plausibly improve performance, since participants had more time to become familiar with the website. The sequence of tasks (first find a flight and then select a seat), while mapping to what happens when people really book a flight, may also cause some dependencies between the tasks. Participants who failed to find the correct flight (lowest cost and earliest arrival time) may have had more trouble picking seats if their flight had fewer seats.
While you can’t get rid of extraneous variables, there are ways of handling them when you design your study (such as counterbalancing the order of tasks or websites) and when drawing conclusions (such as statistically controlling for prior experience). One of the best things you can do is measure them: have a good measure of prior experience (as we did with the dating and airline studies) and change the order in which tasks or websites are presented so you can see whether order has an effect on the dependent variables.
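Counterbalancing task order is straightforward to script. Here’s a minimal sketch that rotates participants through every possible task order so each order is used equally often (the participant IDs and tasks are illustrative):

```python
from itertools import permutations

tasks = ["find cheapest flight", "select a seat"]
orders = list(permutations(tasks))  # every possible task order

# Rotate through the orders so each is assigned equally often,
# letting us check later whether task order affected the metrics.
participants = [f"P{i + 1}" for i in range(6)]
assignments = {p: orders[i % len(orders)] for i, p in enumerate(participants)}

for p, order in assignments.items():
    print(p, "->", order)
```

With two tasks there are only two orders; with more tasks, a balanced Latin square keeps the number of orders manageable while still controlling for sequence effects.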
Dependent variables depend (change) on how the independent variables are manipulated or selected. Dependent variables can be observed directly, such as task times or completion rates, or be latent (hidden) variables that are measured indirectly, often using questionnaire items from the SUS or SUPR-Q. Don’t look to your dependent variables alone to tell you what to do or fix. Instead understand how changes to the independent variable (such as comparing designs) change the dependent variable (for better or worse).
Each time you set out to measure the user experience, think about what extraneous variables may be affecting your dependent variables (such as prior experience or sequence effects) and be sure to measure them. Here’s a mnemonic to remember: Dependent variables are what you get, independent variables are what you set, and extraneous variables are what you can’t forget (to account for).
PURE is an analytic—as opposed to empirical—method, meaning that it doesn’t use data collected directly from observing participants (not to be confused with analytics). Instead, evaluators (ideally, experts in HCI principles and the product domain) decompose tasks users perform through an interface and rate the difficulty on a three-point scale to generate a task and product PURE score. The output is both an executive-friendly dashboard and a diagnostic set of issues uncovered as part of the task review (as shown in Figure 1).
Figure 1: An example PURE scorecard for a product experience.
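To make the scoring concrete, here’s a simplified sketch of how step-level difficulty ratings might roll up into a task score. The aggregation rule shown (summing step ratings and flagging any step rated 3) is an assumption for illustration, not the full published PURE procedure:

```python
# Hypothetical step ratings for one task, each on PURE's 3-point
# difficulty scale (1 = easy ... 3 = hard). Step names are made up.
step_ratings = {
    "open seat map": 1,
    "interpret legend": 3,
    "select a seat": 2,
}

# Simplified roll-up: a task score is the sum of its step ratings
# (lower is better), and any step rated 3 is flagged for review.
task_score = sum(step_ratings.values())
hard_steps = [step for step, rating in step_ratings.items() if rating == 3]

print("Task PURE score:", task_score)
print("Steps needing attention:", hard_steps)
```

This mirrors the dual output described above: a number for the dashboard and a diagnostic list of problem steps for the team.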
We’ve found PURE scores highly correlate with task- and study-level metrics collected from users across several websites and apps. PURE has quickly become one of our more popular training courses since the method was introduced. As part of our training (and upcoming book), we discuss the origins of the method: from the factory floor to Xerox PARC and CU Boulder.
1900s: Time and Motion Studies
At the end of the 1800s, western economies were transforming as part of the industrial revolution as workers left farms to work in increasingly mechanized factories.
Break each task into its elemental steps.
Time how long each element takes (averaged across people under normal working conditions).
Create a standard set of times for common steps, such as the time to lift a part from the floor to the table, the time to put on a screw and bolt, and the time to return a part to the floor.
Add the times for an estimate of the total task time.
Despite some concerns that time and motion studies would be used as a way to punish workers for slow work, they ultimately proved a successful tool for improving the efficiency (and safety) of work by removing unnecessary steps (wasted time and motion).
1970s: GOMS and KLM
In the 1970s, researchers at Xerox PARC and Carnegie Mellon extended the idea of decomposing tasks and applied time and motion thinking to human-computer interaction. They developed GOMS (Goals, Operators, Methods, and Selection Rules), a technique likewise meant to reduce unnecessary actions and make software more efficient for the burgeoning computer-using workforce.
GOMS was described in the still highly referenced (but dense) text The Psychology of Human-Computer Interaction by Card, Moran, and Newell. GOMS itself represents a family of techniques, the most familiar and accessible of which is Keystroke Level Modeling (KLM). Card, Moran, and Newell conducted their own time and motion studies to build a standardized set of times it would take the typical user to perform actions on a computer (without errors). For example, they found the average time to press a key on a keyboard was 230 milliseconds (about a quarter of a second), and they applied Fitts’s law to predict the time it takes to point with a mouse.
With KLM, an evaluator can estimate how long it will take a skilled user to complete a step in a task using only a few of the standard operators (pointing, clicking, typing, and thinking). For a simple introduction to using KLM, see The Humane Interface, p. 72.
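To make the idea concrete, here is a small sketch of a KLM-style estimate. The operator times below are commonly cited values (with the article’s 230 ms keying time); treat the specific action sequence and numbers as illustrative, not a definitive model:

```python
# Commonly cited KLM operator times in seconds; K uses the article's
# 230 ms average keying time, while P, H, and M are the standard
# published values (assumed here for illustration).
OPERATORS = {
    "K": 0.23,  # keystroke or mouse-button press
    "P": 1.10,  # point at a target with the mouse
    "H": 0.40,  # home hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}

def klm_estimate(sequence):
    """Sum operator times for an error-free, skilled-user action sequence."""
    return sum(OPERATORS[op] for op in sequence)

# Think, point to a search field, click, home to the keyboard,
# then type a five-letter city name ("Paris"):
est = klm_estimate(["M", "P", "K", "H", "K", "K", "K", "K", "K"])
print(f"Estimated task time: {est:.2f} s")
```

Summing standardized operator times is the same move Taylor’s time and motion studies made on the factory floor, just with keyboard and mouse actions in place of lifting parts.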
KLM has been shown to estimate error-free task time to within 10% to 30% of actual times. These estimates can be made from working products, prototypes, or screenshots without needing to collect data directly from users (which is ideal when it’s difficult to test with users). It has been tested on many interfaces and domains, such as websites, maps, PDAs, and database applications, and was recently updated for mobile interactions [pdf].
1990s: Heuristic Evaluations and Cognitive Walkthroughs
The 1990s gave rise to two other analytic techniques: the heuristic evaluation and the cognitive walkthrough. The heuristic evaluation is still one of the methods most commonly used by UX researchers, although in practice, most people perform a more generic expert review.
In a heuristic evaluation, an expert in usability principles reviews an interface against a set of broad principles called heuristics. These heuristics were derived from analyzing the root causes of problems uncovered in usability tests. Evaluators then inspect an interface to determine how well it conforms to the heuristics and identify shortcomings. The most famous set of heuristics was derived by Nielsen and Molich, but other sets exist.
The cognitive walkthrough is a usability inspection method similar to the heuristic evaluation and developed around the same time. The cognitive walkthrough places more emphasis on task scenarios than the heuristic evaluation does. It was developed by Wharton et al. at the University of Colorado in 1990 [pdf]. Whereas the KLM predicts experienced (error-free) task time, the cognitive walkthrough’s emphasis is on learnability for first-time or occasional users.
As part of conducting a cognitive walkthrough, an evaluator must first identify the users’ goals and how they would attempt to accomplish them in the interface. An expert in usability principles then meticulously goes through each step, identifying problems users might encounter as they learn to use the interface.
For each action a user has to take, a reviewer needs to describe the user’s immediate goal and address eight questions and prompts:
First/next atomic action user should take.
How will user access description of action?
How will user associate description with action?
All other available actions less appropriate?
How will user execute the action?
If timeouts, time for user to decide before timeout?
Execute the action. Describe system response.
Describe appropriate modified goal, if any.
It may come as no surprise that one of the biggest complaints about using the CW method is how long it takes to answer each question. Wharton et al. later refined the questions to four:
Will the user try to achieve the effect that the subtask has?
Will the user notice that the correct action is available?
Will the user understand that the wanted subtask can be achieved by the action?
Does the user get appropriate feedback?
Spencer (2000) further reduced the number of questions in his Streamlined Cognitive Walkthrough [pdf] technique in which you ask only two questions at each user action:
Will the user know what to do at this step?
If the user does the right thing, will they know that they did the right thing, and are making progress towards their goal?
Spencer found that by reducing the number of questions and setting up ground rules for the review team he was able to make CW work at Microsoft.
For more information on how inspection methods compare, see the paper [pdf] by Hollingsed and Novick (2007) that also contains one of the largest collections of references on usability inspection methods.
2016: The PURE Evolution
PURE both shares and extends the methods of KLM, cognitive walkthroughs, and heuristic evaluations.
Some common elements shared across these methods and adapted for PURE include
Analytic methods: The PURE method, like the others described here, is analytic, not empirical like usability testing or card sorting. However, even though users aren’t directly observed, the methods are derived based on observations of user behavior such as common mistakes or the time it takes skilled users to perform actions.
Balance of new and existing users: Whereas KLM focuses on experienced users and the cognitive walkthrough focuses on new users, the PURE method is more flexible in that you can apply it to users across the spectrum from new to experienced. As part of applying the PURE method, you can calibrate your evaluation based on the technical skill and product/domain knowledge of the user.
A focus on tasks: PURE, like KLM and cognitive walkthroughs, focuses on the key tasks users are trying to accomplish in an interface. Even the heuristic evaluation method has morphed to include a focus on tasks. With software, apps, and websites getting increasingly complicated, it’s essential to understand the critical tasks users are trying to perform, using methods like a top-tasks analysis.
Multiple evaluators: Where there is judgment there is disagreement. But different perspectives can be an advantage, as combining different qualified perspectives about an interface exploits the wisdom of the crowd. Using at least two (usually no more than five) evaluators enhances the efficacy of heuristic evaluations and PURE.
Double experts help: Evaluators who know something about human computer interaction and design best practices AND the domain being studied will often be the best evaluators. These “double experts” will likely better know which terms are familiar to the audience and will be better acquainted with the tasks and user goals—an advantage with heuristic evaluations. For example, an evaluator with experience in HCI principles AND accounting software and finance will likely be better at identifying problems and generating accurate PURE scores.
A relatively new addition is the HEART framework, derived by a team of researchers at Google. And when Google does something, others often follow.
HEART (Happiness, Engagement, Adoption, Retention, and Task Success) is described by Rodden et al. in a 2010 CHI Paper [pdf], which was written after applying the framework to 20 products at Google.
The HEART framework is meant to help integrate behavioral and attitudinal metrics into something scalable for large organizations (such as Google) that have many software and mobile apps.
Of course, most businesses don’t have a shortage of metrics. The problem, the authors argue, is knowing the right way to use those metrics to manage the user experience of many products.
For example, PULSE metrics (Page views, Uptime, Latency, Seven-day activity, and Earnings) are heavily tracked but indirectly related to the user experience and thus hard to use as dependent variables when making design changes. For example, are more page views a result of increased advertising or a problem with the design? And a perennial question: Is more time logged on the app better or worse?
Nothing helps adoption of a framework like a good acronym and some puns. You can’t have a PULSE without a HEART and the HEART is what Rodden et al. propose to make the most of the PULSE metrics. It’s composed of:
Happiness: Attitudinal metrics, such as satisfaction and perceived ease of use, typically collected with surveys.
Engagement: These metrics, such as the number of visits per user per week or the number of photos uploaded per user per day, provide some measure of engagement. These may not apply for enterprise users, though, especially when the user doesn’t get to decide whether they must use the technology.
Adoption and Retention: New users per time period (e.g. month) describe the adoption rate whereas the percent of users after a specific duration (e.g., 6 months or 1 year) would be the retention rate.
Task Success: These are the commonly used metrics—such as task-completion rate, task time, and errors—that are already collected during most usability tests.
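As a toy illustration of the Adoption and Retention definitions above, here’s how they might be computed from raw records (the cohort data are made up):

```python
# Hypothetical records for one signup cohort: each tuple is
# (signup month, still active six months later?).
cohort = [
    ("2024-01", True),
    ("2024-01", False),
    ("2024-01", True),
    ("2024-01", True),
]

# Adoption: how many new users signed up in the period.
new_users = len(cohort)

# Retention: the fraction of that cohort still active after
# the chosen duration (six months here).
retained = sum(1 for _, active in cohort if active)
retention_rate = retained / new_users

print(f"Adoption: {new_users} new users; "
      f"6-month retention: {retention_rate:.0%}")
```

Real pipelines would derive the “still active” flag from event logs rather than a hand-built list, but the cohort logic is the same.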
The PULSE and the HEART are woven into more conventional business ideas: Goals, Signals, and Metrics (but without a catchy acronym). Rodden et al. describe the GSM as:
Goals: Critical tasks users want or need to accomplish.
Signals: Where you will find metrics (such as usage data and surveys) that would indicate whether the goals were met.
Metrics: The PULSE metrics, such as number of users using a product over a period of time.
Together this has been presented as a matrix with HEART, PULSE, and GSM as shown in Figure 1 (with examples we’ve added).
Figure 1: The HEART, PULSE metric examples integrated within a GSM matrix.
HEART Builds on Other Frameworks
There’s a temptation to dismiss new frameworks or methods on the one hand or jump after the next shiny thing on the other (especially when it’s advocated by successful companies). I see methods, frameworks, and metrics such as HEART as adaptations rather than replacements for existing ones. A good new method should build on existing ones and evolve them to be more efficient, effective, or specialized.
While the authors don’t explicitly link HEART to other frameworks and methods, it’s not hard to see some connection. For example, the HEART framework certainly isn’t the first to associate UX metrics to business metrics and user goals. And over 30 years ago, using metrics in the design phase was one of the key principles identified by Gould and Lewis.
Many organizations are already collecting the NPS and conducting regular usability testing that contains the task-based metrics of completion rates and task times. We were actually collecting usage data at the same time as the HEART publication. These usability metrics are part of the ISO 9241 pt 11 definition of usability. Finally, the top-tasks analysis is also a great way of defining critical user goals (the G in the GSM).
The Technology Acceptance Model (TAM) suggests that people will adopt and continue to use technology (the EAR of the HEART) based on the perception of how easy it is to use (H), how easy it actually is to use (T), and whether it’s perceived as useful (H). The SUS, SEQ, and the ease components of the UMUX-Lite and SUPR-Q are all great examples of measuring perceived ease (and bring the Happiness to the HEART model).
I’ve mapped what I see as the overlaps among the Theory of Reasoned Action (TRA), the TAM, and other commonly collected UX metrics in Figure 2.
Figure 2: Linking the HEART framework with other existing models (TAM, TRA, and metrics that link them).
HEART Is Light on Validation
As we encounter new methods, we like to look for any data that validates why the method works and how it might be better than, or at least comparable to, existing ones. The foundational paper offered some examples of how to apply aspects of HEART, but it didn’t offer much in the way of validation data. For example, it would have been good to see stats such as how teams at Google that used the HEART method achieved their goals x% faster, had more engaged employees, or even shipped more successful (more revenue, less time to market) products. While this sort of data can be hard to collect, Google has conducted and published similar comparison studies on its personnel hiring decisions. If you have any data, please share it!
But even without data showing that the HEART method is better or more effective, it does have one thing going for it: It’s a relatively simple framework with a memorable name. Executives and product teams are often so overwhelmed with data and methods that simple and good enough get used more than complicated and perfect. This is especially important when trying to measure many disparate products across large organizations (such as enterprise software).
Simplicity and memorability are likely among the reasons the NPS became so popular. The “How likely are you to recommend?” question has been asked for decades as part of measuring customer word of mouth. Yet it took Fred Reichheld and the well-branded Net Promoter Score to get executives to care about customers’ behavioral intentions. If the HEART framework has a similar effect of making UX teams more successful at using data to make design decisions, then I can learn to love the HEART too!
Summary and Takeaways
A review of a framework for measuring the user experience on a large scale revealed:
The HEART framework. HEART is a way to combine attitudinal measures (Happiness) with adoption and usage statistics (Engagement, Adoption, and Retention) and the task-based measures already collected in usability testing (Task Success), such as task-completion rate and task time.
The HEART framework builds off existing models. Even though it wasn’t explicitly linked to existing models, HEART extends the thinking of the Technology Acceptance Model (TAM), which is itself an extension of the Theory of Reasoned Action and uses many common UX ISO 9241 metrics.
You’re likely already using many of the components. Many organizations are already collecting UX metrics in usability tests, collecting behavioral intentions (e.g., NPS), and measuring usage and retention and even conducting top-tasks analysis to align efforts to user goals.
If the HEART gets you more UX love, use it. If the HEART framework helps get your organization to measure more or better prioritize development efforts based on users’ attitude and behavioral intention then you should love the HEART. Without data showing that the framework is better than others, it doesn’t necessarily mean you should drop your current UX measurement framework (if you have one). If you’ve had success with the HEART, share the love!
Smoking precedes cancer (mostly lung cancer). People who smoke cigarettes tend to get lung and other cancers more than those who don’t smoke. We say that smoking is correlated with cancer. Carefully rule out other causes and you have the ingredients to make the case for causation.
Correlation is a necessary but not sufficient ingredient for causation. Or as you’ve no doubt heard: Correlation does not equal causation. A correlation quantifies the association between two things. But correlation doesn’t have to prove causation to be useful. Often just knowing one thing precedes or predicts something else is very helpful. For example, knowing that job candidates’ performance on work samples predicts their future job performance helps managers hire the right candidates. We’d say that work sample performance correlates with (predicts) work performance, even though work samples don’t cause better work performance.
A common (but not the only) way to compute a correlation is the Pearson correlation (denoted with an r), made famous (but not derived) by Karl Pearson in the 1890s. It ranges from a perfect positive correlation (+1) to a perfect negative correlation (−1), with r = 0 indicating no correlation. In practice, a perfect correlation of 1 is completely redundant information, so you’re unlikely to encounter it.
The correlation coefficient has its shortcomings and is not considered “robust” against things like non-normality, non-linearity, unequal variances, outliers, and a restricted range of values. These shortcomings, however, don’t make it useless or fatally flawed, and it remains widely used across many scientific disciplines to describe the strength of relationships because it’s still often meaningful. It’s something of a common language of association, since correlations can be computed on many kinds of measures (for example, between two binary measures or between ranks).
Returning to the smoking and cancer connection, one estimate from a 25-year study on the correlation between smoking and lung cancer in the U.S. is r = .08, a correlation barely above 0. You may have known a lifelong smoker who didn’t get cancer, illustrating the point (and the low magnitude of the correlation) that not everyone who smokes (even a lot) gets cancer.
But one study is rarely the final word on a finding, and certainly not on a correlation. There are many ways to measure the smoking-cancer link, and the correlation varies somewhat depending on who is measured and how.
For example, in another study of developing countries, the correlation between the percent of the adult population that smokes and life expectancy is r = .40, which is certainly larger than the .08 from the U.S. study, but it’s far from the near-perfect correlation conventional wisdom and warning labels would imply.
While correlations aren’t necessarily the best way to describe the risk associated with activities, they’re still helpful for understanding relationships. More importantly, understanding the details upon which a correlation was formed, and what those details imply, is the critical step in putting correlations into perspective.
Validity vs. Reliability Correlations
While you probably aren’t studying public health, your professional and personal life are filled with correlations linking two things (for example, smoking and cancer, test scores and school achievement, or drinking coffee and improved health). These correlations are called validity correlations. Validity refers to whether something measures what it intends to measure. We’d say that a set of interview questions that predicts job performance is valid, or that a usability questionnaire is valid if it correlates with task completion on a product. The strength of the correlation speaks to the strength of the validity claim.
At MeasuringU we write extensively about our own and others’ research and often cite correlation coefficients. However, not all correlations are created equal, and not all are validity correlations. Another common type is the reliability correlation (the consistency of responses), and there are also correlations that come from the same sample of participants (called monomethod correlations). Monomethod correlations are easier to collect (you only need one sample of data), but because the data comes from the same participants, the correlations tend to be inflated. Reliability correlations are commonly reported in peer-reviewed papers and are typically much higher, often r > .7. The availability of these higher correlations can contribute to the idea that correlations such as r = .3 or even r = .1 are meaningless.
For example, we found the test-retest reliability of the Net Promoter Score is r = .7. Examples of monomethod correlations are the correlation between the SUS and NPS (r = .62), between individual SUS items and the total SUS score (r = .9), and between the SUS and the UMUX-Lite (r = .83), all collected from the same sample of participants. These are also legitimate validity correlations (called concurrent validity) but tend to be higher because the criterion and predictor values are derived from the same source.
Interpreting Validity Correlation Coefficients
Many fields have their own convention about what constitutes a strong or weak correlation. In the behavioral sciences the convention (largely established by Cohen) is that correlations (as a measure of effect size, which includes validity correlations) above .5 are “large,” around .3 are “medium,” and .10 and below are “small.”
Using Cohen’s convention, though, the link between smoking and lung cancer is weak in one study and perhaps medium in the other. But even within the behavioral sciences, context matters. Even a small correlation with a consequential outcome (the effectiveness of psychotherapy, for example) can still have life-and-death consequences.
Squaring the correlation (called the coefficient of determination) is another common way of interpreting the correlation (and effect size), but it may also understate the strength of a relationship between variables, so using the standard r is often preferred. We’ll explore more ways of interpreting correlations in a future article.
I’ve collected validity correlations across multiple disciplines from several published papers (many meta-analyses) that include studies on medical and psychological effects, job performance, college performance, and our own research on customer and user behavior to provide context to validity correlations. Many of the studies in the table come from the influential paper by Meyer et al. (2001).
The blockbuster drug (and TV commercial regular) Viagra has a correlation of r = .38 with “improved performance.” Psychotherapy has a correlation of “only” r = .32 with future well-being. Height and weight, traditionally thought of as strongly correlated, have a correlation of r = .44 when objectively measured in the U.S. and r = .38 in a Bangladeshi sample. That’s not that different from the validity of ink-blots in one study. The connection between the “pulse-ox” sensors you put on your finger at the doctor and actual oxygen in your blood is r = .89. All these can be seen in context with the two smoking correlations discussed earlier, r = .08 and r = .40.
Table 1 shows correlations for several indicators of job performance, including college grades (r = .16), years of experience (r = .18), unstructured interviews (r = .38), and general mental ability (r = .51); the best predictor of job performance is work samples, r = .54. See How Google Works for a discussion of how Google adapted its hiring practices based on this data.
Like smoking, the link between aptitude tests and achievement has been extensively studied. Table 1 also contains several examples of correlations between standardized testing and actual college performance: for Whites and Asian students at the Ivy League University of Pennsylvania (r = .20), College GPA for students in Yemen (r = .41), GRE quantitative reasoning and MBA GPAs (r = .37) from 10 state universities in Florida, and SAT scores and cumulative GPA from the Ivy League Dartmouth College for all students (r = .43).
Customer and User Behavior
I’ve included several validity correlations from the work we’ve done at MeasuringU, including the correlations between intent to recommend and 90-day recommend rates for the most recent purchase (r = .79), SUS scores and software industry growth (r = .74), the Net Promoter Score and growth metrics in 14 industries (r = .35), and evaluators’ PURE scores and users’ task-ease scores (r = .67). Similar correlations are seen in published studies on people’s intent to purchase and purchase rates (r = .53) and intent to use and actual usage (r = .50), as we saw with the TAM.
The lesson here is that while the value of some correlations is small, the consequences can’t be ignored. And that’s what makes general rules of correlations so difficult to apply. My hope is the table of validity correlations here from disparate fields will help others think critically about the effort to collect and the impact of each association.
Summary and Takeaways
This discussion about the correlation as a measure of association and an analysis of validity correlation coefficients revealed:
Correlations quantify relationships. The Pearson correlation r is the most common (but not only) way to describe a relationship between variables and is a common language to describe the size of effects across disciplines.
Validity and reliability coefficients differ. Not all correlations are created equal. Correlations obtained from the same sample (monomethod) or reliability correlations (using the same measure) are often higher (r > .7) and may lead to an unrealistically high correlation bar.
Correlations can be weak but impactful. Even numerically “small” correlations can be valid and meaningful when the impact of the outcome (e.g., health consequences) and the effort and cost of measurement are accounted for. The smoking, aspirin, and even psychotherapy correlations are good examples of what can be crudely interpreted as weak to modest correlations where the outcome is nonetheless quite consequential.
Don’t set unrealistically high bars for validity. Understanding the context of a correlation helps provide meaning. If something can be measured easily and for low cost yet have even a modest ability to predict an impactful outcome (such as company performance, college performance, life expectancy, or job performance), it can be valuable. The “low” correlation between smoking and cancer (r = .08) is a good reminder of this.
Thanks to Jim Lewis for providing comments on this article.
Articles and conference presentations tend to echo what many researchers have experienced: watching a user have a horrible experience with a product but still rate it highly on ease and likelihood to use or recommend scales. How can we put any faith in these forward-looking opinions when that happens?
But there’s a very strong business need to know what current and prospective customers want or will use and purchase.
Other methods, such as the MaxDiff and Kano, are popular with many companies (and our clients) for helping understand what features people want or need. The top-task analysis advocated by Gerry McGovern (and us) also relies on what users say, their opinions and attitudes about what’s important, and how they intend to do things.
Can we trust the measures of people’s attitudes? Do attitudes towards ease, usefulness, trust, and likelihood to use actually predict future behavior? Or should we just ignore what people say?
To understand how well attitudes (what people think and say) predict behavior (what people do) we need a good definition of attitude and understand how its measurement has evolved.
Human attitude has been studied for decades, and one helpful model that’s emerged to describe it is the tripartite (or ABC) model of attitudes. The ABC model, as we’ve written about, deconstructs attitude into three parts: Affective (how people feel), Behavioral intentions (what people intend to do, also called conative), and Cognitive (what people think). Or you can think of attitudes as beliefs, feelings, and intentions (see Figure 1).
Figure 1: Tripartite model of attitude.
The reason attitude has been studied intensely is that it was believed to be the key to understanding why people do what they do (and predict what they will do). It’s a cornerstone of social psychology.
But when experiments started to show that measuring what people think didn’t perfectly (or even modestly) predict what they would do, many researchers questioned the validity of using attitudes to predict behavior.
For example, in the late 1960s, Allan Wicker [pdf] examined 42 experimental studies, which showed at best a weak correlation between attitude and behavior (rarely above r = .30). He concluded:
“It is considerably more likely that attitudes will be unrelated or only slightly related to overt behaviors than that attitudes will be closely related to actions.” (p. 65)
Wicker’s meta-analysis and its stark conclusion were influential: many researchers came to doubt the link between attitude and behavior, and connecting the two lost favor during the 1970s.
But rarely does one study become the final word on a subject, even one based on a meta-analysis. Other researchers re-examined how attitude and behavior were measured in the studies cited by Wicker and identified some shortcomings.
In 1995, Kraus conducted another influential meta-analysis and found reasonable support for the link between behavior and attitude with an average correlation of r = .38 across 88 studies. He also found that not all studies are created equal and several variables would affect (moderate) the relationship between attitude and behavior. For example, correlations were higher (r > .5) when researchers didn’t use college students (the favored captive audience for the academic social scientist) and when the attitude and behavior were measured at the same level of specificity.
Ajzen has shown that part of the reason Wicker’s analyses failed to find strong links between behavior and action was the failure to aggregate behaviors and the use of broad measures to predict specific actions. That is, it can be very difficult to predict individual instances of behavior (Will this one person purchase from Amazon.com this week?) from a general measure of attitude (satisfaction with Amazon). Aggregating across people and time reveals patterns (people more satisfied with the Amazon website tend to purchase more). Even better, matching the question specificity with the behavior (Will you purchase at Amazon.com this week? compared to purchase rates across people) should also improve the relationship.
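Ajzen's aggregation point can be illustrated with a small simulation (the data here is entirely hypothetical: attitudes on a 1–7 scale driving the odds of a repeated binary behavior). Correlating attitude with a single behavioral instance is noisy; correlating it with behavior aggregated across many occasions reveals a much stronger relationship:

```python
import random

random.seed(42)

# Simulate 200 people: each has a latent attitude (1-7) and 20 weekly
# purchase opportunities whose probability rises with that attitude.
people = []
for _ in range(200):
    attitude = random.randint(1, 7)
    purchases = [1 if random.random() < attitude / 10 else 0 for _ in range(20)]
    people.append((attitude, purchases))

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Single-event level: attitude vs. one arbitrary week's purchase (noisy)
r_single = pearson_r([a for a, p in people], [p[0] for a, p in people])

# Aggregated level: attitude vs. total purchases over all 20 weeks
r_agg = pearson_r([a for a, p in people], [sum(p) for a, p in people])

print(r_single, r_agg)  # aggregation yields the much stronger correlation
```

The same underlying attitude–behavior link produces a weak correlation at the individual-event level and a strong one once behavior is aggregated, which is the pattern Ajzen describes.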
To better model the attitude to behavior connection, Ajzen extended the Theory of Reasoned Action (TRA), which was used in the development of the Technology Acceptance Model, into something he called the Theory of Planned Behavior (TPB), as shown in Figure 2.
Figure 2: Ajzen’s model of the Theory of Planned Behavior.
With the Theory of Planned Behavior, Ajzen argues that while attitudes themselves correlate with behavior, attitudes acting through behavioral intention usually predict behavior better. That is, if people hold an attitude (donating blood saves lives) and say they will do something (I will donate blood), they generally will (blood donated), with some caveats.
Other factors, for example the acceptability of the behavior (referred to in the model as subjective norm), will also play a role. In other words, if you want to know whether people will purchase at Walmart.com, ask people how likely they are to purchase at Walmart.com and whether they think it’s acceptable to purchase at Walmart.
The Theory of Planned Behavior is not without criticism as others have argued that beliefs themselves, under some circumstances, may be better predictors of action (as the Kraus meta-analysis showed).
But all theories are wrong and some are useful. The TPB has been shown to be a reasonable model for predicting behavior. While the model will continue to evolve, Ajzen has put forth some compelling evidence that stated intentions can indeed predict behavior.
How Well Do Intentions Predict Behavior?
Ajzen, in his book Attitudes, Personality and Behavior, argues that when focusing on stated intention, behavior can be predicted reasonably well. Across several meta-analyses, and even meta-analyses of meta-analyses by Sheeran (2002), the average intention-to-behavior correlation is r = .53. Using one interpretation of this correlation, intentions can predict or explain about 28% of the variability in behaviors.
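The 28% figure comes from squaring the correlation (the coefficient of determination discussed earlier):

```python
# Squaring r gives the coefficient of determination: the share of
# variance in behavior that stated intentions account for.
r = 0.53  # average intention-to-behavior correlation (Sheeran, 2002)
r_squared = r ** 2
print(f"{r_squared:.0%}")  # prints "28%"
```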
Table 1 shows correlations from several studies Ajzen cited, along with correlations from our own research on attitudes and behavior correlations and one other on predicting product usage.
For example, the correlation between intention to donate blood and actual donation in one study was strong at r = .75. The table also includes our three correlations from likelihood to recommend and reported recommendations over 90 days with correlations ranging from r = .69 (recently recommended product), r = .79 (recently purchased product), to r = .90 (select brands).
In one study, the correlation between stated intention to use technology and actual adoption was r = .5 (see the TAM article).
Table 1: Selection of studies showing correlations between stated intentions and future behavior. * Studies are from Ajzen (2005, p. 100).
Note: The correlation is a measure of the strength of a linear relationship and is often measured using the Pearson r, which can range from no correlation (r = 0) to r = 1 or r = −1, a perfect positive or perfect negative relationship. In general, correlations above .5 are considered “large” in the behavioral sciences.
Table 1 also includes two correlations specifically for people’s likelihood to purchase (will consumers purchase a product). These are from a meta-analysis by Morwitz et al. (2007) that found that intentions predicted purchase behavior best for existing rather than new products, for durable products, specific brands vs. brand category, over shorter periods, and when the data is collected comparatively (product A vs B) rather than by itself.
Intentions weren’t always a good predictor. Most notably they found intentions to purchase new products were a poor predictor (the correlation was non-significant and negative).
Table 1 also includes our earlier analysis of aggregated SUPR-Q scores on future purchase rates. The SUPR-Q actually uses a mix of beliefs and feelings (ease, trust, appearance) and behavioral intention (likelihood to return and likelihood to recommend). It’s likely we would have seen stronger correlations between attitude and behavior if we examined purchase intentions. We’ll investigate ways of measuring purchasing behavior in more detail in future articles.
Do Attitudes Predict Behavior?
This data suggests that, in general, assuming people are acting on their own free will, they will follow through on their intentions. But not always. The correlations shown in Table 1 and across the meta-analyses cited are strong (r > .5), placing them among the strongest effects in the behavioral sciences.
But even a strong correlation still means there will be a lot of discrepancy between intentions and actions, especially when trying to predict individual behavior from general individual attitudes. But just because attitudes aren’t a perfect predictor (that’s an unrealistic bar) doesn’t mean they aren’t useful. Don’t be afraid to ask what people think and will do, but understand the limitations.
Summary and Conclusions
An analysis of the history and measurement of the attitude to behavioral connection found:
Attitude is a compound construct. Think of attitude as being composed of what people think and feel and intend to do. People’s thoughts and feelings affect their behavior. One popular model (the Theory of Planned Behavior) suggests thoughts and feelings (plus social acceptability) influence intentions that in turn affect behavior.
Intentions may predict behavior better than feelings and beliefs. For the most part, people’s intentions are a good (but far from perfect) predictor of behaviors under many circumstances. On average, behavioral intentions can explain a substantial amount of variation in future behavior. Across many studies the average attitude to behavior correlation is r = .53 and in many cases the correlation exceeds r = .7. This relationship even applies to what people will use and purchase.
Attitudes are not perfect predictors of behavior. Don’t expect all attitudes or intentions to always predict behaviors well. Attitudes become a poorer predictor of behavior when you ask about general intentions (intent to exercise) instead of measuring specific actions (e.g., climbing stairs or lifting weights) and when the behavior isn’t stable (when things change between intention and behavior). When asking about intention to use or purchase products, intentions may only be reliable for an existing rather than a new product.
You’ve probably heard of the infamous Stanford Prison Experiment by Philip Zimbardo (especially if you took an intro psych class).
The shocking results had similar implications to the notorious Milgram experiment and suggested our roles may be a major cause for past atrocities and injustices.
You might have also heard about research from Cornell University that found, across multiple studies, that simply having larger serving plates makes people eat more!
It’s hard to question the veracity of studies like these and others that have appeared in peer-reviewed, prestigious journals coming from well-educated, often tenured professors from world-class academic institutions and often funded by the government or other non-profit grants.
Yet the methodology of the Stanford prison experiment has come into question and those serving portion studies have been retracted along with many others, and the author was forced to resign from Cornell.
Fortunately, it’s less common to find studies that are retracted based on overtly disingenuous methods and analysis (like egregious p-hacking). It’s more common, and likely more concerning, that a large number of studies’ findings can’t be replicated by others.
And the replication problem applies to peer-reviewed studies. What about more applied methods and research findings that are subject to even less academic rigor? For example, the Myers-Briggs personality inventory has been around for decades and by some estimates is used by 20% of Fortune 1000 companies. But the validity—and even basic reliability—of the Myers-Briggs has been called into question. Some feel passionately that it should “die.” Others point out that many of the objections are unfounded and there’s still value.
The Net Promoter Score, of course, has also become a widely popular metric with many supporters and vocal critics in academia and industry.
To compound matters, there’s a natural friction between academia’s emphasis on internal validity from controlled settings and the need for applied research to be more generalizable, flexible, and simple.
It’s tempting to take an all or nothing approach to findings:
If there’s a flaw with a study, all the findings are wrong.
If it’s published in a peer-reviewed journal, it’s “science” and questioning it makes you a science denier.
If it’s not peer reviewed, it can’t be trusted.
But failing to replicate doesn’t necessarily mean all the original findings are worthless or made up. It often just means that the findings may be less generalizable due to another variable, or the effects are smaller than originally thought.
Thinking in terms of replication is not just an exercise for academics or journal editors. It’s something everyone should think about. It’s also something we do at MeasuringU to better understand the strengths and weaknesses of our methods and metrics.
Our solution is more replication to understand the findings carefully and then document our methods and limitations for others to also replicate. We publish the findings on our website and in peer-reviewed journals.
Here are nine examples of claims or methods that we’ve attempted to replicate. Some we found similar results for, and we failed to replicate others, but with all we learned something and hope to extend that knowledge for others to leverage.
Top-Tasks Analysis to Prioritize
Prioritizing tasks is not a new concept, and there are many methods for having users rate and rank them. But having users force-rank hundreds or thousands of features individually would be too tedious for even the most diligent of users (or even the most sophisticated choice-based conjoint analysis).
Gerry McGovern proposed a unique way of having users consider a lot of features in The Stranger’s Long Neck: Present users with a randomized (often very long) list and have them pick five tasks. I was a bit skeptical when I read it a few years ago (as are other survey experts, Gerry says). But I decided to try the method and compare it to other more sophisticated approaches. I found the results generally comparable. Since then, I’ve used his method of top-task ranking for a number of projects successfully.
Suggestions: It’s essential to randomize the order of the tasks and we found similar results whether you have respondents “pick” their top five tasks or “rank” their top five. And there’s nothing sacred about five—it could be three, two, or seven—but it should be a small number relative to the full list.
Reporting Rating Scales Using Top-Box Scoring
Marketers have long summarized rating scale data by using the percentage of responses that select the most favorable response (which, before the web, used to mean checking a box on paper). This “top-box” scoring, though, loses a substantial amount of information as 5-, 7-, and 11-point scales get turned into 2-point scales. I’ve been critical of this approach because of the loss of information. However, as I’ve dug into more published findings on the topic, I found that behavior is often better predicted from the most extreme “top-box” responses. This suggests there’s something more to top-box (and bottom-box) scoring than just executive friendliness. For example, we’ve found that people who respond with a 10 (the top box) are the most likely to recommend companies and the top SUPR-Q responders are most likely to repurchase.
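Top-box scoring is simple to compute; a sketch with hypothetical 0–10 likelihood-to-recommend ratings shows how it compares to the mean:

```python
from collections import Counter

def top_box(responses, scale_max):
    """Percent of respondents selecting the most favorable scale point."""
    counts = Counter(responses)
    return counts[scale_max] / len(responses)

# Hypothetical 11-point (0-10) responses
ratings = [10, 9, 10, 7, 10, 5, 8, 10, 6, 10]

print(top_box(ratings, 10))          # prints 0.5 (half chose the top box)
mean = sum(ratings) / len(ratings)
print(mean)                          # prints 8.5 (the conventional summary)
```

Both summaries are cheap to track side by side, which is the point of the caveat below: there's no need to choose only one.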
Caveat: While a top-box approach may predict behavior better, it’s not always better than summarizing with the mean (and certainly not a reason to reject the mean). If you’re looking to predict behavior, the extremes may be superior to the mean, but it’s not that hard to track both.
A Lostness Measure for Findability
Tomer Sharon breathed new life into a rarely used metric proposed in the 1990s when hypertext systems were coming of age. This measure of how lost users are when navigating is computed from the number of unique pages and total pages visited relative to the minimum “happy path.” The original validation data was rather limited, using only a few high school students to validate the measure on a college HyperCard application. How much faith should we have in such a small, hardly representative sample? To find out, we replicated the study using 73 new videos from US adults on several popular websites and apps. We found the original findability thresholds were, in fact, a reasonable proxy for getting lost.
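The lostness measure described above is usually attributed to Smith (1996) and computed as L = √((N/S − 1)² + (R/N − 1)²), where S is total pages visited, N is unique pages visited, and R is the minimum number of pages on the "happy path." A sketch (the example page counts are hypothetical):

```python
import math

def lostness(total_visited, unique_visited, optimal_path):
    """Smith's lostness measure: 0 means a perfectly efficient path;
    values approaching 1 indicate heavy wandering and revisiting."""
    return math.sqrt(
        (unique_visited / total_visited - 1) ** 2
        + (optimal_path / unique_visited - 1) ** 2
    )

# A user who follows the happy path exactly scores 0 (not lost)
print(lostness(total_visited=4, unique_visited=4, optimal_path=4))   # 0.0

# A wandering user who revisits pages scores much higher
print(round(lostness(total_visited=12, unique_visited=8, optimal_path=4), 2))
```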
Suggestions: Asking the 1-item SEQ predicted lostness scores with 95% accuracy. If lostness isn’t computed automatically it may not be worth the manual effort of logging paths.
Detractors Say Bad Things about a Company
The rationale behind designating responses of 0 to 6 as detractors on the 11-point Likelihood to Recommend item used for the NPS is that these respondents will be most likely to spread negative word of mouth. Fred Reichheld reported that 80% of the negative comments came from these responses in his research.
In our independent analysis we were able to corroborate this finding: we found 90% of negative comments came from those who gave 0 to 6 on the 11-point scale (the detractors).
Caveat: This doesn’t mean that 90% of comments from detractors are negative or that all detractors won’t purchase again, but it means that the bulk of negative comments do come from these lower scores.
The CE11 Is Superior to the NPS
Jared Spool has urged thousands of UX designers to reject the Net Promoter Score and instead has advocated for a lesser known 11-item branding questionnaire, the CE11. Curiously, one of the items is actually the same one used in the Net Promoter Score.
We replicated the study described in Jared’s talks but found little evidence to suggest this measure was better at diagnosing problems. In fact, we found the 11-point LTR item actually performed as well as or better than the CE11 at differentiating between website experiences.
Failed to Replicate
Suggestions: This may be a case of using a valid measure (Gallup validated the CE11) in the wrong context (diagnosing problems in an experience).
The NPS Is Superior to Satisfaction
One of the central claims of the NPS is that it’s better than traditional satisfaction measures that were often too lengthy, didn’t correlate with future behavior, and were too easy to game.
We haven’t found any compelling research to suggest that the NPS is always—or in fact ever—better than traditional satisfaction measures. In fact, we’ve found the two are often highly correlated and indistinguishable from each other.
Failed to Replicate
Caveat: Even if the NPS isn’t “superior” to satisfaction, most studies show it performs about the same. If a measure is widely used and short, it’s probably a good enough measure for understanding attitudes and behavioral intentions. Any measure that is high stakes (like NPS, satisfaction, quarterly numbers, or even audited financial reports) increases the incentive for gaming.
SUS Has a Two-Factor Structure
The System Usability Scale (SUS) is one of the most used questionnaires to measure the perceived ease of a product. The original validation data on the SUS came from a relatively small sample and consequently offered little insight into its factor structure. It was therefore assumed the SUS was unidimensional (measuring just the single construct of perceived ease). With much larger datasets, Jim Lewis and I actually found evidence for two factors, which we labeled usability and learnability. This finding was also replicated by Borsci using an independent dataset.
However, in another attempt to replicate the structure, we found the two factors were actually artifacts of the positive and negative wording of the items and not anything interesting for researchers. We published [pdf] the results of this replication.
Replicated then Failed to Replicate
Caveat: This is a good example of a replication failing to replicate another replication study!
Promoters Recommend More
Fred Reichheld reported that 80%+ of customer referrals come from promoters (9s and 10s on the 11-point LTR item). But some have questioned this designation and there was little data to corroborate the findings. We conducted a 90-day longitudinal study and found that people who reported being most likely to recommend (9s and 10s) did indeed account for the majority of the self-reported recommendations.
Caveats: Our self-reported recommendations accounted for less than the 80%+ referral data Reichheld reported, and we didn’t use actual referrals, only self-reported recommendations. While promoters might recommend, we found stronger evidence that those least likely to recommend are a better predictor of who will NOT recommend.
The NPS Predicts Company Growth
When we reviewed Fred Reichheld’s original data on the predictive ability of the NPS we were disappointed to see that the validation data actually used the NPS to predict historical, not future, revenue (as other papers also pointed out). We used the original NPS data and extended it into the future and did find it predicted two-year and four-year growth rates. But it’s unclear how representative this data was and we considered it a “best-case” scenario.
Using third-party NPS benchmark data and growth metrics for 269 companies in 14 industries, we again found a similar, albeit weaker, relationship. The average correlation between the NPS and two-year growth was r = .35 for 11 of the 14 industries and r = .44 when using relative ranks.
Caveats: The NPS does seem to correlate with (predict) future growth in many industries most of the time. It’s not a universal predictor across industries, or even in the same industry across long periods. The size of the correlation ranges from non-existent to very strong but on average is modest (r = .3 to .4). It’s unclear whether likelihood to recommend is a cause or an effect of growing companies (or due to the effects of other variables).
Thanks to Jim Lewis PhD for commenting on this article.
Do they just sound good for executives? How much faith should we put in them?
What Makes Someone a Detractor?
In The Ultimate Question, Reichheld reports that in analyzing several thousand comments, 80% of the negative word-of-mouth comments came from those who responded from 0 to 6 on the Likelihood-to-Recommend item (pg. 30).
In our independent analysis, we were able to corroborate this finding. We found 90% of negative comments came from those who gave 0 to 6 on the 11-point scale (detractors).
Therefore, we found good evidence to support the idea that responses of 0 to 6 are more likely to say bad things about a company and account for the vast majority of negative comments.
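For reference, the standard NPS cut points (detractors 0–6, passives 7–8, promoters 9–10) and the resulting score can be sketched as follows (the sample ratings are hypothetical):

```python
def nps(responses):
    """Net Promoter Score from 0-10 likelihood-to-recommend responses:
    detractors = 0-6, passives = 7-8, promoters = 9-10.
    NPS = % promoters minus % detractors."""
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100 * (promoters - detractors) / len(responses)

ratings = [10, 9, 6, 8, 10, 3, 7, 9, 5, 10]  # 5 promoters, 3 detractors
print(nps(ratings))  # prints 20.0
```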
What Makes Someone a Promoter?
Reichheld further claims that 80%+ of customer referrals come from promoters. This sort of data can be collected from internal company records—for example, asking how customers heard about the company and associating that to the purchase.
It would be good to corroborate this finding, but we aren’t privy to internal company records. We can look at a similar measure, the reported recommendation rate, across many companies. For example, in our earlier analysis, we examined two datasets: consumer software and a mix of most recently recommended products. We found:
Consumer software: 64% of consumer software customers who recommended a product were promoters.
Most recent recommendation: 69% of consumers who reported on their most recent recommended product or service were promoters.
Digging into the second data set, Figure 1 shows the distribution of the responses on the 11-point LTR question for the 2,672 people who recalled their most recently recommended company.
Figure 1: Distribution of likelihood to recommend from 2,672 US respondents who recently recommended a company or product/service.
Across these two datasets, we do find corroborating evidence that the promoters are most likely to recommend. Taking an average of the two datasets reveals that around 67% of recommenders are promoters. This is lower than the 80% that Reichheld cited but still a substantial number. To get to 80%+ of recommendations we’d need to include the 8s (84%).
The very high percentage of 10s who recommended (52% in Figure 1) is more evidence that extreme attitude responses may be a better predictor of behavior.
While we did find some corroboration of Reichheld’s data, our approach has two methodological drawbacks. First, we used only one point in time to collect past recommendations and future intent to recommend. It could be that respondents who currently rate highly on their likelihood to recommend also overstate their past recommendations, thus inflating these percentages. Second, we used the recommend rate, which is a bit different from the actual referral rate. A referral would be an actual purchase made by someone who was recommended/referred to the company. Just because you recommend a product to a friend doesn’t mean they will follow through on the recommendation and make a purchase, and that may account for some of the difference.
One way to address the potential bias from using a single timepoint to collect both past and future recommend intentions is to conduct a longitudinal study. This allows us to find out what percent of people who say they will recommend actually do.
Longitudinal Study of Promoter Recommendations
In November 2018 we asked 6,026 participants from an online US panel to rate their likelihood to recommend several common brands, their most recent purchase (n = 4,686), and their most recently recommended company/product (n = 2,763) using the 11-point Likelihood to Recommend item. The common brands shown to participants were
Budget Rent a Car
We rotated the brands so not all participants saw each one. This gave us between 502 and 1,027 likelihood-to-recommend scores per brand.
Examples of companies people mentioned purchasing from or recommending include
Whit’s Frozen Custard
Follow-up 30 to 90 Days Later
We then invited all respondents to participate in a follow-up survey until we collected 1,160 responses at 30 days (n = 560), 60 days (n = 300), and 90 days (n = 300), collected in December 2018, January 2019, and February 2019, respectively.
We presented respondents with a similar list of companies rated in the November survey and asked them to select which companies, if any, they had recommended to others in the prior 30, 60, or 90 days (based on the follow-up period).
This list of companies presented also included the company, product, or service the respondents had reported purchasing or recommending from the original survey (using an open-text field and cleaned up grammatically), shown in randomized order. If a respondent mentioned one of the companies on our brand list, we presented an alternate brand to avoid listing their stated company twice.
Participants were therefore asked how likely they are to recommend companies and brands they may or may not have done business with (e.g., Southwest, Budget, or eBay) and by definition, companies they have a relationship with (i.e., most recent purchase and most recent recommendation). The most recent recommendation can be seen as a sort of second recommendation since the original survey.
Recommendations after Recent Purchases and Recommendations
First, we analyzed respondents’ most recent purchase or most recently recommended product. Table 1 shows the number of responses, the number that recommended, the percent that recommended, and the percent of those recommendations that came from promoters at each time interval (and in aggregate).
After 90 days, we had 406 responses for the most recently recommended product (31 responses collected at 30 days, 106 at 60 days, and 269 at 90 days). Of these, 151 (37%) reported recommending the company/product to a friend or colleague at either 30, 60, or 90 days. Also, after 90 days we had 946 responses for the most recently purchased product, of which 471 (50%) reported recommending it to a friend or colleague during the 30, 60, or 90-day interval after our initial survey.
The far right column of Table 1 shows the percent of recommendations that came from promoters. For example, after 30 days, 20 out of 31 people reported recommending their previously recommended company again. Of those 20 recommendations, 17 (85%) came from promoters. At 90 days, 94 out of 269 respondents recommended their most recent company and 77% of those recommendations came from promoters.
Table 1: Percent of recommendations coming from promoters for the most recently recommended company/product or the most recently recommended purchase at 30, 60, and 90 days.
Aggregating across the 90 days we found that promoters account for 77% of recommendations from the most recently recommended product and 60% for the most recent purchase. This distribution can be seen in Figure 2.
Figure 2: Percent of all reported recommendations at 30, 60, and 90 days for respondents’ most recently purchased and most recently recommended products.
Figure 2 shows that most self-reported recommendations do indeed come from promoters (9s and 10s) although the proportion coming from 8s (passives) is indistinguishable from the percentage coming from 9s (and is nominally higher for recent purchases). Similar to Figure 1, to account for 80%+ of recommendations, we’d need to include 8s.
We also again see 10s accounting for a disproportionate amount of the recommendations, continuing to suggest the extreme responders are a better predictor of behavior.
The percentage of recommendations coming from promoters is similar to, albeit lower than (especially for the most recent purchase), Reichheld’s reported 80% of recommendations coming from promoters. This difference could be because we’re only using data from a 30-, 60-, or 90-day window; a longer time period would likely increase the number of recommendations. Research by Kumar et al. found that the optimum referral time was 90 days for telecom companies and 180 days for financial services.
Interestingly though, in looking at Table 1, a similar but often higher percentage of respondents report recommending at 30 days than at 60 or 90 days. This could be because respondents better remember recommendations over a shorter time period, or it could be a seasonal anomaly; the period we collected data in (the first 30 days included Christmas) may be contributing to the higher short-term rates. We’re also relying on self-reported recommendations, which may not be as accurate as actually tracking referrals—something a future analysis can explore.
Recommendations from Specified Brands
Next, we looked at the recommend rate for the rotating list of brands we specified. A substantial number of respondents reported having not purchased from several brands so asking about recommendations seemed less relevant. To identify customer relationships, we sent a new survey to 500 of the 1,160 respondents in February 2019 to ask them to specify how many times (or at all) they purchased from the specific brand list in 2018 and how much they estimated they spent.
We therefore focused our analysis on this subset of respondents who reported having made a purchase from the brand in the last year. This reduced the sample size because we rotated out brands at both the initial timepoint when we asked how likely they would be to recommend (November 2018) and then only asked a subset at 30, 60, and 90 days whether they recommended.
Table 2 shows a breakdown of the responses by brand (multiple responses allowed per respondent). For example, 475 people were asked the LTR for eBay in Nov 2018 but only 142 were customers (those who reported making a purchase in the prior year). Of the 142 customers who answered the LTR, 29 (20%) reported recommending the company in the prior period. Of these 29 recommenders, 14 (48%) were promoters.
Table 2 shows that 51% of all recommendations came from promoters. This percentage is lower than the 60% we found for the most recent purchase, the 77% for respondents’ most recently recommended brand, and the 80% referral rate Reichheld reported.
There is variation across the brands, with a high of 57% of recommendations coming from Amazon promoters and a low of 0% for Budget. The low percentages for Southwest and Budget are likely a consequence of very few overall recommendations for these brands in this sample. For example, Budget had only two customers who recommended, neither of whom was a promoter. With such small numbers of recommendations, wide swings in percentages by brand are expected, but the aggregate gives a more stable estimate. A future analysis can include a larger number of customers.
Table 2: Percent of recommendations from promoters at either 30, 60, or 90 days for selected brands for customers only.
Promoters’ Share of Recommendations
So, are promoters more likely to recommend? The answer is a clear yes. Table 3 shows the percent of recommendations from detractors, passives, and promoters for respondents’ most recently recommended product, most recent purchase, and across common brands. Promoters account for between 2.6 times (51% vs. 19%) and 16 times (77% vs. 5%) as many recommendations as detractors.
Table 3: Percent of recommendations for detractors, passives, and promoters.
5s and 6s Account for Most of the Detractor Recommendations
There is some credence to the idea that people who feel neutral, or even nominally above neutral (5 and 6), may still recommend a company (a point raised by Grisaffe, 2007). We see some recommending in this data too (see the 5s and 6s in Figure 2, for example). In fact, most of the recommendations from detractors come from 5s and 6s: 64% for the selected brands, 86% for recent purchases, and 77% for recent recommendations.
Least Likely To Recommend Don’t Recommend
Often the same data, when viewed from another perspective, can be just as informative. Another way to look at Table 3 is that very few detractors actually recommended the brand: between 81% and 95% of detractors did not recommend. Taking this further, the respondents least likely to recommend (the 0s to 4s) really don’t recommend: Table 3 shows between 93% and 99% of them didn’t. This is quite similar to other published findings: 88% of respondents who said they were unlikely to purchase computer equipment didn’t make a purchase, and 95% of “non-intenders” didn’t recommend a TV show. In other words, if people express a low likelihood of recommending, they probably won’t recommend.
Correlation Between Intentions and Actions
But if you are wondering who’s most likely to recommend, your best bet is certainly promoters—especially 10s (between 30–62%; see Table 3). After the promoters, a non-trivial number of passives and a few detractors (mostly 5s and 6s) will also recommend.
In general, as LTR scores go up, reported recommendation rates follow, as Figures 1 and 2 show. This relationship can be quantified with the correlation coefficient r, which shows a strong association: for the selected brands the correlation between LTR and recommendations is r = .90, for recent purchases r = .79, and for recent recommendations r = .69.
However, the large spike at 10 makes the relationship noticeably non-linear, which reduces the correlation coefficient’s ability to accurately describe it (r most likely underestimates the strength of the relationship). Better modeling of this non-linear relationship is the subject of future research.
Summary & Takeaways
Using both a single-point survey and a longitudinal study, we compared likelihood-to-recommend scores for several companies to self-reported recommendations at 30, 60, and 90 days. We found:
Between 28% and 50% of people report recommending. The percent of respondents who reported recommending the company or product varied from a low of 28% across common brands to 37% for the most recently recommended product and 50% for the most recent purchase. The 29% recommend rate for consumer software customers also falls within this range. The rate dipped lower when looking at specific brands with smaller sample sizes (e.g., Enterprise, Walmart, and Best Buy).
Customers who respond 10 tend to recommend. We saw in both datasets we analyzed that respondents who answered 10, the “top-box” score for likelihood to recommend, accounted for by far the largest portion of self-reported recommendations. There was a much larger difference between the 10s and 9s than between the 9s and 8s, lending credence to the idea that the extreme responders may be a better indicator of behavior (a non-linear relationship).
Most recommendations come from promoters. While at most around half of customers recommend, most recommendations do come from promoters. While passives and detractors still report recommending a company, promoters are between 2 and 16 times more likely to recommend than detractors. The higher a person scores on the 11-point LTR scale, the more likely they are to recommend, with the biggest (non-linear) jump happening between 9s and 10s.
Between 51% and 77% of recommendations come from promoters. Reichheld reported that promoters accounted for 80% of company referrals. Using a similar measure, we found promoters accounted for a smaller, albeit still substantial, share of self-reported recommendations, depending on how the data is analyzed.
Detractors don’t recommend. Respondents who indicated they were the least likely to recommend indeed didn’t recommend. Between 81% and 95% of detractors did not recommend, and between 93% and 99% of those who answered 0 to 4 didn’t report recommending the brand. This asymmetric relationship suggests that if people say they will recommend, they might; if people say they won’t recommend, they almost certainly won’t.
To get 80% of recommendations, you need to include the passives. In each way we examined the percentage of recommendations, promoters, while accounting for most, fell short of the 80% threshold (fluctuating between 51% and 77% depending on what was analyzed). Should you want to increase the chance you are accounting for 80% of future recommendations, include the passive responders (7s and 8s).
The time period was limited, but a lot happens in 30 days. One limitation of our analysis was that we had only a 90-day look-back window for participants to have recommended. For some products and services, this may be an insufficient amount of time for recommending to occur. Interestingly though, the 30-day recommend rate was similar to the 60- and 90-day rates, which also matched the software recommendation rate of 29%. This could be an anomaly of our seasonal data collection period (over Christmas). A future analysis can examine a longer time period.
This study was limited to self-reported recommendations. In our analyses, we relied on respondents’ abilities to accurately self-report recommending to friends and colleagues. A future analysis can help corroborate our findings by linking actual recommendations (and ultimately referrals) to likelihood-to-recommend rates.
Purchase and recommending have momentum. One reason we could see more recommendations from promoters is that participants who recommended in the past are likely to continue purchasing and recommending (something we discussed in our NPS study). Our data collection method may be biased toward more favorable attitudes, and this in turn may inflate our numbers (which may explain why promoters account for a higher share of the recommendations for recent purchases and recent recommendations).
In fact, around 30% of recent marriages started online, but it’s not like finding a date is as easy as filtering choices on Amazon and having them delivered via drone the next day (not yet at least).
Dating can be hard enough, but in addition to finding the right one, you also have to deal with things like Nigeria-based scams (and not the one with the Prince!).
Even when someone’s not directly trying to steal your money, can you really trust the profiles? By one estimate, over 80% of profiles studied contained at least one lie (usually about age, height, or weight).
Online dating isn’t all bad though. There is some evidence that the online dating sites actually do lead to marriages with slightly higher satisfaction and slightly lower separation rates. It could be due to the variety of people, those mysterious algorithms, or just a self-selection bias.
To understand the online dating user experience, we conducted a retrospective benchmark on seven of the most popular dating websites.
We asked 380 participants who had used one of the seven dating websites in the past year to reflect on their most recent experience with the service.
Participants in the study answered questions about their prior experience, and desktop website users completed the 8-item SUPR-Q and the Net Promoter Score item. In particular, we were interested in visitors’ attitudes toward the site, problems they had with the site, and reasons they used the website.
Measuring the Dating Website UX: SUPR-Q
The SUPR-Q is a standardized measure of the quality of a website’s user experience and is a good way to gauge users’ attitudes. It’s based on a rolling database of around 150 websites across dozens of industries.
Scores are percentile ranks and tell you how a website experience ranks relative to the other websites in the database. The SUPR-Q provides an overall score as well as detailed scores for the subdimensions of trust, usability, appearance, and loyalty. Its ease item can also be used to predict an accurate SUS-equivalent score.
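A percentile-rank score of this kind is just a raw score ranked against a database of comparison scores. A minimal sketch, with a hypothetical database of raw scores:

```python
def percentile_rank(score: float, database: list) -> float:
    """Percent of database scores falling below the given score."""
    below = sum(1 for s in database if s < score)
    return 100 * below / len(database)

# Hypothetical database of average raw scores for ten websites
db = [3.2, 3.5, 3.6, 3.8, 3.9, 4.0, 4.1, 4.2, 4.4, 4.6]
print(percentile_rank(4.0, db))  # → 50.0 (better than half the sites)
```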
The scores for the six dating websites (excluding the Hinge app) in the perception study were below average at the 43rd percentile (scoring better than 43% of the websites in the database). SUPR-Q scores for this group ranged from the 19th percentile (Plenty of Fish) to the 69th percentile (eHarmony).
Distrust and Disloyalty
The top improvement area for all the dating websites was trust. Across the websites, the average trust score was in the 23rd percentile. Participants expressed the highest trust toward eHarmony, but even its trust score was only slightly above average (54th percentile). Plenty of Fish had the lowest trust score (5th percentile), followed by Tinder (10th). These low trust scores are consistent with the studies we found that cite false information and even scams.
The NPS mirrored the trust scores. eHarmony, the most trusted website, was also the most likely to be recommended, with an NPS of 11%, while the least trusted site, Plenty of Fish, had the lowest NPS (-46%). Overall, the average NPS was a paltry -23%.
High Mobile App Usage
Not surprisingly, mobile app usage for dating services is high. 77% of participants reported visiting a dating service using a mobile app while only 61% said they log on using a desktop or laptop computer.
Most participants reported visiting their dating website on a desktop or laptop computer a few times per year, while mobile app users said they log on a few times a week or a few times a month. 19% of Match.com participants reported using the mobile app as much as once a day.
“The app is definitely more easy to use and intuitive, while the website seems more like an afterthought.” —Tinder user
“It’s one of the few instances of ‘websites turned into apps’ that I actually find value in.” —OkCupid user
Across the dating services, over half of participants reported they were looking for a serious relationship and just under half said they were looking for a casual relationship.
Reasons for using the dating services were similar for the website and app, with one exception: 42% of desktop website users said they are looking for a friendship while only 29% of mobile app users are. Interestingly, this difference was statistically significant.
Most Lack Dating Success
While over half of participants reported visiting dating sites to find a serious relationship, only 22% said they’ve actually found a relationship through the service. Specifically, OkCupid and Tinder users had the highest dating success in the group; 35% of OkCupid and 30% of Tinder users reported finding a relationship through the service.
Figure 1: Percent of respondents by site who report being in a relationship with a person they met on the website or app in the last year.
“I liked answering a lot of questions which would increase the match percentages I’d be able to find.” —OkCupid user
Only 9% of Zoosk users said they have found a relationship using the service. Zoosk users’ top issues with the site were dishonest users and fake profiles, poor matches, and active users who don’t respond.
“May not be great for a serious relationship.” —Zoosk user
“I keep getting referrals that are far outside my travel zone.” —Zoosk user
Dating Scams and Dishonest Users
Participants reported worrying about dishonest users and scams. On average, only 33% agreed that other users provide honest information about themselves and 41% said they are afraid of dating scams. These were the top issues reported by OkCupid and Plenty of Fish users.
“There are tons of fake/spam profiles.” —OkCupid user
“More scams than anything.” —Plenty of Fish user
“There are many suspicious profiles that seem like catfishing.” —Plenty of Fish user
Providing honest information on the site was found to be a significant key driver and explains about 17% of the dating site user experience. Other key drivers included brand attitude (22%), nicely presented profiles (10%), ease of creating and managing profiles (9%), intuitive navigation (9%), and ease of learning about other people (8%).
Figure 2: Key drivers of the online dating website user experience (SUPR-Q scores).
Together, these six components are key drivers of the dating website user experience and account for 75% of the variation in SUPR-Q scores.
Safety and Protection Resources
Across the dating services, 18% of participants reported having had an issue with another user in the past. Plenty of Fish had the highest incidence of issues with other users at 40%; however, 74% of those participants said there were resources available on the site to deal with the issue.
“I blocked the person because he was being very disrespectful.” —Plenty of Fish user
“I blocked the person who was harassing me.” —Plenty of Fish user
Using a dating service comes with obvious safety concerns, and a fair number of users feel them. Across the websites, only 54% of participants agreed that they feel safe using the site. Tinder had the lowest agreement with this item, at only 38%.
“There is not more protection from the terrible men that are on there.” —Plenty of Fish user
“They do not seem to screen out people with criminal backgrounds. Found local sex offenders in this app. It is also difficult to unsubscribe.” —Plenty of Fish user
“Somebody posted improper material in profile and I reported it to admin.” —OkCupid user
Plenty of Fish had the highest rate of unwanted images, with 61% of women reporting at least one unwanted image compared to 35% of men. eHarmony and Tinder had similar but slightly lower unwanted image rates.
Poor Matching Algorithms
While the right algorithm can help create a match, participants reported algorithms often fell short. Less than half of participants on Match.com, Plenty of Fish, Tinder, and Zoosk agreed with the statement “the site is good at matching me with people” and only 14% of Tinder users said the site asks meaningful questions about users.
“Sometimes it’s hard to sort the matches by compatibility.” —Match.com user
“I find that its match system doesn’t help a great deal in finding whether someone is well suited for you, and it is rather glitchy, with people appearing after thumbing them down.” —OkCupid user
“Poor quality of fish on the site.” —Plenty of Fish user
An analysis of the user experience of seven dating websites found:
Dating is hard; the user experience is probably harder. Current users find the dating website experience below average, with SUPR-Q scores falling at the 43rd percentile. eHarmony was the overall winner for the retrospective study at the 69th percentile, with Plenty of Fish scoring the lowest at the 19th. The top improvement area across the sites was trust. eHarmony also had the highest NPS (11%) while Plenty of Fish had the lowest (-46%).
Participants prefer using mobile apps. 77% of participants reported using the dating service mobile app. The majority of participants reported visiting the dating services a few times per week on their mobile device. 19% of Match.com users said they use the app every day. Participants reported using the app more frequently than the website for each of the dating services.
High hopes with modest success. Over half of participants reported visiting dating sites to find a serious relationship, but only 22% said they have found a relationship through the service. Specifically, OkCupid and Tinder had the highest dating success; 35% of OkCupid and 30% of Tinder users reported finding a relationship. Only 9% of Zoosk users said they found a relationship using the site.
Users are concerned about dating scams and dishonest users. Participants reported worries regarding scams on the dating sites. On average, only 33% agreed that other users provide honest information about themselves and 41% said they are afraid of dating scams. Providing honest information on the site was found to be a significant key driver and explains about 17% of the dating site user experience.
No one likes getting lost. In real life or digitally.
One can get lost searching for a product to purchase, finding medical information, or clicking through a mobile app to post a social media status.
Each link, button, and menu leads to decisions. And each decision can result in a mistake, leading to wasted time, frustration, and often the inability to accomplish tasks.
But how do you measure when someone is lost? Is this captured already by standard usability metrics or is a more specialized metric needed? It helps to first think about how we measure usability.
We recommend a number of usability measures to assess the user experience, both objectively and subjectively (which come from the ISO 9241 standard of usability). Task completion, task time, and number of errors are the most common types of objective task-based measures. Errors take time to operationalize (“What is an error?”), while task completion and time can often be collected automatically (for example, in our MUIQ platform).
Perceived ease and confidence are two common task-based subjective measures—simply asking participants how easy or difficult the task was, or how confident they are that they completed it. Both tend to correlate (r ~ .5) with objective task measures [pdf]. But do any of these objective or subjective measures capture what it means to be lost?
What Does It Mean to Be Lost?
How do you know whether someone is lost? In real life you could simply ask them. But maybe people don’t want to admit they’re lost (you know, like us guys). Is there an objective way to determine lostness?
In the 1980s, as “hypertext” systems were being developed, a new dimension was added to information-seeking behavior. Designers wanted to know whether people were getting lost when clicking all those links. Elm and Woods (1985) argued that being lost was more than a feeling (no Boston pun intended); it was a degradation of performance that could be objectively measured. Inspired by this idea, in 1996 Patricia Smith sought to objectively define lostness and described a way to measure when people were lost in hypertext. But not much has been done with it since (at least that we could find).
While there have been other methods for quantitatively assessing navigation, in this article we’ll take a closer look at how Smith quantified lostness and how the measure was validated.
A Lostness Measure
Smith proposed a few formulas to objectively assess lostness. The measure is essentially a function of what the user does (how many screens visited) relative to the most efficient path a user could take through a system. It requires first finding the minimum number of screens or steps it takes to accomplish a task—a happy path—and then comparing that to how many total screens and unique screens a user actually visits. She settled on the following formula using these three inputs to account for two dimensions of lostness:
N = Unique Pages Visited
S = Total Pages Visited
R = Minimum Number of Pages Required to Complete Task

Lostness = √[(N/S − 1)² + (R/N − 1)²]
The lostness measure ranges from 0 (absence of lostness) to 1 (being completely lost). Formulas can be confusing and they sometimes obscure what’s being represented, so I’ve attempted to visualize this metric and show how it’s derived with the Pythagorean theorem in Figure 1 below.
Figure 1: Visualization of the lostness measure. The orange lines with the “C” is an example of how a score from one participant can be converted into lostness using the Pythagorean theorem.
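Smith’s measure combines the two ratios N/S and R/N, taking the square root of the sum of their squared deviations from 1. A minimal sketch (function names are ours), including her .5 threshold for flagging a task attempt as lost:

```python
from math import sqrt

def lostness(n_unique: int, s_total: int, r_minimum: int) -> float:
    """Smith's lostness: 0 = perfectly efficient navigation,
    1 = completely lost. Inputs: N unique pages visited, S total
    pages visited, R minimum pages required (the happy path)."""
    return sqrt((n_unique / s_total - 1) ** 2 +
                (r_minimum / n_unique - 1) ** 2)

def is_lost(score: float, threshold: float = 0.5) -> bool:
    """Apply Smith's threshold: scores above .5 count as lost."""
    return score > threshold

# A user who follows the happy path exactly scores 0:
print(lostness(4, 4, 4))  # → 0.0
# A user who visits 10 unique pages over 20 total views
# when 4 would have sufficed scores as lost:
print(round(lostness(10, 20, 4), 2))  # → 0.78
```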
Smith then looked to validate the lostness measure using data from a previous study of 20 students (16- and 17-year-olds) from the UK. Participants were asked to look for information in a university department hypertext system. Measures collected included the total number of nodes (pages) visited, deviations, and unique pages accessed.
After reviewing videos of the users across tasks, she found that her lostness measure did correspond to lost behavior. She identified scores above .5 as lost, scores below .4 as not lost, and scores between .4 and .5 as indeterminate.
The measure was also applied, as reported in Smith, in another study (Cardle’s 1994 dissertation) that used eight participants and a system with more nodes; it found similar results for lostness and efficiency. But of the eight users, one had a score above .5 (indicating lost) when he was not really lost but exploring—suggesting a possible confound with the measure.
Given the small amount of data used to validate the lostness measure (and the dearth of information since), we conducted a new study to collect more data, confirm the thresholds of lostness, and see how this measure correlates with other widely used usability measures.
Between September and December 2018 we reviewed 73 videos of users attempting to complete 8 tasks from three studies. The studies included consumer banking websites, a mobile app for making purchases, and a findability task on the US Bank website that asked participants to find the name of the VP and General Counsel (an expected difficult task). Each task had a clear correct solution and “exploring” behavior wasn’t expected, thus minimizing possible confounds with natural browsing behavior that may look like lostness (e.g., looking at many pages repeatedly).
Sample sizes ranged from 5 to 16 for each task experience. We selected tasks that we hoped would provide a good range of lostness. For each task, we identified the minimum number of screens needed to complete each task for the lostness measure (R), and reviewed each video to count the total number of screens (S) and number of unique screens (N). We then computed the lostness score for each task experience. Post-task ease was collected using the SEQ and task time and completion rates were collected in the MUIQ platform.
Across the 73 task experiences we had a good range of lostness, from a low of 0 (perfect navigation) to a high of .86 (very lost) and a mean lostness score of .34. We then aggregated the individual experiences by task.
Table 1 shows the lostness score, post-task ease, task time, and completion rate aggregated across the tasks, with lostness scores ranging from .16 to .72 (higher lostness scores mean more lostness).
Table 1: Lostness, ease (7-point SEQ scale), time (in seconds), completion rates, and % lost (> .5) for the eight tasks. Tasks sorted by lostness score, from least lost to most lost.
Using the Smith “lost” threshold of .5, we computed a binary metric of lost/not lost for each video and computed the average percent lost per task (far right column in Table 1).
Tasks 8 and 5 had both the highest lostness scores and the highest percent lost. All participants on these two tasks had lostness scores above .5 and were considered “lost.” In contrast, only 6% and 19% of participants were “lost” on tasks 1 and 2, respectively.
You can see a pattern between lostness and ease, time, and completion rates in Table 1. As users get more lost (lostness goes up), perceived ease goes down and task time goes up. The correlations between lostness and these task-level measures are shown in Table 2 at both the task level and the individual level.
Table 2: Correlations between lostness and ease, completion rates, and time at the task level (n=8) and individual level (n = 73). * indicates statistically significant at the p < .05 level
As expected, correlations are higher at the task level because individual variability is smoothed out through aggregation, which helps reveal patterns. The correlation between ease and lostness is very high at the task level (r = -.95) and lower at the individual level (r = -.52). Interestingly, despite differing tasks, the correlation between lostness and task time is also high and significant: r = .72 at the task level and r = .51 at the individual level.
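The effect of aggregation on correlation is easy to demonstrate. The sketch below uses hypothetical (lostness, SEQ) pairs — not our data — and a hand-rolled Pearson correlation; averaging within tasks smooths out individual noise, so the task-level correlation comes out stronger than the individual-level one:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical (lostness, SEQ) pairs grouped by task.
tasks = [
    [(0.0, 7), (0.1, 6), (0.3, 7)],
    [(0.4, 5), (0.2, 6), (0.6, 3)],
    [(0.7, 2), (0.8, 3), (0.5, 4)],
]

# Individual level: pool every participant across tasks.
pooled = [pair for task in tasks for pair in task]
r_individual = pearson_r([l for l, _ in pooled], [s for _, s in pooled])

# Task level: correlate the per-task means.
r_task = pearson_r([mean(l for l, _ in t) for t in tasks],
                   [mean(s for _, s in t) for t in tasks])

# Aggregation strengthens the (negative) correlation.
print(f"individual r = {r_individual:.2f}, task-level r = {r_task:.2f}")
```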
The correlation with completion rate, while in the expected direction, is more modest and not statistically significant (see the “Comp” row in Table 2). This is likely a consequence of both the coarseness of this metric (binary) and a restriction in range with most tasks in our dataset having high completion rates.
The strong relationship between perceived ease and lostness can be seen in the scatterplot in Figure 2, with users’ perception of task ease accounting for a substantial ~90% of the variance in lostness. At least with our dataset, average lostness is well accounted for by ease: participants generally rate high-lostness tasks as difficult.
Figure 2: Relationship between lostness and ease (r = -.95) for the 8 tasks; p < .01. Dots represent the 8 tasks.
Table 3: Percent of participants “lost” and mean lostness score for each point on the Single Ease Question (SEQ).
Further examining the relationship between perceived ease and lostness, Table 3 shows the average percent of participants that were marked as lost (scores above .5) and the mean lostness score for each point on the Single Ease Question (SEQ) scale. More than half the task experiences were rated 7 (the easiest task score), which corresponds to low lostness scores (below .4). SEQ scores at 4 and below all have high lostness scores (above .6), providing an additional point of concurrent validity for the lostness measure. Table 4 further shows an interesting relationship. The threshold when lostness scores go from not lost to lost happens around the historical SEQ average score of 5.5, again suggesting that below average ease is associated with lostness. It also reinforces the idea that the SEQ (a subjective score) is a good concurrent indicator of behavior (objective data).
Table 4: Lostness scores aggregated into deciles with corresponding mean SEQ scores at each decile.
Validating the Lostness Thresholds
To see how well the thresholds identified by Smith predicted actual lostness, we reviewed the videos again and made a judgment as to whether the user was struggling or giving any indication of lostness (toggling back and forth, searching, revisiting pages). Two analysts independently reviewed 55 of the 73 videos (75%) and made a binary decision about whether the participant was lost or not lost (similar to the characterization described by Smith).
Lost Example: For example, one participant, when looking for the US Bank General Counsel, kept going back to the home page, scrolling to the bottom of the page multiple times, and using the search bar multiple times. This participant’s lostness score was .64 and was marked as “lost” by the evaluator.
Not Lost Example: In contrast, another participant, when looking for checking account fees, clicked a checking account tab, inputted their zip code, found the fees, and stopped the task. This participant’s lostness score was 0 (perfect) and was marked as “not lost” by the evaluator.
Table 5 shows the number of participants identified as lost by the evaluators corresponding to their lostness score grouped into deciles.
Table 5: Percent of participants characterized as lost or not lost from evaluators watching the videos.
For example, of the 28 participant videos with a lostness score of 0, only 1 (4%) was considered lost. In contrast, 6 of the 8 (75%) participants with lostness scores between .8 and .9 were considered lost. We see good corroboration with the Smith thresholds: only 9% (3 of 34) of participants with scores below .4 were considered lost, and 89% (16 of 18) of participants with scores above .5 were considered lost.
Another way to look at the data: participants who were judged lost had a mean lostness score more than five times as high as those who weren’t (.61 vs. .11; p < .01).
Summary and Takeaways
An examination of a method for measuring lostness revealed:
Lostness as path taken relative to the happy path. An objective lostness measure was proposed over 20 years ago that combines two ratios: the number of unique pages relative to the total number of pages viewed (N/S) and the minimum number of pages needed relative to the number of unique pages viewed (R/N). Lostness is the distance of these two ratios from perfect efficiency: L = √[(N/S − 1)² + (R/N − 1)²]. Computing this lostness measure requires identifying the minimum number of pages or steps needed to complete a task (the happy path) as well as counting all screens and the number of unique screens (a time-consuming process). A score of 0 represents perfectly efficient navigation (not lost) while a score of 1 indicates being very lost.
Thresholds are supported but not meant for task failure. Data from the original validation study had suggested lostness values below .4 indicated that participants weren’t lost and values above .5 as participants being lost. Our data corroborated these thresholds as 91% of participants with scores below .4 were not considered lost and 89% of participants with scores above .5 were lost. The thresholds and score, however, become less meaningful when a user fails or abandons a task and visits only a subset of the essential screens, which decreases their lostness score. This suggests lostness may be best as a secondary measure to other usability metrics, notably task completion.
Perceived ease explains lostness. In our data, average task-ease scores (participant ratings on the 7-point SEQ) explained about 90% of the variance in lostness scores (r = -.95). At least with our data, when participants were lost, they generally knew it and rated the task harder (at least when aggregated across tasks). While subjective measures aren’t a substitute for objective measures, they do correlate, and post-task ease is quick to ask and analyze. Lower SEQ scores already indicate a need to look further for problems, and this data suggests participants getting lost may be a culprit for some tasks.
Time-consuming process is ideally automated. To collect the data for this validation study we had to review participant videos several times to compute the lostness score (counting screens and unique screens). It may not be worth the effort to review videos just to identify a lostness score (especially if you’re able to more quickly identify the problems users are having with a different measure). However, a lostness score can be computed using software (something we are including in our MUIQ platform). Researchers will still need to input the minimal number of steps (i.e., the happy path) per task but this measure, like other measures such as clicking non-clickable elements, may help quickly diagnose problem spots.
There’s a distinction between browsing and being lost. The tasks used in our replication study all had specific answers (e.g., finding a checking account’s fees). These are not the sort of tasks participants want to spend any more time (or steps) on than they need to. For these “productivity” tasks, where users know exactly what they need to do or find, lostness may be a good measure (especially if it’s automatically collected). However, for more exploratory tasks where only a category is defined and not a specific item, like browsing for clothing, electronics, or the next book to purchase, the natural back-and-forth of browsing may quantitatively look like lostness. A future study could examine how well lostness holds up for these more exploratory tasks.
But even the most usable product isn’t adequate if it doesn’t do what it needs to.
Products, software, websites, and apps need to be both usable and useful for people to “accept” them, both in their personal and professional lives.
That’s the idea behind the influential Technology Acceptance Model (TAM). Here are 10 things to know about the TAM.
1. If you build it, will they come? Fred Davis developed the first incarnation of the Technology Acceptance Model over three decades ago at around the time of the SUS. It was originally part of an MIT dissertation in 1985. The A for “Acceptance” is indicative of why it was developed. Companies wanted to know whether all the investment in new computing technology would be worth it. (This was before the Internet as we know it and before Windows 3.1.) Usage would be a necessary ingredient to assess productivity. Having a reliable and valid measure that could explain and predict usage would be valuable for both software vendors and IT managers.
2. Perceived usefulness and perceived ease of use drive usage. What are the major factors that lead to adoption and usage? There are many variables but two of the biggest factors that emerged from earlier studies were the perception that the technology does something useful (perceived usefulness; U) and that it’s easy to use (perceived ease of use; E). Davis then started with these two constructs as part of the TAM.
Figure 1: Technology Acceptance Model (TAM) from Davis, 1989.
3. Psychometric validation from two studies. To generate items for the TAM, Davis followed the Classical Test Theory (CTT) process of questionnaire construction (similar to our SUPR-Q). He reviewed the literature on technology adoption (from 37 papers) and generated 14 candidate items each for usefulness and ease of use. He tested them in two studies. The first study was a survey of 120 IBM participants on their usage of an email program, which revealed six items for each factor and ruled out negatively worded items that reduced reliability (similar to our findings). The second was a lab-based study with 40 grad students using two IBM graphics programs. This provided 12 items (six for usefulness and six for ease).
Usefulness Items
1. Using [this product] in my job would enable me to accomplish tasks more quickly.
2. Using [this product] would improve my job performance.*
3. Using [this product] in my job would increase my productivity.*
4. Using [this product] would enhance my effectiveness on the job.*
5. Using [this product] would make it easier to do my job.
6. I would find [this product] useful in my job.*
Ease of Use Items
7. Learning to operate [this product] would be easy for me.
8. I would find it easy to get [this product] to do what I want it to do.*
9. My interaction with [this product] would be clear and understandable.*
10. I would find [this product] to be flexible to interact with.
11. It would be easy for me to become skillful at using [this product].
12. I would find [this product] easy to use.*
* indicates items that are used in later TAM extensions
4. Response scales can be changed. The first study described by Davis used a 7-point Likert agree/disagree scale, similar to the PSSUQ. For the second study, the scale was changed to a 7-point likelihood scale (from extremely likely to extremely unlikely) with all scale points labeled.
Figure 2: Example of the TAM response scale from Davis, 1989.
Jim Lewis recently tested (in press) four scale variations with 512 IBM users of Notes (yes, TAM and IBM have a long and continued history!). He modified the TAM items to measure actual rather than anticipated experience (see Figure 3 below) and compared different scaling versions. He found no statistical differences in means between the four versions and all predicted likelihood to use equally. But he did find significantly more response errors when the “extremely agree” and “extremely likely” labels were placed on the left. Jim recommended the more familiar agreement scale (with extremely disagree on the left and extremely agree on the right) as shown in Figure 3.
Figure 3: Proposed response scale change by Lewis (in press).
5. It’s an evolving model and not a static questionnaire. The M is for “Model” because the idea is that multiple variables will affect technology adoption, and each is measured using different sets of questions. Academics love models because science relies heavily on models to both explain and predict complex outcomes, from the probability of rolling a 6 to gravity to human attitudes. In fact, there are multiple TAMs: the original TAM by Davis, a TAM 2 that includes more constructs put forth by Venkatesh (2000) [pdf], and a TAM 3 (2008) that accounts for even more variables (e.g., subjective norm, job relevance, output quality, and results demonstrability). These extensions to the original TAM show the increasing desire to explain the adoption (or lack thereof) of technology and to define and measure the many external variables. One finding that has emerged across multiple TAM studies is that usefulness dominates, and ease of use operates largely through usefulness. Or as Davis said, “users are often willing to cope with some difficulty of use in a system that provides critically needed functionality.” This can be seen in the original TAM in Figure 1, where ease of use operates through usefulness in addition to usage attitudes.
6. Items and scales have changed. In the development of the TAM, Davis winnowed the items from 14 to 6 for the ease and usefulness constructs. The TAM 2 and TAM 3 use only four items per construct (the ones with asterisks above and a new “mental effort” item). In fact, another paper by Davis et al. (1989) also used only four. There’s a need to reduce the number of items because as more variables get added, you have to add more items to measure these constructs and having an 80-item questionnaire gets impractical and painful. This again emphasizes the TAM as more of a model and less of a standardized questionnaire.
7. It predicts usage (predictive validity). The foundational paper (Davis, 1989) showed a correlation between the TAM and higher self-reported current usage (r = .56 for usefulness and r = .32 for ease of use), which is a form of concurrent validity. Participants were also asked to predict their future usage, and this prediction correlated strongly with ease and usefulness in the two pilot studies (r = .85 for usefulness and r = .59 for ease). But these correlations were derived from the same participants at the same time (no longitudinal component), which has the effect of inflating the correlation. (People say they will use things more when they rate them higher.) Another study by Davis et al. (1989) did have a longitudinal component. It used 107 MBA students who were introduced to a word processor and answered four usefulness and four ease of use items; 14 weeks later the same students answered the TAM again along with self-reported usage questions. Davis reported a modest correlation between behavioral intention and actual self-reported usage (r = .35). In the same study, the TAM variables explained 45% of the variance in behavioral intention, which established some level of predictive validity. Later studies by Venkatesh et al. (1999) also found correlations of around r = .5 between behavioral intention and both actual usage and self-reported usage.
8. It extends other models of behavioral prediction. The TAM was an extension of the popular Theory of Reasoned Action (TRA) by Ajzen and Fishbein but applied to the specific domain of computer usage. The TRA is a model that suggests that voluntary behavior is a function of what we think (beliefs), what we feel (attitudes), our intentions, and subjective norms (what others think is acceptable to do). The TAM posits that our beliefs about ease and usefulness affect our attitude toward using, which in turn affects our intention and actual use. You can see the similarity in the TRA model in Figure 4 below compared to TAM in Figure 1 above.
Figure 4: The Theory of Reasoned Action (TRA), proposed by Ajzen and Fishbein, of which the TAM is a specific application for technology use.
9. There are no benchmarks. Despite its wide usage, there are no published benchmarks available on TAM total scores nor for the usefulness and ease of use constructs. Without a benchmark it becomes difficult to know whether a product (or technology) is scoring at a sufficient threshold to know whether potential or current users find it useful (and will adopt it or continue to use it).
10. The UMUX-Lite is an adaptation of the TAM. We discussed the UMUX-Lite in an earlier article. It has only two items, whose wording is similar to the original TAM items: [This system’s] capabilities meet my requirements (which maps to the usefulness component), and [This system] is easy to use (which maps to the ease component). Our earlier research has found even single items are often sufficient to measure a construct (like ease of use). We expect the UMUX-Lite to increase in usage in the UX industry and help generate benchmarks (which we’ll help with too!).
Thanks to Jim Lewis for providing a draft of his paper and commenting on an earlier draft of this article.