How many users does it take to screw up a project?

Years ago I read a very interesting science fiction story by the great Isaac Asimov, called Franchise.

The story is set in a future where computers have become increasingly able to predict human behaviour - voting behaviour in particular. So much so that the computer was able to reduce the sample size of the voting public further and further, to the point where it could identify a single voter who supposedly represented everyone. That poor slob was forced to vote, and was held responsible for whatever happened from there. The story was inspired by the UNIVAC computer's prediction of the result of the 1952 US presidential election.

Recently I was thinking about that story and its relevance to user research sample sizes. We UXers, like UNIVAC, aim to infer from the few what holds true for the many.

Since the early days of UX research there has been a tension between the classic market research/consumer research fields (where sample sizes are huge and statistically significant) and the UX world (where sample sizes are often far smaller). The logic was simple: since people often think in similar ways, a few people will encounter most of the problems that a far larger number of people would see.

So - how many is enough?

In user testing and research, there has been an increasing trend towards smaller sample sizes. In the '90s most user testing involved 10-12 people at a minimum, and often up to 20. Since then the average has fallen to just 5 or 6 people per test.

But how slim can the numbers get before we lose quality? Will we end up, like UNIVAC, identifying a single user who is forced at gunpoint to uncover all the problems in the system?

It's quite possible we've already pushed sample sizes to a risky level - we may be taking more risk than we realise.

 

So what is sample size?

Sample size is a critical element in any scientific research. It is the number of objects, people, animals or plants you examine in order to measure some variable - in this case, the usability of an interface or product.

Too many subjects and you're wasting time and money - and whilst that's not a problem from the science perspective, in the business world it's no small matter. Too few and your sample will most likely not uncover all of the issues there are to see. We all know we might not uncover everything, but how big can that gap get? Are you seeing 95% of the issues, or just 25%?

Let’s take a look at the research.

For years, the five-user assumption has been debated in usability. We've heard that the law of diminishing returns means a test involving five to eight users will reveal around 80% of an interface's usability problems (Nielsen, 1993; Virzi, 1992). That's an assumption that many UX companies and researchers around the world have used as a starting point.
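It's worth seeing the arithmetic behind that claim. The usual model, popularised by Nielsen and Landauer, says the proportion of problems found by n users is 1 - (1 - p)^n, where p is the average probability that a single user encounters any given problem. Here's a minimal sketch in Python, using the often-quoted average of p = 0.31 from Nielsen and Landauer's data - bear in mind that real values of p vary a lot between products and tasks:

```python
# Expected share of usability problems found by n test users,
# using the standard cumulative-discovery model:
#     found(n) = 1 - (1 - p)^n
# p = 0.31 is the often-quoted average from Nielsen & Landauer's
# studies; it is an average across projects, not a guarantee.

def coverage(n: int, p: float = 0.31) -> float:
    return 1 - (1 - p) ** n

for n in (1, 3, 5, 8, 10, 15, 20):
    print(f"{n:2d} users -> {coverage(n):.0%} of problems")
```

With p = 0.31, five users come out at roughly 84% and eight at about 95% - which is where the 'five-to-eight finds ~80%' shorthand comes from. The catch, as we'll see, is that p is an average across users and products, not a promise.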

The first problem with that 'five-to-eight' line is the wriggle room it leaves. Fitting eight user tests into a day (at one hour each) is quite difficult, so many research companies started aiming for seven, or six. Over time, that fell to six, or just five. Now it's commonplace to see five users tested in a research day, with the assumption that we're still getting the same coverage we would have seen at eight people. For practical and financial reasons, we took shortcuts.

The main problem with this assumption remains: there is real variability in human emotional reactions, associations and thought processes. Whilst a smaller number can be representative of the wider audience, there is a lower limit here that we have been quietly ignoring.

Spool and Schroeder (2001) set out to test the reliability of this five-user rule. Using 49 participants, they ran usability tests requiring each participant to visit an online store and purchase a product. On two of the websites used, the first five users found only 35% of the total problems present. This lack of coverage appeared again in a study by Perfetti and Landesman (2002), who tested 18 participants and found that every participant from the 6th to the 18th uncovered at least five issues missed by the first five. While the validity and severity of those additional issues may vary, this hints strongly at the need for a larger sample size.

A study by Faulkner (2003) pushed this further, having 60 participants perform a usability test (after being categorised by skill level). Afterwards a program drew random samples from the user data, in sample sizes of 5, 10, 20, 30, 40, 50 and 60. Perhaps unsurprisingly, the results showed high variation when only 5 people were involved, with the proportion of issues uncovered ranging from 55% to nearly 100% - and this variation decreased as more users were added. In fact, no sample group of 20 found fewer than 95% of the problems.
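Faulkner's resampling approach is easy to reproduce in spirit. Below is an illustrative sketch - the participant count, issue count and detection probability here are my assumptions for demonstration, not her data. It simulates a pool of participants who each spot each issue with some probability, then repeatedly draws subsamples of different sizes and looks at the spread of coverage:

```python
import random

# Illustrative re-creation of Faulkner-style resampling (assumed
# numbers, not Faulkner's data): simulate a pool of participants,
# each detecting each issue with probability P_DETECT, then draw
# random subsamples and measure how much of the issue set they cover.

random.seed(42)
N_PARTICIPANTS, N_ISSUES, P_DETECT = 60, 20, 0.31

# detections[i][j] is True if participant i found issue j
detections = [[random.random() < P_DETECT for _ in range(N_ISSUES)]
              for _ in range(N_PARTICIPANTS)]

def coverage_of(sample):
    """Share of issues found by at least one participant in the sample."""
    found = [any(detections[i][j] for i in sample) for j in range(N_ISSUES)]
    return sum(found) / N_ISSUES

for size in (5, 10, 20):
    draws = [coverage_of(random.sample(range(N_PARTICIPANTS), size))
             for _ in range(1000)]
    print(f"n={size:2d}: min {min(draws):.0%}, "
          f"mean {sum(draws) / len(draws):.0%}")
```

Run this and the same pattern falls out: the n=5 draws swing wildly while the n=20 draws cluster near full coverage. The variance, not the average, is what the five-user rule glosses over.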

So all of this research confirms something we always knew: sample size is a game of risk.

In theory, a sample size of five may uncover 100% of the issues to be found - but as the graph below shows, your chances of hitting those five people are pretty small.

The image above shows the variance in coverage across sample sizes, with each dot representing a sample group, as defined by the label below the graph. What this shows is quite interesting; I've tried to summarise it into three key points:

  1. If you test with 5 people, you are most likely to find around 85% of the issues - but you may find as few as 55% of them.
  2. If you test with 10 people, you are most likely to find around 95% of the issues - but you may find just 82% of them.
  3. If you test with 15-20 people, you will most likely find around 90-97% of the issues - although you're now spending up to twice as much to find those last few percentage points.

 

So how many should you test?

As always, life is about balance. Five people (or six) can be tested in a single day, whereas ten or twelve need two days of research. If you are engaging a research company like us, that means almost twice the cost.

But it also means the worst-case proportion of issues found rises from 55% to 82%.

So how many to test is, as always, a function of how much risk you can accept, how much you can spend on the research, and how often you are testing.

But just to be safe - start with ten.