Jakob Nielsen's latest Alertbox, titled "Quantitative Studies: How Many Users to Test?", looks at how confidence intervals and margins of error improve as the number of users tested increases when measuring usability metrics. [Note: this is in contrast to his
recommendation to test 5 users when looking qualitatively at usability.]
The article draws on a large number of usability tests carried out by NNg and makes some interesting points:
i) Usability metrics tend to follow a Normal (or Gaussian) distribution - which makes the statistical analysis that much more convenient;
ii) User time-on-task performance tends to show a standard deviation of 52% of the mean;
iii) 20 users offers a reasonable test sample size for most usability metrics.
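To get a feel for how these numbers interact, here is a minimal Python sketch of the margin of error, expressed as a fraction of the mean, for different numbers of users. It assumes the 52% figure above and a 90% confidence level, and it uses a plain Normal approximation; that approximation is my simplification, not necessarily how NNg did the calculation, and a t-distribution would give slightly wider intervals for small samples. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_margin_of_error(n_users, cv=0.52, confidence=0.90):
    """Margin of error as a fraction of the mean (Normal approximation).

    cv is the standard deviation divided by the mean; 0.52 is the
    time-on-task figure quoted above.
    """
    # Two-sided critical value, e.g. about 1.645 for 90% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * cv / sqrt(n_users)


for n in (5, 10, 20, 40, 80):
    print(f"{n:3d} users: +/- {relative_margin_of_error(n):.1%} of the mean")
```

Under these assumptions, 20 users gives a margin of error of roughly 19% of the mean, and halving that takes four times as many users.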
A couple of counter-points worth keeping in mind:
i) Although the finding that usability metrics tend to follow a Normal distribution is useful, it is not a prerequisite for analysis: there are well-established statistical techniques that do not assume a Normal distribution. This matters particularly when analysing data that calls for non-parametric methods (e.g. ranks, categorisations or counts);
ii) Always calculate the standard deviation and margin of error based on the data that you've collected. Whilst NNg's insight provides a useful starting point for deciding the number of test subjects, you still need to go through the process of calculating the standard deviation (sd) and margin of error (e) for your own data set;
iii) NNg use a 90% confidence interval as their baseline for determining the recommended number of test subjects: sometimes this level of confidence is insufficient, and so a greater number of test subjects would be required. (I tend to use 95% CI myself, particularly if the implementation cost is high.)
iv) Confidence intervals form part of the general set of summary statistics about a data set (along with mean, variance etc.). They describe a characteristic of a particular sample, which allows some inference as to the nature of the general population - they don't, on their own, provide a comparison between two populations. For example, is a time-on-task CI of 3:00 mins +/- 30 secs better or worse than a time-on-task CI of 3:10 mins +/- 15 secs?
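As a rough illustration of points ii) and iii), the sketch below computes the mean, sd and e for a single sample at both 90% and 95% confidence. The time-on-task values are made up for the example, the function name is hypothetical, and the Normal approximation is again an assumption on my part; with only a handful of users a t-distribution would be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def margin_of_error(samples, confidence=0.95):
    """Return (mean, sd, e) for one sample, using a Normal approximation."""
    n = len(samples)
    m = mean(samples)
    sd = stdev(samples)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, sd, z * sd / sqrt(n)


# Hypothetical time-on-task measurements, in seconds
times = [150, 210, 180, 240, 120, 200, 170, 190, 160, 180]

for level in (0.90, 0.95):
    m, sd, e = margin_of_error(times, level)
    print(f"{level:.0%} CI: {m:.0f}s +/- {e:.1f}s (sd = {sd:.1f}s)")
```

The same data gives a wider interval at 95% than at 90%, which is why a stricter confidence level needs more test subjects to reach the same margin of error.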
Finally, this article feels like a response to the JUS article cited here previously. In particular, it reads as a response to Lewis and Sauro's references to the use of 5 test subjects for usability studies, which may in turn have been influenced by a previous Alertbox article.