Basic Concepts in Power Analysis
The number of subjects in your study is critical to producing meaningful results

Robin High
robinh@uoregon.edu

What is "statistical power analysis"? And how and under what circumstances is it applied?

The purpose of many research projects is to search for evidence that the value of a population parameter differs from a hypothesized value. In practice, if the computed p-value from a statistical test is "small" (e.g., less than an arbitrary cutoff value), the null hypothesis is rejected in favor of the alternative. This process is comparable to an investigation in which the researcher determines whether a collected sample of data provides sufficient evidence to change our view of a population characteristic.

Power is broadly defined as "the probability that a statistical significance test will reject the null hypothesis for a specified value of an alternative hypothesis." Another way to conceptualize it is "the ability of a test to detect an effect, given that the effect actually exists."

Here is a simple, illustrative example: suppose you want to explore the effect regular exercise has on a person's "quality of life." When designing such an experiment, what questions should you consider?

Initial Considerations. Assume the exercise program is of scientific value and that two groups will be available, with an equal number of subjects randomly assigned to each (one group is assigned to the exercise program and the other serves as a control). Here are a few important considerations:

* Has any previous research already been done on this topic?
* How will the response variable, "quality of life," be measured?
* What experimental design strategy would work best?
* What factors would you need to control or hold constant?
* What personal characteristics should be measured even though they are not considered part of the design?
* Sample size: how many subjects are needed?
This is only a partial list of the many important questions that must be carefully considered when designing any data collection activity. All too often, however, the question of sample size is slighted, or perhaps even ignored, even though it is critical to the usefulness of the results.

Essential information. When performing a statistical power analysis, you'll need to consider the following important information:

1. Significance level, or the probability of a Type I error. A common, yet arbitrary, choice is α = 0.05.
2. Power to detect an effect. This is expressed as power = 1 - β, where β is the probability of a Type II error. Power = 0.80 is also a common, yet arbitrary, choice.
3. Effect size (in actual units of the response) the researcher wants to detect. Effect size and the ability to detect it are directly related: the smaller the effect size, the more difficult it will be to find.
4. Variation in the response variable. The standard deviation of the response variable of interest, which usually comes from previous research or pilot studies, is often used.
5. Sample size. A larger sample size generally leads to parameter estimates with smaller variances, giving you a greater ability to detect a significant difference.

These five components of a power analysis are not independent: in fact, any four of them automatically determine the fifth. The usual objective of a power analysis is to calculate the sample size (5) for given values of items (1) through (4). In studies with limited resources, the maximum sample size will be known. Power analysis then becomes a useful tool for determining whether sufficient power (2) exists for specified values of (1), (3), (4), and (5). As a result, the researcher can evaluate whether the study is worth pursuing.

Comments

Significance Level. Using α = 0.05 for the probability of a Type I error is completely arbitrary.
In fact, I am 90% confident that 75% of researchers do not know what α actually means, so how could a "correct" cutoff level be chosen? Values of 0.10, 0.05, and 0.01 are nice round numbers for α for which convenient tables are available. Today we have computers that can apply any value of α, but for some reason people think that there is a written commandment that "thou must use α = 0.05." Similar comments can be made concerning β, the probability of a Type II error, but explaining this important probability is beyond the limited scope of this article.

What effect size is meaningful? The size of a practical difference in the response that you would like to detect among the groups is crucial. It essentially measures the "distance" between the null hypothesis (H0) and a specified value of the alternative hypothesis (HA). It refers to the underlying population, not to data from a sample. A desirable effect size is the degree of deviation from the null hypothesis (in actual units of the response) that is considered large enough to attract your attention.

Jacob Cohen, an important contributor to power analysis, defined effect sizes as small, medium, and large, and he has stated that "all null hypotheses, at least in their two-tailed forms, are false." A difference is always going to be there; however, it might be so small that you should not be concerned about finding it. The concept of small, medium, and large effect sizes can be a reasonable starting point if you do not have more precise information. (Note that an effect size should be stated in terms of a numerical value, not a percent change such as 5% or 10%.)

Returning to the example, if a difference in quality of life due to an exercise program exists, is the magnitude of the difference worth detecting? Suppose the level of exercise you apply to subjects in the treatment group will cause an observed change in quality of life of one unit on the chosen measurement scale.
Is a one-unit change, or even a change of 5 or 10 units, meaningful when facing the reality that many factors external to the study will also affect a person's quality of life?

Estimates of Variation. You'll also need an estimate of the variability in the response of interest before you can determine the sample size needed to detect an effect. Such estimates are often found from pilot studies or from previous research, although this information is all too often not readily available in published documents. Some parameters of interest, such as a correlation or a coefficient of variation, are dimensionless quantities, and in those cases a standard deviation is not required.

Power calculations. Computing power for any specific study may very well be a difficult task. However, if you do not evaluate the joint influence of the size of the effect and the inherent variability of the response during the planning stage, one of two inefficient outcomes will most likely result:

1. "Low power" (too little data, meaningful effect sizes are difficult to detect). If too few subjects are used, a hypothesis test will have such low power that there is little chance of finding a significant effect. Consider someone attempting to start a car on a cold winter morning with a weak battery: it just doesn't provide the cranking power to get the engine going. This is analogous to designing an experiment in which resources were not put to optimal use (i.e., data were collected from fewer subjects than are needed to detect a meaningful effect).
2. "High power" (too much data, trivially small effect sizes can be detected). At the other extreme, consider an experiment in which so much data is collected that a trivially small difference in the effect is detectable. One could describe this approach as the "Tim the Tool-Man" method ("MORE POWER! eh, eh"). If you have ever watched the popular TV show "Home Improvement" and its show-within-a-show, "Tool Time," you'll know exactly what I mean.
Again, the researcher has not put all of his or her time and resources to good use: in statistical terms, too many subjects have been studied.

A study with low power will have indecisive results, even if the phenomenon you're investigating is real. Stated differently, the effect may well be there, but without adequate power, you won't find it. The situation with high power is the reverse: you will likely see highly significant results, even if the effect size you're investigating is not of practical value. Stated differently, the effect is there, but its magnitude is of little use.

In conclusion, the number of subjects you use is critical to the success of a study. Without a sufficient number, you won't be able to achieve adequate power to detect the effect you're looking for. With too many subjects, you may be using valuable resources inefficiently. Either way, a study with too little or too much power wastes time and resources; some researchers view this as unethical scientific behavior.

Computer Software for Power Calculations

Several Internet sites will help you understand power analysis resources and get started with either commercial or free software. Descriptions of a few available programs are shown below.

1. nQuery Advisor Release 4.0 - This software, which is used for both sample size and power curve calculations, contains extensive tables for data entry and many convenient editing features. For more information, see http://www.statsol.ie
2. SamplePower® 1.2 - SamplePower, available from SPSS, arrives at sample sizes for a variety of common data analysis situations. You can learn more about it at http://www.spss.com/spower/research.htm
3. G*Power - You can download this free program from http://www.psychologie.uni-trier.de:8000/projects/gpower.html G*Power allows you to calculate sample sizes for given effect sizes, standard deviations, significance levels, and power values.
It also allows you to calculate power curves for varying levels of the other essential data.
4. UnifyPow - This is another free power analysis program, and it uses SAS. You can find example programs and workshop notes at the UnifyPow web site at http://www.bio.ri.ccf.org/power.html
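
A Small Worked Sketch

To make concrete the earlier point that any four of the five components determine the fifth, here is a minimal Python sketch for the two-group exercise example. It is not taken from any of the programs listed above. It assumes a two-sided comparison of two means with equal group sizes and a known common standard deviation, and it uses the normal approximation, so dedicated software based on the t distribution will usually report a slightly larger sample size. The function names and the numbers in the usage lines (a difference of 5 units, a standard deviation of 10 units) are hypothetical and chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects needed in EACH of two equal groups.

    delta: smallest difference in means worth detecting (units of the response)
    sigma: assumed common standard deviation of the response
    alpha: probability of a Type I error (two-sided test)
    power: 1 - beta, the desired probability of detecting delta
    Assumption: normal approximation with known sigma.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def approx_power(n, delta, sigma, alpha=0.05):
    """Approximate power for n subjects per group (same assumptions).

    The negligible chance of rejecting in the wrong direction is ignored.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = sigma * sqrt(2 / n)  # SE of the difference in means
    return STD_NORMAL.cdf(abs(delta) / standard_error - z_alpha)


# Hypothetical inputs: detect a 5-unit difference when the sd is 10 units.
needed = n_per_group(delta=5, sigma=10)
print("subjects per group:", needed)
print("power with 30 per group:", round(approx_power(30, delta=5, sigma=10), 2))
```

Solving for the sample size corresponds to the usual objective of a power analysis, while the second function covers the limited-resources case in which the maximum sample size is already known and the question is whether the resulting power is adequate.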