Dealing with 'Outliers': Maintain Your Data's Integrity

"Outliers are observations that have extreme values relative to other observations observed under the same conditions. Observations may be outliers because of a single large or small value of one variable or because of an unusual combination of values of two or more variables."
From 'Statistical Design and Analysis of Experiments', Mason, Gunst, and Hess.

It's an unfortunate fact of research that data are not always well-behaved. "Outliers" - unusual data values - occur in almost all research projects involving data collection. This is especially true in observational studies, where data may naturally take on very unusual values even if they come from reputable sources. Data entry errors and rare events (such as readings from a thermometer left in the sun, a change in accounting practice, or a subject who has a sudden muscle spasm) are among the many reasons for outliers to exist in a collection of data.

Sources of Outliers

Data Entry Errors. When looking for outlying observations, first check for data recording or entry errors. This may require comparing the value entered in the data file with the value actually recorded on the source document from which it was read during data entry. If data are entered manually, a spreadsheet such as Excel is likely to be a good choice, since the Split option keeps column headers and row IDs in view at all times. This also allows for easy visual data checks, which improve input accuracy and therefore reduce the occurrence of data recording errors to begin with.

Implausible Values. One type of data entry error is an 'implausible' or 'impossible' value, one that makes no sense given the expected range of the data. For example, a temperature of 789 degrees Fahrenheit is obviously wrong when a range of 35-105 is expected (unless you live on Venus).
An out-of-range value is often easy to identify since it will most likely lie well outside the bulk of the data.

"Rare" events. Another common cause of outliers is the "rare" event syndrome - extreme observations that are perfectly legitimate but do not fit within the typical range of the other data values. Such unusual observations might include

* a January day in Oregon that reaches 65 degrees
* a 500 point rise/drop in a stock market index in one trading day
* an unusually high score on an aggressiveness scale for a troubled child

All these events may be considered relatively rare, but they should still be considered part of the overall picture.

You can write a set of computer commands to identify data entry errors or extreme observations within a dataset of any size. SAS is a particularly good tool for this purpose.

Why Are Outliers a Problem?

Developing techniques to look for outliers and understanding how they affect data analysis are extremely important parts of a complete analysis, especially when any statistical technique that involves a sum of squares is to be applied to the data. In the presence of outliers, any statistical test based on sample means and variances can be distorted. For example, estimated regression coefficients that minimize the Sum of Squares for Error (SSE) are very sensitive to outliers. There are several problematic effects of outliers, including:

* bias or distortion of estimates
* inflated sums of squares (which make it unlikely you will partition sources of variation in the data into meaningful components)
* distortion of p-values (statistical significance, or lack thereof, can be due to the presence of a few - or even one - unusual data value)
* faulty conclusions (it's quite possible to draw false conclusions if you haven't looked for indications that there was anything unusual in the data)

The following example may seem a bit extreme, but real data with this type of anomaly actually exist.
The results vividly demonstrate potential problems that lurk in the background due to unusual data values.

                                                                95% Confidence Interval
                    Sorted Data       Median   Mean   Variance  for the mean
  Real Data         1  3  5  9  12      5       6.0      20.0   [0.45 to 11.55]
  Data with Error   1  3  5  9 120      5      27.6    2676.8   [-36.63 to 91.83]
                               ^^^

The first four cells across each row contain the same data values. However, in the second row, the fifth entry differs greatly from the corresponding value in the first row. Note that in the presence of one severe outlier, the median does not change in this example. This shows why the median is robust (i.e., it probably will not vary much in the presence of a few outliers) and is often the preferred summary statistic for the "center" of a skewed or contaminated distribution.

In the table above, it takes only one outlier to greatly distort the mean, variance, and 95% confidence interval for the mean. Similar distortions apply to the computation of regression coefficients, p-values from the analysis of variance, or any technique that uses sums of squares in the calculations.

Identification of Outliers

A word of caution before beginning: if you are ever pointed to a "simple" routine for deciding which points in a dataset are outliers, it is likely to be wrong. There is no such thing as a simple test. There are, however, many ways to look at a distribution of numerical values to see if certain points seem out of line with the majority of the data, and expert knowledge of what values the data can take is probably the best guide. There are also a few guidelines with which one can always begin.

The "normal" distribution myth. Although not necessarily an issue with outliers, it is important to first recognize what the distribution of your data looks like. For many statistical modeling purposes, the data do not require a "normal" or symmetric, bell-shaped distribution. (The normality assumption applies to the residuals from a linear statistical model.)
Data collected as counts will not usually look very "normal". Data that are collected across groups may have a distribution that has several local peaks. In fact, for data to be entered into a linear regression model, it is preferable for the independent or explanatory variables not to have a normal distribution. The mathematics behind linear regression demonstrates that normality is not required or even desirable for this type of analysis. What is important is to check for data values that lie well outside the range of the other data (called leverage points) and will likely exert a strong influence on the results. Your objective should be to collect data with a distribution that allows you to make the best inferences possible about the population under study.

Visual Aids

Always check the distributions of data, whether they are nominal or continuous. This should be one of the first steps in data analysis, as it will quickly reveal the most obvious outliers. For continuous or interval data, a dotplot of a single variable or a matrix of all pairwise scatterplots of the continuous variables is a good way to visually detect outlying observations. With larger sample sizes a boxplot is another very helpful tool, since it makes no distributional assumptions and does not require any prior estimate of a mean or standard deviation. Values that are extreme in relation to the rest of the data are easily identified.

Univariate tests are available which check for the presence of outliers; however, many of them are designed to test for the presence of only one outlier, and they also make strong distributional assumptions which are often not relevant (e.g., they assume a normal distribution when you may have skewed non-negative data). They may also require that a location (mean) or scale (standard deviation) parameter be estimated from the data. As shown earlier, outliers greatly influence these two summary statistics.
This is one reason why "eliminating data that exceed two or three standard deviations" may not be a good, or even a reasonable, decision rule.

IQR computation. A simple approach is to compute the interquartile range (IQR) for continuous data and then take a multiple of it as a cut-off that defines which values are considered outliers. For large datasets, a boxplot applies this technique to identify outliers. It is an extremely effective approach, especially when you have 30 or more data points within each group level. Only basic computing skills are required to find the IQR and then compute the limits beyond which values could be considered outliers.

One way to apply this approach is to use PROC UNIVARIATE with SAS and save the order statistics available with its OUTPUT statement. The first quartile (q1), third quartile (q3), and interquartile range (iqr) can be saved to an output dataset or assigned to macro variables (see example below). You can then flag observations that lie outside of q1-(1.5*iqr) and q3+(1.5*iqr) as potential outliers and anything outside of q1-(3*iqr) and q3+(3*iqr) as problematic outliers.

Here is a set of SAS commands that will detect outliers using order statistics.

    * compute the quartiles and IQR of y and save them in qdata;
    PROC UNIVARIATE DATA=mydata NOPRINT;
       VAR y;
       OUTPUT OUT=qdata Q1=q1 Q3=q3 QRANGE=iqr;
    RUN;

    * copy the saved statistics into macro variables;
    DATA _null_;
       SET qdata;
       CALL SYMPUT("q1",q1);
       CALL SYMPUT("q3",q3);
       CALL SYMPUT("iqr",iqr);
    RUN;

    * save the outliers;
    DATA outliers;
       SET mydata;
       LENGTH severity $2;
       severity=" ";
       * one star: beyond 1.5 IQR, two stars: beyond 3 IQR;
       IF (y <= (&q1 - 1.5*&iqr)) OR (y >= (&q3 + 1.5*&iqr)) THEN severity="*";
       IF (y <= (&q1 - 3*&iqr)) OR (y >= (&q3 + 3*&iqr)) THEN severity="**";
       IF severity IN ("*", "**") THEN OUTPUT outliers;
    RUN;

    PROC PRINT DATA=outliers;
       VAR y severity;
       TITLE 'Data outliers for review';
    RUN;

Statistical tests for detection of outliers consist of two general types: Dixon tests for outliers use ratios of ranges and subranges of the data. Grubbs tests apply ratios of two sums of squares.
Both tests are based on an assumption of normally distributed errors. Dixon tests are relatively easy to calculate and are of primary benefit when only one or two observations are suspected as outliers. Grubbs tests require more computations, but are generally more powerful and can be used to test for any number of outliers.

Multivariate outliers can also lurk undetected in an analysis. Univariate tests for outliers are not designed to identify multivariate outliers. Consider two data values, x1 and x2, where neither one is considered an outlier when examined with a univariate test as described above. However, the combination of the two values can lie outside the periphery of the range of the data plotted in two-dimensional space (see Figure 1) - in this case the two values work together to create an influential or leverage point that, for example, can exert a strong influence on the computation of regression coefficients.

Outliers Based on Median Absolute Deviation

A non-parametric or distribution-free approach to detecting outliers is based on computing medians:

1. First, compute the median of the original input data
2. Compute the absolute value of the deviations of the original input data from this median
3. Compute the median of these absolute deviations
4. Compute the ratio of each absolute deviation from step 2 to the median from step 3
5. If this ratio is greater than a critical value, then consider the value an outlier

Consider this collection of 10 scores, sorted from smallest to largest:

    x    8   25   35   41   50   75   75   79   92   129
                               ^

The median of these 10 values of x is 62.5, computed as (75+50)/2.
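The five steps listed above can be sketched in a few lines of code. The block below is a minimal illustration in Python (rather than the SAS used elsewhere in this article) that relies only on the standard library. The function names mad_ratios and flag_outliers, and the default cutoff of 2.5, are choices made for this sketch and are not part of any established routine. It is applied to the ten scores shown above.

```python
from statistics import median


def mad_ratios(values):
    """Steps 1-4: return (median, median absolute deviation, ratio per value)."""
    center = median(values)                      # step 1
    abs_dev = [abs(v - center) for v in values]  # step 2
    mad = median(abs_dev)                        # step 3
    if mad == 0:
        # Assumption of this sketch: refuse to divide by a zero MAD.
        raise ValueError("median absolute deviation is zero; ratios are undefined")
    ratios = [d / mad for d in abs_dev]          # step 4
    return center, mad, ratios


def flag_outliers(values, cutoff=2.5):
    """Step 5: return the values whose ratio exceeds the chosen cutoff."""
    _, _, ratios = mad_ratios(values)
    return [v for v, r in zip(values, ratios) if r > cutoff]


scores = [8, 25, 35, 41, 50, 75, 75, 79, 92, 129]
center, mad, ratios = mad_ratios(scores)
flagged = flag_outliers(scores)  # uses the default cutoff of 2.5
```

The cutoff is left as a parameter because it is an arbitrary choice. The guard against a zero median absolute deviation is an addition of this sketch: when more than half of the values are identical, the median of the absolute deviations is zero and the ratio in step 4 cannot be computed.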
Next, calculate the absolute value of the deviation of each original data value from the median (shown here sorted by absolute deviation):

      x     med    abs_dev
     50    62.5     12.5
     75    62.5     12.5
     75    62.5     12.5
     79    62.5     16.5
     41    62.5     21.5   ->|
     35    62.5     27.5   ->|  MEDIAN(abs_dev) = 24.5 = (21.5+27.5)/2
     92    62.5     29.5
     25    62.5     37.5
      8    62.5     54.5
    129    62.5     66.5

Next, compute a test statistic, which is the column of absolute deviations computed above divided by the median of the absolute deviations:

    Test Stat = abs_dev / (Med of abs_dev)

                              Med of     Test
      x    Median   abs_dev   abs_dev    Statistic   Outlier?
      8     62.5     54.5      24.5      2.22449
     25     62.5     37.5      24.5      1.53061
     35     62.5     27.5      24.5      1.12245
     41     62.5     21.5      24.5      0.87755
     50     62.5     12.5      24.5      0.51020
     75     62.5     12.5      24.5      0.51020
     75     62.5     12.5      24.5      0.51020
     79     62.5     16.5      24.5      0.67347
     92     62.5     29.5      24.5      1.20408
    129     62.5     66.5      24.5      2.71429     Yes

The decision rule then is to compare this test statistic with an arbitrary cutoff point. A cutoff of 2.5 is conservative; 4.5 or 5 is more rigorous. If the Test Statistic > Critical value (=2.5), then define the observed value as an outlier. According to this cutoff value, the data above have one outlier (x=129).

Outliers versus Influential Observations in Linear Regression

It is always wise to check how outliers may influence the results of a statistical analysis. For example, there are several ways to view outliers in linear regression. Chatterjee and Hadi (Ref 7) give the following definitions:

Outlier: An observation for which the Studentized residual is large relative to other observations in the data set.

High-Leverage Point: A point (or several points) located far away from the center of the points in the X space; such points may be regarded as outliers. In fact, the ordinary residuals, ei, and leverage values, pii, are related: high leverage points tend to have small residuals.

Influential Observation: An observation that, individually or jointly with others, excessively influences the regression equation.

These definitions are important considerations within the context of linear regression.
In particular, contrasting views of what is an outlier are illustrated with the following two figures:

[Scatterplot of Y against X, with one point labeled A]

Figure 2. Point A is a high leverage point that conforms to a linear relationship between X and Y.

From the diagram, point A *is* an influential point in the sense that summary statistics will be stronger. The R-squared statistic will likely be much larger than when point A is excluded from the calculations. The hat matrix calculations for X will easily tell you this. However, point A is not an influential point in terms of how it influences the estimated coefficients of the regression line. The slope of the line through those points would be about the same regardless of whether point A is in the data set or not.

Whether one considers point A to be an outlier depends on how one defines an outlier. Point A lies far away from the rest of the data, which is one way to define an outlier. However, the residual of point A from the regression line is quite small. Statistics such as Mahalanobis distance would quickly spot this outlier (as would a simple visual inspection of a scatterplot!), but looking at residuals alone (unstandardized, standardized, studentized) would not.

In the plot below, point A is both an outlier and influential.

[Scatterplot of Y against X, with one point labeled A]

Figure 3. Point A is a high leverage point that does not conform to the linear relationship between X and Y.

Whether a point is defined to be an outlier depends on how far it lies from the rest of the data, regardless of whether it conforms to a model estimated from the rest of the data. A point is influential if it doesn't conform to the remainder of the data. Another way to examine the effects of an outlier is: do your results change substantially when computing results with and without this observation?
If so, it is considered influential. In Figure 2 above, removing point A would not substantially change the coefficient estimates; however, it would considerably change R-squared and the p-values from significance tests. In Figure 3, removing point A would substantially influence all summary statistics.

What Should You Do With Outliers?

Working with outliers in continuous data can pose rather difficult decisions. Neither ignoring them nor deleting them at will is a good solution. If you do nothing, you may end up with a model that describes essentially none of the data - neither the bulk of the data nor the outliers. Even though your numbers may be perfectly legitimate, if they lie outside the range of most of the data, they can cause computational anomalies and resulting inference problems.

Accommodation. Accommodation of outliers uses techniques to mitigate their harmful effects. One of its strengths is that accommodation does not need to be preceded by identification: these techniques can often be used without prior determination that outliers exist. However, keep in mind that identification and accommodation do not compete; rather, they reinforce each other. A few possible approaches to accommodating outliers are listed below.

Nonparametric Methods. One very effective way to work with data is to use methods which are robust in the presence of outliers. Nonparametric statistical methods fit into this type of analysis and should be applied to continuous or interval data more widely than they currently are. Simulation studies have indicated that, when outliers are not a problem, the ability of these methods to detect significant differences is only slightly smaller than that of the corresponding parametric methods. Various forms of robust regression models and computer-intensive approaches deserve attention.
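As one illustration of the robust regression approaches mentioned above, the following Python sketch compares the ordinary least-squares slope with the Theil-Sen slope (the median of the slopes of all pairs of points). Theil-Sen is chosen here only as a simple example of a robust estimator; this article does not single out any particular method. The data are invented for illustration: eight points lying exactly on y = 2x, with the last response then replaced by a gross error. The function names are hypothetical, and only the standard library is used.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(xs, ys):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def theil_sen_slope(xs, ys):
    """Median of the slopes of all pairs of points with distinct x values."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    return median(slopes)


# Invented data: y = 2x exactly, then one response replaced by a gross error.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
clean_ys = [2 * x for x in xs]
bad_ys = clean_ys[:-1] + [60]  # the last value should have been 16

ols_clean = ols_slope(xs, clean_ys)          # 2.0
ols_bad = ols_slope(xs, bad_ys)              # about 5.67
robust_bad = theil_sen_slope(xs, bad_ys)     # 2.0
```

In this made-up example the single bad value nearly triples the least-squares slope, while the median of pairwise slopes is unchanged because only 7 of the 28 pairs involve the contaminated point. Real data will rarely behave this cleanly, so the comparison is best read as a with-and-without style check rather than as a recommendation of one estimator.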
Transformations. Data transformations may soften the impact of outliers, since two common functions, square roots and natural logarithms, shrink larger values to a much greater extent than they shrink smaller values. However, transformations strong enough to pull outliers into proximity with the rest of the data may compress the data too much. One should never transform data for the sole purpose of eliminating or reducing the impact of outliers. These transformations also require non-negative data (square root) or data that are greater than zero (logs), so they are not automatically an answer.

Nevertheless, an appropriate transformation may have the beneficial effect of turning extraordinary points into merely the largest or smallest values in the dataset. Transformations may improve symmetry and (where required) linearity or additivity.

Transformations may not fit into the theory of the model, and they will affect its interpretation. A transformation of a variable does more than make a distribution less skewed; it changes the relationship between the original variable and the other variables and thus will likely require a different interpretation of a main effect or interaction in an ANOVA model.

Data transformations can be important tools but shouldn't be viewed as a cure-all for distributional problems associated with outliers. If a transformation is necessary, always apply it for an explicit reason that will make the model more appropriate. Non-homogeneous variances can often be handled by modern statistical programs (e.g., PROCs MIXED or GENMOD with SAS), so a transformation in this situation is generally not good statistical practice.

Back-transforming a predicted value to its original units applies a non-linear function, so the result is biased (see Kennedy). An additional term must be added to the predicted value before back-transformation.

Deletion.
Only as a last resort should outliers be deleted, and then only if they are found to be errors that can't be corrected or lie so far outside the range of the remainder of the data that they distort statistical inferences. When in doubt, you can report model results both with and without the outliers to see how much they change.

Data transformations and deletion are important tools but shouldn't be viewed as a cure-all for computational problems associated with outliers. Transformations and/or outlier elimination should be an informed choice, not a routine task.

Summary

Despite the difficulties, exploring why outliers exist can provide many clues to the development of better models. In fact, many great discoveries in human history can be traced to a researcher exploring some outlying or unusual observed value. Outliers may indicate that an important range of the data, one worth knowing about, has been ignored.

This article gives only a brief introduction to the problems of outliers, their detection, and approaches to data analysis. It's presented with the hope that looking for unusual values will always be a regular part of your data analysis, and that your research objectives and knowledge of your subject matter will help you decide what to do with them once you find them.

Always apply exploratory data analysis techniques that look for both univariate and multivariate outliers, and then evaluate how they affect the results with and without transformations, accommodation, and deletion. This will help you reach conclusions that are in line with your research objectives. A "common sense" approach is often the best solution.

References

1. ASTM method E 178 on Dealing with Outlying Observations.
2. Barnett, V., and T. Lewis (1984). Outliers in Statistical Data, 2nd ed., Chichester: Wiley.
3. Barnett, V., and T. Lewis (1994). Outliers in Statistical Data, 3rd ed., New York: Wiley.
4. Beckman, R. J., and R. D. Cook (1983). "Outlier...s." Technometrics, vol. 25, pp.
119-149.
5. Blaedel, W. J., Meloche, V. W., Ramsay, J. A., "A comparison of criteria for the rejection of measurements," J. Chem. Educ., December 1951, 643-647.
6. Cook, R. D. (1977). "Detection of influential observations in linear regression," Technometrics 19, 15-18.
7. Chatterjee and Hadi (1988). Sensitivity Analysis in Linear Regression, New York: Wiley.
8. Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression, New York: Chapman and Hall.
9. Cook, D., and Weisberg, S., An Introduction to Regression Graphics, Wiley.
10. Dean, R. B., Dixon, W. J., "Simplified statistics for small numbers of observations," J. Anal. Chem., 23, 1951, 636-638.
11. Dixon, W. J., "Analysis of extreme values," Ann. Math. Stat., 21, 1950, 488-506.
12. Dixon, W. J., "Ratios involving extreme values," Ann. Math. Stat., 22, 1951, 68-78.
13. Hadi (1994), "A Modification of a Method for the Detection of Outliers in Multivariate Samples," JRSSB 56:2, 393-396.
14. Hadi and Simonoff (1993), "Procedures for the Identification of Multiple Outliers in Linear Models," JASA, 88:424, 1264-1272.
15. Hampel, Ronchetti, Rousseeuw, and Stahel, Robust Statistics, John Wiley & Sons, 1986.
16. Hawkins, D. M. (1980). Identification of Outliers, Chapman and Hall.
17. Huber, Robust Statistical Procedures, SIAM, 1977 (A new chapter was added in 1996).
18. Hu Yuzhu, Smeyers-Verbeke, J., Massart, D. L., "Outlier detection in calibration," Chemometrics and Intelligent Laboratory Systems, 9, 1990, 31-44.
19. Jones, M. C., and Sibson, R., "What is Projection Pursuit," J. R. Statistical Society, Series A, Vol. 150, Part 1, 1987, pp. 1-36 (Outlier discussion by Tukey, J. W., on pg. 33).
20. Lavine, M. (1991). "Problems in Extrapolation Illustrated With Space Shuttle O-Ring Data," JASA, (86), 919-921.
21. Miller, J. N., "Outliers in experimental data and their treatment," Analyst, 118, May 1993, 455-461.
22. Mitschele, J., "Small sample statistics," J. Chem.
Educ., 66 (6), June 1991, 470-473, and references.
23. Rosner's multiple outlier test, Technometrics 25, No. 2, May 1983, 165-172.
24. Rousseeuw, P. J., and A. M. Leroy (1987). Robust Regression and Outlier Detection, New York: Wiley.
25. Tietjen, G. L. and R. H. Moore (1972). "Some Grubbs-Type Statistics for the Detection of Several Outliers," Technometrics, v14, (3), 583-597.
26. Weisberg, S. (1985). Applied Linear Regression, 2nd ed., New York: Wiley.
27. Wilcox, Rand R. (1998). "How Many Discoveries Have Been Lost by Ignoring Modern Statistical Methods?" American Psychologist, Vol. 53, No. 3, 300-314.
28. Youden, W. J., as reported in column ("Out of the Editor's Basket") item "The Best Two out of Three?" in J. Chem. Educ., December 1949, 673-674.