Computation of the Odds Ratio with Small or Zero Cell Counts

A dietician collected data concerning the effect of eating habits on the status of an unnamed physical condition for a group of subjects. Without going into details of the study, she wants to determine how the food they typically consume relates to the current status of their physical condition (there may be other factors, but diet is considered extremely important for this study). She converted summaries of their reported diets during the study period into four ordinal levels of nutrition (diet 1 is the best, diet 4 is the worst). Only levels 1, 2, and 3 of her diet categories applied to the sample of 50 subjects; category 4 of diet is extremely deficient in healthful benefits. The response variable is success/failure, according to whether the subject showed noticeable improvement in their condition. The following table of responses was observed:

-------------------------------
|          | CONDITION IMPRV  |
|          |------------------|
|          |Success | Failure |  odds
|DIET      |        |         |
|1st Diet  |    9   |   11    |  9/11
|2nd Diet  |   10   |   15    | 10/15
|3rd Diet  |    0   |    5    |  0/ 5
|4th Diet  |    0   |    0    |
|Total     |   19   |   31    |
-------------------------------

Since no subjects were classified into diet category #4, it was dropped from further consideration. What is important to notice here is the zero (0) count of 'success' for diet category #3 (the lowest nutritional level of the three remaining diets). All five subjects in the sample for diet 3 failed to improve their condition.

The objective is to estimate the odds of a 'success', that is, the ratio of successes to failures of improvement at each level of diet, and then to compute the odds ratios for comparisons among the three diet levels.

One difficulty occurs when fitting a logistic regression equation with a categorical explanatory variable in the presence of a zero count in any one of the cells.
The reason is best shown by example: a zero in any cell produces an odds ratio of 0 or infinity, as seen from the following calculations. Let the actual cell counts be represented by letters:

    Diet   success   failure
     1        a         b
     2        c         d
     3        e         f

Odds of success for each diet:

    diet 1 = p1 / (1-p1) = a / b
    diet 2 = p2 / (1-p2) = c / d
    diet 3 = p3 / (1-p3) = e / f

The odds ratio of a success for diet 1 compared with diet 2 is

    OR(success) = [ a / b ] / [ c / d ] = (a*d) / (b*c)

In this equation, if any one of a, b, c, or d is equal to zero, the odds ratio will be either 0 or infinity; neither value is of much interest as a measure of effect size. If only 1 subject at diet level 3 had recorded a success, the difficulties in the estimation of an odds ratio would be considerably reduced. However, in categorical data analysis, including logistic regression, there are often situations where zero counts occur in one (or more) of the cells.

How should one analyze data that appear to be a routine application of logistic regression, yet are not analyzable when the explanatory variable is categorical? The process examined is shown in portions of the following SAS outputs.

First, enter the dietary data into SAS and use PROC LOGISTIC:

    DATA diet;
      input diet success failure diet_x;
      total=success+failure;
      cards;
    1  9 11 1
    2 10 15 2
    3  0  5 4.5
    ;
    PROC LOGISTIC;
      CLASS diet / param=glm;
      MODEL success/total = diet / link=logit;
    RUN;

    The LOGISTIC Procedure

    Class Level Information
                         Design Variables
    Class     Value      1      2      3
    diet      1          1      0      0
              2          0      1      0
              3          0      0      1

    Model Convergence Status
    Quasi-complete separation of data points detected.

    WARNING: The maximum likelihood estimate may not exist.
    WARNING: The LOGISTIC procedure continues in spite of the above warning.
             Results shown are based on the last maximum likelihood iteration.
             Validity of the model fit is questionable.
    Model Fit Statistics
                                  Intercept
                   Intercept         and
    Criterion        Only         Covariates
    AIC             68.406          67.176
    SC              70.318          72.912
    -2 Log L        66.406          61.176

    WARNING: The validity of the model fit is questionable.

    Testing Global Null Hypothesis: BETA=0
    Test                Chi-Square    DF    Pr > ChiSq
    Likelihood Ratio      5.2302       2      0.0732
    Score                 3.5229       2      0.1718
    Wald                  0.1171       2      0.9431

    Type III Analysis of Effects
                        Wald
    Effect    DF    Chi-Square    Pr > ChiSq
    diet       2      0.1171        0.9431

    Analysis of Maximum Likelihood Estimates
                                  Standard      Wald
    Parameter    DF    Estimate    Error     Chi-Square    Pr > ChiSq
    Intercept     1    -12.2865    208.2       0.0035        0.9529
    diet 1        1     12.0858    208.2       0.0034        0.9537
    diet 2        1     11.8810    208.2       0.0033        0.9545
    diet 3        0      0          .           .             .

    Odds Ratio Estimates
                     Point          95% Wald
    Effect          Estimate    Confidence Limits
    diet 1 vs 3     >999.999    <0.001   >999.999
    diet 2 vs 3     >999.999    <0.001   >999.999

Note that the coefficient estimates are extremely large in magnitude (intercept = -12.2865, diet(1) = 12.0858, diet(2) = 11.8810) and have enormous standard errors (208.2), because the procedure is essentially attempting to estimate an infinite odds ratio for diets 1 and 2 in relation to diet 3. Each slope nearly cancels the intercept: the sums, -0.2007 and -0.4055, are simply the observed log odds for diets 1 and 2, while the intercept alone drifts toward minus infinity to represent the 0/5 odds for diet 3.

One could analyze the data using only diets 1 and 2 with no estimation problems:

    PROC LOGISTIC;
      WHERE (1 <= diet <= 2);
      CLASS diet / param=glm;
      MODEL success/total = diet / link=logit;
    RUN;

    Type III Analysis of Effects
                        Wald
    Effect    DF    Chi-Square    Pr > ChiSq
    diet       1      0.1138        0.7359

    Analysis of Maximum Likelihood Estimates
                                  Standard      Wald
    Parameter    DF    Estimate    Error     Chi-Square    Pr > ChiSq
    Intercept     1     -0.4055    0.4082      0.9864        0.3206
    diet 1        1      0.2048    0.6072      0.1138        0.7359
    diet 2        0      0          .           .             .

    Odds Ratio Estimates
                     Point          95% Wald
    Effect          Estimate    Confidence Limits
    diet 1 vs 2       1.227      0.373     4.034

Another option (not shown) is to classify the subjects from diet categories 2 and 3 into one category and run the above analysis.
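The diet 1 versus diet 2 estimates reported above (odds ratio 1.227, standard error 0.6072, Wald limits 0.373 and 4.034) can be checked by hand from the four cell counts, and the same arithmetic gives the point estimate when diets 2 and 3 are pooled. The following is a minimal Python sketch of that check, not part of the original SAS analysis; the function name odds_ratio_wald is hypothetical, and the interval is the usual Wald interval on the log-odds scale, which requires all four counts to be positive.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.959964):
    """Odds ratio (a/b)/(c/d) with a Wald interval built on the log scale.

    a, b: successes and failures in the first group
    c, d: successes and failures in the comparison group
    Returns (odds ratio, SE of log odds ratio, lower limit, upper limit).
    """
    if min(a, b, c, d) <= 0:
        # With a zero cell the log odds ratio and its SE are not finite.
        raise ValueError("Wald interval needs all four counts > 0")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or), se,
            math.exp(log_or - z * se), math.exp(log_or + z * se))

# Diet 1 (9 successes, 11 failures) versus diet 2 (10 successes, 15 failures)
or_12, se_12, lo_12, hi_12 = odds_ratio_wald(9, 11, 10, 15)

# Diet 1 versus diets 2 and 3 pooled (10 + 0 successes, 15 + 5 failures)
or_pooled, se_pooled, lo_pooled, hi_pooled = odds_ratio_wald(9, 11, 10, 20)

print(f"diet 1 vs 2:   OR={or_12:.3f}  SE={se_12:.4f}  "
      f"CI=({lo_12:.3f}, {hi_12:.3f})")
print(f"diet 1 vs 2+3: OR={or_pooled:.3f}  SE={se_pooled:.4f}  "
      f"CI=({lo_pooled:.3f}, {hi_pooled:.3f})")
```

Pooling gives a point estimate of (9*20)/(11*10), about 1.636, but it hides the very feature of interest: the separate behavior of diet 3. Calling the function with diet 3 as the comparison group (counts 0 and 5) raises the error, which is the hand-calculation counterpart of the separation warning from PROC LOGISTIC.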
Yet, if possible, information about the subjects at all three diet levels should be retained, particularly how diet 3 compares with diets 1 and 2, since it appears to have a more adverse effect on the number of successes than either diet 1 or diet 2; i.e., an increasing lack of success in the treatment program appears to be directly related to increasingly bad diets. That 5 subjects failed and 0 succeeded at diet level 3 is important information to utilize. Described next are a few ways (some better than others) to proceed.

A 'weighted analysis' could be performed in which each of the 50 observations is given a weight of 1 and the empty cell for diet 3 (the 0 count of successes) is assigned a relatively small weight of, say, 0.05. This allows the estimation procedure to converge without placing too much emphasis on the 'false' record.

    DATA diet_2;
      SET diet;
      KEEP diet y count;
      IF success = 0 then success=.05;
      y=1; count=success; OUTPUT;
      y=0; count=failure; OUTPUT;
    RUN;
    PROC LOGISTIC DATA=diet_2 descending;
      CLASS diet / param=glm;
      MODEL y = diet / link=logit;
      WEIGHT count;
    RUN;

However, the estimated odds ratios are still quite large: exp(4.4043) is about 82 and exp(4.1996) is about 67. They are also driven entirely by the arbitrary weight, since the intercept -4.6050 is just log(0.05/5). More 'reasonable' estimates are to be found with other approaches.

    Analysis of Maximum Likelihood Estimates
                                  Standard      Wald
    Parameter    DF    Estimate    Error     Chi-Square    Pr > ChiSq
    Intercept     1     -4.6050    4.4941      1.0500        0.3055
    diet 1        1      4.4043    4.5165      0.9509        0.3295
    diet 2        1      4.1996    4.5126      0.8661        0.3520
    diet 3        0      0

Another method is to treat the categorical variable in this study as 'ordinal'. If one can derive a scale that places diets 1, 2, and 3 in an ordered, numerical relationship, then the explanatory variable diet can be treated as a regressor variable (with one degree of freedom).
The client considered the diets 1, 2, 3, and 4 to be in the relation of 1, 2, 4.5, and 10, so a new 'continuous' variable called 'diet_x' was created, where:

    diet_x = 1   for diet 1
    diet_x = 2   for diet 2
    diet_x = 4.5 for diet 3

    PROC LOGISTIC DATA=diet;
      MODEL success/total = diet_x / link=logit;
    RUN;

These statements give the following partial output:

    Analysis of Maximum Likelihood Estimates
                                  Standard      Wald
    Parameter    DF    Estimate    Error     Chi-Square    Pr > ChiSq
    Intercept     1      0.6945    0.7328      0.8982        0.3433
    diet_x        1     -0.6794    0.4086      2.7655        0.0963

    Odds Ratio Estimates
                     Point          95% Wald
    Effect          Estimate    Confidence Limits
    diet_x            0.507      0.228     1.129

There is no problem in the parameter estimation procedure with this approach. The difficult, if not impossible, requirement is to find a good numerical scale to represent the various levels of nutrition.

There is yet one more option to consider: EXACT methods! SAS can perform logistic regression for categorical predictors when there are small counts (even 0's) in one or more of the cells. Just enter the data in the table form as shown at the top of this message, or in the structure of case data (spreadsheet form) where

    y    = 0 (=failure), 1 (=success)
    diet = 1, 2, 3

The variable success equals the sum of the y's for each level of diet. Then apply PROC LOGISTIC and be sure to declare diet as a CLASS variable (otherwise SAS will treat it as a numeric regressor, as in the diet_x analysis, with no estimation problems). Add the EXACT statement:

    PROC LOGISTIC DATA=diet EXACTOPTIONS(maxtime=60);
      CLASS diet(REF='3') / PARAM=ref;
      MODEL success/total = diet / link=logit;
      EXACT diet / ESTIMATE=odds;
    RUN;

Note: the parameterization param=glm does not work when calculating EXACT tests.
    Exact Conditional Analysis

    Conditional Exact Tests
                                          --- p-Value ---
    Effect    Test           Statistic    Exact      Mid
    diet      Score            3.4525     0.2065    0.1974
              Probability      0.0181     0.2065    0.1974

    Exact Odds Ratios
                                 95% Confidence
    Parameter    Estimate           Limits           p-Value
    diet 1        4.813*        0.565   Infinity     0.1644
    diet 2        4.039*        0.485   Infinity     0.2176

    NOTE: * indicates a median unbiased estimate.

Even though the maximum likelihood solution (the results from PROC LOGISTIC without the EXACT statement) does not give reasonable estimates, the exact analysis gives a p-value and estimated parameters for diets 1 and 2 (where diet 3 is chosen here to be the reference category). The results show that the odds ratio for diet 1 compared with diet 3 is 4.813, although the p-value is not significant (0.1644). Likewise, the odds ratio for diet 2 compared with diet 3 is 4.039, also with a non-significant p-value of 0.2176. These results reveal much more 'conservative' estimates of the odds ratios than those found with the weighted or regression analyses given above.

What number of failures in diet 3 would it take to achieve significance with this data set? If 3 more subjects had been included in the sample who also failed at diet 3 (i.e., the number of failures=8), the p-value for diet 1 would be significant at the 0.05 level (0.0486) with an associated odds ratio of diet 1 compared to diet 3 equal to 7.866.

Complete Separation

A problem in estimation arises when there is some linear combination of the explanatory variables that perfectly predicts the dependent variable. With categorical explanatory variables in logistic regression this problem is called 'complete' (or quasi-complete) separation. It happens when all (or most) of the responses at one of the levels of the categorical variable are successes and all (or most) of the responses at another level are failures (i.e., very small numbers of failures/successes in relation to the total at each level).

Suppose that you have a predictor variable which takes on only two values.
Complete separation exists with the layout:

                      Response
                  Failure | Success
                  --------|--------
    Predictor 0 |    25   |    0   |
                |---------|--------|
              1 |     0   |   21   |
                 ------------------

There are no successes when the value of the predictor variable is 0, and there are no failures when the value of the predictor variable is 1. Obviously, this represents very good classification, yet the maximum likelihood estimates of the odds ratio cannot be obtained as the value p/(1-p) is not defined for one of the levels of the predictor variables. When using SAS PROC LOGISTIC you would get this warning message:

    "There is possibly a quasicomplete separation in the sample points. The maximum likelihood estimate may not exist".

If one of the off-diagonal cells is nonzero, then quasi-complete separation exists. Depending on how the data are set up, you may not be able to compute the odds ratio [p1/(1-p1)]/[p2/(1-p2)].

In the study described here, this situation would be represented by hypothetical data as:

-------------------------------
|          |     ABSTINC      |
|          |------------------|
|          |Success | Failure |
|----------|--------|---------|
|NUTR_ST   |        |         |
|1st Diet  |   20   |    0    |
|2nd Diet  |   10   |   15    |
|3rd_Diet  |    0   |    5    |
|----------|--------|---------|
|Total     |   30   |   20    |
-------------------------------

Diet 1 and diet 3 are completely separated since 0 counts exist in opposite columns. However, there is an indirect link through diet 2 which has both success and failures.

Hosmer and Lemeshow discuss both of these problems in "Applied Logistic Regression" (Wiley, 1989). Another useful summary is given by Scott Menard in "Applied Logistic Regression Analysis" (Sage publication #106).
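For a model with a single categorical predictor and one parameter per level, the two kinds of separation described above can be read directly off the table of counts. The following is a minimal Python sketch under that assumption; the function name classify_separation and the dictionary layout are hypothetical, and the rule does not extend to models with several predictors or numeric regressors, where separation depends on linear combinations of the variables.

```python
def classify_separation(table):
    """Classify a {level: (successes, failures)} table for ONE categorical predictor.

    Returns "complete", "quasi-complete", "none", or "constant response".
    """
    # Levels with no subjects carry no information and are skipped.
    rows = [(s, f) for s, f in table.values() if s + f > 0]

    total_success = sum(s for s, _ in rows)
    total_failure = sum(f for _, f in rows)
    if total_success == 0 or total_failure == 0:
        return "constant response"  # only one outcome observed at all

    # A level is "pure" when every subject in it had the same outcome.
    pure = [s == 0 or f == 0 for s, f in rows]
    if all(pure):
        return "complete"           # the level alone predicts the outcome
    if any(pure):
        return "quasi-complete"     # at least one level has an infinite log odds
    return "none"

two_level    = {0: (0, 25), 1: (21, 0)}                         # 2 x 2 layout above
observed     = {1: (9, 11), 2: (10, 15), 3: (0, 5), 4: (0, 0)}  # study data
hypothetical = {1: (20, 0), 2: (10, 15), 3: (0, 5)}             # ABSTINC table
diets_1_2    = {1: (9, 11), 2: (10, 15)}                        # diet 3 dropped

for name, tbl in [("two_level", two_level), ("observed", observed),
                  ("hypothetical", hypothetical), ("diets_1_2", diets_1_2)]:
    print(name, "->", classify_separation(tbl))
```

Under this rule the observed study data are classified as quasi-complete, in agreement with the convergence status reported by PROC LOGISTIC, and the hypothetical table is also quasi-complete because diet 2 contains both outcomes.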