Data Coding Issues with Logistic Regression

by Robin High
Statistical Consultant
University of Oregon

Response data entered into a statistical model that are categorical variables typically have a relatively small number of distinct levels. For example, dichotomous data have two levels such as yes/no or success/failure. When a categorical variable has three or more levels, it may be defined either as nominal, such that the distinct levels of the variable have no inherent order (e.g., levels of race, occupation, religious preference), or as ordinal, where the order of the various levels does matter (e.g., Likert scales, degree of preference, income, or age levels). When a variable has only two levels, the distinction between treating it as nominal or ordinal in a statistical model is not relevant, other than which level is of greater interest to the researcher. A convenient way to work with categorical data is to aggregate the number of times each level of the response is observed (integer counts) across all subjects.

Like interval or continuous numerical explanatory data, categorical data can be entered into a statistical model among the independent variables to predict a continuous response. The classification factor in a one-way ANOVA is among the first examples which demonstrate this application. However, categorical data can also be the dependent or response variable of interest, that is, a type of data you want to analyze in a manner similar to how you would analyze a continuous response with linear regression or ANOVA.

As the name implies, a dichotomous or binary response should be coded with two distinct levels. Dichotomous data have very different distributional assumptions than the interval or continuous data commonly analyzed with techniques such as linear regression or ANOVA. SAS offers several statistical procedures for the analysis of categorical data as the response variable of interest.
The purpose of this article is to describe how to code binary data to produce equivalent results from three procedures especially designed for categorical data analysis. Since dichotomous data are the most basic type of discrete data for both the response and explanatory variables, the path to understanding coding and analysis techniques will begin with them.

It is rarely appropriate to recode continuous or interval data into discrete categories. Situations where recoding data to binary may be reasonable are when a "continuous" response variable consists of mostly zeros and only a few positive values, when the response is measured with considerable error (e.g., subjective opinions), or when data have large outliers and a non-parametric approach is not appropriate. You should also not automatically recode categorical data with three or more distinct levels into two levels, although small sample sizes in some of the levels of the response may be a necessary reason for doing so. If you have the actual values of continuous or interval level data, you should not recode them into categories, since the process automatically adds measurement error which biases the estimation procedures.

For legitimate two-level responses, an informative way to code this type of variable is to enter a 0 or 1 for each level, e.g., No=0/Yes=1. Formats can then be assigned to the variables so the meaning of each level remains intact. Other coding schemes may also work; however, as will be shown in this article, when applying statistical techniques the 0/1 coding approach proves to be the most convenient.

Now, to the real purpose of this article: "Data analysis would be so easy if the analyst didn't need to consider all the assumptions."
You should recall two very important assumptions concerning logistic regression:

* the response of a "success" occurs with a constant probability p (0 < p < 1) for every trial in each group
* the responses are independent across the trials, i.e., a "success" on any given trial should not influence a "success" on any other trial.

If these assumptions are true, one way to summarize the data for each group is to count the number of "success" responses over the independent trials.

One of the first aspects of understanding how to code dichotomous data as a response variable is to clearly define which of the two levels is of greater interest to you (or to decide on one if you consider each level of equal interpretive value). Suppose your research statement is phrased something like: When asked a question, subjects from Group A will give a response of "Yes" at a different rate than subjects from Group B. A statement such as this implies that you should evaluate the frequency of a "Yes" response from Group A (the at risk group) and compare it to the frequency of "Yes" responses from Group B (the reference group). In this situation "Yes" will be referred to as the level of greater interest.

This is where coding response data as integers, rsp=0 for No and rsp=1 for Yes, comes in handy: their sum, y, is the number of "Yes" responses recorded out of the total number of independent trials, n. Suppose you have n subjects in Group A and you count the number that said "Yes". The new variable y is distributed as a binomial random variable with parameters n and p and is notated y ~ binomial(n,p), where p is the probability of a "Yes".

Given this scenario, what statistical procedure should be employed to analyze a dichotomous response variable? With the response coded as 0/1 or aggregated to compute a proportion = y/n, your first inclination might be to analyze these data with ANOVA or regression methods.
In fact, a rather old journal article discussed this very situation with the analysis of 0/1 response data (Lunney, 1970). The author essentially concluded that under certain conditions, ANOVA is a reasonably good approximation, despite the fact that some of the key assumptions behind these statistical techniques are obviously not met, e.g., normally distributed residuals. Reasons you should not enter dichotomous data as the dependent or response variable in linear regression or ANOVA include:

1) ANOVA methods assume a continuous response variable that ranges from negative infinity to positive infinity, whereas dichotomous data are bounded by 0 and 1. When analyzing continuous data this open interval does not need to be literally true, but with dichotomous data the defined bounds may be much too restrictive.

2) Linear regression and ANOVA typically assume homoscedasticity of the residuals; that is, the residuals from a model have equal variance across groups or over the range of explanatory data. The variance of a proportion depends on the value of its mean (which is a function of the levels of the independent variables), so it doesn't meet this assumption.

For aggregated 0/1 data, e.g., y successes in n independent trials, the arcsine transformation has historically been applied to the resulting proportions, which were then analyzed with ANOVA. However, with access to modern statistical programs neither approach needs to be chosen in most situations. With advancing computing technology and statistical theory since 1970, we can do MUCH BETTER! Applying ANOVA on 0/1 response data or proportions would be about the same situation as giving up your cell phone for a rotary dial with a party line, or returning to calculations with a slide rule rather than a program like SAS on your laptop. Analysis techniques especially designed for dichotomous data are logistic or probit regression models (among a few other names).
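The variance argument in point 2 can be written out explicitly. The following is a standard textbook result, not something taken from Lunney's article; the symbol \hat{p} = y/n is notation assumed here for the observed proportion.

```latex
y \sim \mathrm{binomial}(n, p), \qquad
\hat{p} = \frac{y}{n}, \qquad
E(\hat{p}) = p, \qquad
\mathrm{Var}(\hat{p}) = \frac{p\,(1-p)}{n}
```

The variance is largest at p = 0.5 and shrinks as p approaches 0 or 1, so groups with different mean proportions necessarily have different variances. The arcsine (square root) transformation was popular because, for large n, the variance of arcsin(sqrt(\hat{p})) is approximately 1/(4n), which no longer depends on p.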
The logistic model with a logit link is often preferred since the logit is the natural parameter of the binomial distribution in exponential form. It is also naturally connected to interpreting results with odds and odds ratios. [Note: the focus in this article is on logistic regression, though with most datasets similar results will likely be found with probit.] Logistic regression is similar to linear regression and ANOVA in that the independent or explanatory variables may be continuous, categorical, or a combination of both, and it produces model coefficients and tests of significance. Multinomial logistic regression is the natural extension of the binomial case. It handles a response variable with three or more response choices, which may be ordinal (ordered categories) or nominal (no order can be assumed).

SAS has several procedures that will compute results from logistic regression models, namely PROCs GENMOD, LOGISTIC, and NLMIXED. Also add to this list PROCs FREQ, PROBIT, CATMOD, GLIMMIX, and SURVEYLOGISTIC. Your immediate reaction may be one of great puzzlement as to why SAS provides so many ways to handle what appears to be the same data analysis situation. One purpose of this article is to present some important differences and similarities in coding data for input into the first three procedures listed and to introduce advantages one choice may have over the others. The other procedures also have specialized applications for categorical data, though they will not be treated within this document.

First, the objective is to code data that will produce easily interpretable results and consistent output when read into any of these procedures. Throughout this article I will assume the dichotomous response variable is coded as 0/1. The important point is to code the level of greatest interest to you as the value that sorts last in ascending order, in this case a 1.
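The earlier remark that the logit is the natural parameter of the binomial distribution can be made concrete with a short derivation. This is a standard result, sketched here for y ~ binomial(n,p) as defined earlier; the symbols \theta, \beta_0, \beta_1 and x are notation assumed for illustration and do not appear in the original text.

```latex
\begin{aligned}
P(y) &= \binom{n}{y}\, p^{y} (1-p)^{n-y}
      = \binom{n}{y} \exp\!\left\{ y \,\log\frac{p}{1-p} + n \log(1-p) \right\}, \\[4pt]
\theta &= \log\frac{p}{1-p} \quad \text{(the logit, i.e., the log odds)}, \qquad
p = \frac{e^{\theta}}{1+e^{\theta}}, \\[4pt]
\operatorname{logit}(p) &= \beta_0 + \beta_1 x
\;\;\Longrightarrow\;\;
\frac{\text{odds at } x+1}{\text{odds at } x} = e^{\beta_1}.
\end{aligned}
```

The last line is the reason exponentiated coefficients from a logit model are read as odds ratios.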
How you code the independent categorical variables will also determine the results you observe on the output. By default, most procedures in SAS that include a CLASS statement assign the "largest" coded level (i.e., the one that sorts last in ascending order) as the reference category; that is, estimated coefficients of all other categories are compared with it. This is important to know if you want to state how females compare to males on the response (i.e., the level of the response coded as 1), which implies females should be coded as gender=0 and males as gender=1. In this case coding gender as character "F" and "M" also works, since "F" precedes "M" when sorting. Also, what is the implication of coding gender as 1 and 2 and treating the explanatory data as numeric rather than categorical?

With these principles in mind, a rather small dataset (for illustration of interpreting parameter estimates) is shown below where 5 women and 4 men, selected randomly, were asked a basic yes/no question where "yes" is the response value of greater interest. The column labeled gender codes the subject as "F" or "M" for female/male. The next two columns are different ways to dummy code gender, depending on whether Female or Male is the desired reference category. The final column, gender_n, codes gender numerically as 1 and 2. These values will be read into a SAS dataset called 'one' in the subsequent examples.

    response  gender  gndrF  gndrM  gender_n
       1        F       1      0       1
       1        F       1      0       1
       0        F       1      0       1
       0        F       1      0       1
       0        F       1      0       1
       1        M       0      1       2
       1        M       0      1       2
       1        M       0      1       2
       0        M       0      1       2

By aggregating the response where y = Sum(response) for each level of gender, the same dataset could also be represented as:

    y  n  gender  gender_n
    2  5    F        1
    3  4    M        2

PROCs GENMOD and LOGISTIC

Two important procedures in SAS for estimating logistic regression models are PROC GENMOD and PROC LOGISTIC. The name GENMOD is short for generalized linear models (introduced in Dobson, 2002).
Application of this procedure for count data is similar to how you would apply PROC GLM (general linear model) or PROC MIXED (mixed linear models) when estimating linear models with continuous data. The CLASS statement of PROC GENMOD functions as it does in these other procedures in that it assumes GLM coding for classification variables (i.e., it produces k dummy values coded 0/1 where k is the number of distinct levels). The dist= and link= options on the MODEL statement allow the response variable to assume distributions other than the normal, including binomial (for dichotomous response data) and Poisson (non-negative counts). The basic syntax for logistic regression models with PROC GENMOD with a binary response variable coded 0/1 and a two-level categorical variable, gender, is:

    PROC GENMOD DATA=one descending;
      CLASS gender;
      MODEL response = gender / dist=binomial link=logit type3;
      TITLE1 "Logistic Regression with GENMOD";
    RUN;

PROC LOGISTIC is designed for analysis of dichotomous and ordinal data (e.g., Likert scales). In earlier versions of SAS, LOGISTIC worked in an analogous manner to the linear regression procedure, PROC REG, in that all input data needed to be coded numerically, including classification factors (i.e., gender would need to be dummy coded as 0/1). PROC LOGISTIC now includes a CLASS statement; however, if the CLASS statement contains a discrete variable name only (like the one entered for PROC GENMOD), it will assign effects coding (-1/1). The most basic syntax for logistic regression to match PROC GENMOD is:

    PROC LOGISTIC DATA=one descending;
      CLASS gender / PARAM=glm;
      MODEL response = gender / expb;
      TITLE1 "Logistic Regression with LOGISTIC";
    RUN;

One important feature of GENMOD and LOGISTIC shown here is that they both include the descending option on the PROC statement, which means the probability of the "highest" level of the response will be computed.
With binary responses coded as 0/1 this means the probability that the response equals 1 will be analyzed rather than the probability that the response equals 0 (the lowest sorted value, which is the default).

The MODEL statement for the two procedures looks similar to the left of the /; however, the choices for options in each are quite different. In GENMOD you must specify the binomial distribution (dist=binomial). The "logit" link is the default for the binomial; you may still want to enter "logit" (and not another valid choice such as "log") to remind yourself of this feature in order to produce the same results as LOGISTIC. Both procedures allow you to enter aggregated data with the events/trials notation for the response (that is, y/n implies y subjects answered "Yes" in n trials) to the left of the = sign; with this coding of the response the "descending" option no longer applies.

Confusion between output from the two procedures may result from the way each one applies the CLASS statement to categorical data. GENMOD automatically assigns dummy 0/1 coding of categorical data (like PROCs GLM or MIXED), treating the highest numerical or character code of the categorical variable as the reference category (or its formatted level, if assigned a format). With PROC LOGISTIC, classification factors are assigned effects coding by default, that is, distinct levels are coded internally with -1/1. For a two-level factor such as gender, this parameterization means the estimated coefficient computed with PROC LOGISTIC will be 1/2 the magnitude of the one produced with PROC GENMOD (adding the PARAM=glm option on the CLASS statement with LOGISTIC after a / will treat gender the same way as GENMOD). The default choice of parameterization with PROC LOGISTIC may be a carryover from how PROC CATMOD treats categorical data in logistic regression.
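To try the syntax discussed so far, the sample data can be read in both the subject-level and the aggregated form. The following is a minimal sketch: the dataset name one_agg is hypothetical (only 'one' is named in the article), and the numeric gender column is called gender_n to match the later PROC LOGISTIC example. The two procedure calls illustrate the events/trials notation described above.

```sas
/* Subject-level form of the sample data: one row per subject. */
DATA one;
  INPUT response gender $ gndrF gndrM gender_n;
  DATALINES;
1 F 1 0 1
1 F 1 0 1
0 F 1 0 1
0 F 1 0 1
0 F 1 0 1
1 M 0 1 2
1 M 0 1 2
1 M 0 1 2
0 M 0 1 2
;
RUN;

/* Aggregated form: y "Yes" responses out of n subjects per gender.
   The name one_agg is an assumption made for this sketch. */
DATA one_agg;
  INPUT y n gender $;
  DATALINES;
2 5 F
3 4 M
;
RUN;

/* Events/trials syntax. No descending option is used here because
   y/n already identifies which outcome is counted as the event. */
PROC GENMOD DATA=one_agg;
  CLASS gender;
  MODEL y/n = gender / dist=binomial link=logit;
RUN;

PROC LOGISTIC DATA=one_agg;
  CLASS gender / PARAM=glm;   /* GLM (0/1) coding to match GENMOD */
  MODEL y/n = gender;
RUN;
```

Because the aggregated rows carry the same information as the nine subject-level rows, both calls should return the same coefficient estimates as the subject-level syntax with the descending option.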
Results from the basic syntax for the GENMOD and LOGISTIC procedures described above estimate the following model coefficients from the sample data:

    Parameter     Estimate
    Intercept       1.0986
    gender  F      -1.5041
    gender  M       0

Since gender "M" is the reference category and the coefficient for "F" is negative, the odds that the response equals 1 are larger for males than for females. That is, the odds ratio for females relative to males is EXP(-1.5041) = 0.222, which implies males are more likely to have given a response of 1 than females.

What about coding gender with numbers 1 and 2? Suppose you code gender with 1=Females and 2=Males and analyze the response with gender coded numerically. The syntax for LOGISTIC (and GENMOD) is the same as above with no CLASS statement:

    ODS SELECT parameterestimates;
    PROC LOGISTIC DATA=one descending;
      MODEL response = gender_n;
    RUN;

    Parameter   DF   Estimate
    Intercept    1    -1.9095
    gender_n     1     1.5041

You'll first notice the signs of both the intercept and "slope" are the opposite of those shown before. This is because 1=Female is effectively the reference category in this coding scheme. The coefficient for gender_n represents the change in the logit for a 1-unit increase in gender_n, that is, from 1 (female) to 2 (male). The slope is the same when numerically coding the data 0/1 (only the intercept changes); it would be quite different if one were to code levels of gender as 1=Female and 3=Male.

Also note that because LOGISTIC works with effects coding for variables on the CLASS statement, entering the same commands and treating gender_n as categorical will give very different numerical results:

    PROC LOGISTIC DATA=one descending;
      CLASS gender_n;
      MODEL response = gender_n;
    RUN;

    Parameter      Estimate
    Intercept        0.3466
    gender_n  1     -0.7520

In this case, males have returned to the reference category, and because effects coding (-1,1) was applied, one needs to multiply the coefficient estimate by 2 to match the values shown earlier.
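All of the estimates above can be checked by hand from the aggregated counts (2 of 5 females and 3 of 4 males answered "Yes"), because a model with one two-level factor reproduces the observed log odds in each group exactly. The symbols \ell_F and \ell_M are notation assumed here for the two sample logits.

```latex
\begin{aligned}
\ell_F &= \log\frac{2}{3} = -0.4055, \qquad
\ell_M = \log\frac{3}{1} = 1.0986 \\[4pt]
\text{GLM coding (M reference):}\quad
  & \text{Intercept} = \ell_M = 1.0986, \quad
    \text{gender F} = \ell_F - \ell_M = \log\tfrac{2}{9} = -1.5041, \quad
    e^{-1.5041} = \tfrac{2}{9} = 0.222 \\[4pt]
\text{Numeric 1/2 coding:}\quad
  & \text{slope} = \ell_M - \ell_F = 1.5041, \quad
    \text{Intercept} = \ell_F - \text{slope} = \log\tfrac{4}{27} = -1.9095 \\[4pt]
\text{Effects coding } (+1 = F,\; -1 = M)\text{:}\quad
  & \text{Intercept} = \tfrac{1}{2}(\ell_F + \ell_M) = \tfrac{1}{2}\log 2 = 0.3466, \quad
    \text{coefficient} = \tfrac{1}{2}(\ell_F - \ell_M) = -0.7520
\end{aligned}
```

The three parameterizations describe the same two fitted logits; only the way the difference between them is split between the intercept and the gender coefficient changes.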
Needless to say, not understanding the difference in coding assumptions and how the CLASS statement works can generate what on the surface appear to be conflicting results.

Changing the Level of the Reference Category

One feature of the CLASS statement in LOGISTIC is the set of options that change the default parameterization to match GENMOD. You may also want to enter an option to change the reference category of a categorical variable in the CLASS statement. For example, attach (descending) to choose Females to serve as the reference category for gender instead of Males.

    PROC GENMOD DATA=one descending;
      CLASS gender(descending);
      MODEL response = gender / dist=binomial link=logit;
    RUN;

In this example, the descending option on the PROC statement implies the probability of a 1 is the outcome of interest. The (descending) attached to gender on the CLASS statement implies Females (coded F) will serve as the reference category.

In PROC LOGISTIC you can also specify your choice for the reference category with the REF= option, which will apply to all categorical variables listed on the CLASS statement:

    CLASS gender / PARAM=ref REF=first;

The REF=first option is of value if you code gender as 'F' and 'M' and you want to compare Males to Females (i.e., females are the reference category). Without the REF=first option, PROC LOGISTIC will compare Females to Males. Perhaps future versions of SAS will allow this convention for categorical data with other procedures, such as GLM and MIXED. For now, one needs to pursue other ways to assign reference categories of your choice. Understanding how these options work is essential for interpreting the signs of estimated coefficients.

Another nice feature of PROC LOGISTIC is the ability to change the reference category for the response variable with the MODEL statement.
Assume you have response coded with 0/1 and assigned 0=no/1=yes with a format:

    PROC FORMAT;
      VALUE gnd 0='No' 1='Yes';
    RUN;

    PROC LOGISTIC DATA=two;
      CLASS gender / param=glm;
      MODEL response(ref='No') = gender;
      FORMAT response gnd.;
    RUN;

This flexibility is helpful when analyzing response data with more than two levels and you need to specify a level as the reference category for which the natural order of the data or the descending option doesn't work.

Perceptive followers of SAS code for logistic regression models who have run the two procedures on the same data set may notice differences in the computations of p-values. PROC LOGISTIC gives Wald type test statistics for the parameter estimates. However, likelihood ratio tests along with the Wald tests are available with PROC GENMOD, which you can obtain by inserting the option 'type3' on the MODEL statement and entering a CONTRAST statement:

    PROC GENMOD DATA=one descending;
      CLASS gender;
      MODEL response = gender / dist=binomial link=logit type3;
      CONTRAST 'gender: lr test' gender 1 -1;
      CONTRAST 'gender: wald test' gender 1 -1 / wald;
      ESTIMATE 'gender odds ratio' gender 1 -1 / exp;
      TITLE1 "Logistic Regression with GENMOD";
    RUN;

The type of test from the CONTRAST statement by default is the profile likelihood ratio (LR) test. You may request Wald tests from CONTRAST statements by adding "wald" as an option on this statement as indicated above. The Wald test for each individual contrast then gives the same p-value as the one returned in the table of parameter estimates. These two tests are based on specific theoretical approaches which are asymptotically equivalent, yet may give very different results, often observed with relatively small sample sizes. The Wald test will likely compute a larger p-value than the corresponding LR test, especially when the logit coefficient is large.

Yet another way to compute logistic regression equations is PROC NLMIXED.
However, it is often most useful when you need to enter the formula that will compute maximum likelihood estimates yourself (as occurs in zero-inflated or truncated models for count data). It also requires that you dummy code all categorical data with 0/1 values for each level (in a DATA step or with PROC GLMMOD), as it does not have a CLASS statement. Although it has a built-in binomial function, it is more instructive to observe how one can write statements to analyze dichotomous data with logistic regression directly with the likelihood equation.

    PROC NLMIXED DATA=one;
      PARMS intrc = -.1 _gender = -.1;
      eta = intrc + ( _gender * gndrF);
      prb_1 = EXP(eta) / (1 + EXP(eta));
      liklhd = (prb_1**response) * ((1-prb_1)**(1-response));
      loglik = LOG(liklhd);
      MODEL response ~ GENERAL(loglik);
    RUN;

The PARMS statement sets initial values for the parameter estimates. If not present, they are automatically set to 1. With more complicated models and datasets, it is important to have starting values that will enable the estimation process to converge. In fact, some combinations of initial values may cause the program to stop without any output. Finding different starting values, perhaps with approximations from other procedures, may help.

The equation for eta is the linear predictor, a function of the intercept and dummy coded gender (here the entry of gndrF implies "Males" is treated as the reference category, the same assumption as GENMOD or LOGISTIC). Eta enters the formula to compute prb_1, the probability that response=1; entering this formula produces the same result as the descending option discussed above. The log-likelihood obtained from the probability equation is then maximized to compute the parameter estimates in Table 1.

Which logistic regression procedure is best for your data?
The answer to this question depends on many factors, but essentially here are the major differences among the three procedures discussed:

PROC GENMOD presents a unified approach to the analysis of categorical data, including Poisson and Negative Binomial (for counts), gamma, and normally distributed data (for normally distributed data, PROCs GLM, REG, or MIXED will likely work much better). It also handles repeated measures for count data much like PROC MIXED works with repeated measures for continuous data.

PROC LOGISTIC is designed for regression applications with one dichotomous response collected from each subject or several independent responses aggregated over subjects or trials for each group (binomial). It is the procedure designed to compute ROC curves. It can also compute "exact" logistic regression results when you have small sample sizes or when 0 counts appear in some of the cells of a table showing the distribution of the explanatory categorical data (a technique that may be of value when given the warning "quasi-complete separation" in the log file).

PROC NLMIXED provides great flexibility in writing program statements that allow you to start with logistic regression and expand into more complicated models. For example, it computes random effects models for count data and allows you to enter formulas for non-standard probability distributions (much like this dichotomous data example). One of its most interesting applications is the ability to compute estimates for zero-inflated models. In this situation an extra large number of legitimate zeros appear in your dataset such that overdispersion becomes a concern. PROC GENMOD has an option to work with this problem, namely the scale= option on the MODEL statement, although a zero-inflated model may be a more attractive solution. Note that this problem is sometimes mistaken for censored data (the value observed is an upper or lower bound for the actual value), such as those analyzed with Tobit models.
Legitimate 0's and censored observations are two very different problems and NLMIXED allows you to treat both analysis approaches as such (see Chapters 7 and 8 of Long).

References:

Allison, P.D. (1999). Logistic Regression Using the SAS System: Theory and Application. Cary, NC: SAS Institute Inc.

Hosmer, D.W. and Lemeshow, S. (2002). Applied Logistic Regression, Second Ed. New York: John Wiley & Sons.

Long, J. Scott. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.

Lunney, Gerald H. (Winter, 1970). "Using Analysis of Variance with a Dichotomous Dependent Variable: An Empirical Study". Journal of Educational Measurement, Vol. 7, #4, pp. 263-269.

Menard, Scott. (1995). Applied Logistic Regression Analysis (Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-106). Thousand Oaks, CA: Sage.