## Subset selection in regression: the bad news

DM STAT-1 CONSULTING BRUCE RATNER, PhD
574 Flanders Drive North Woodmere, NY 11581 [email protected]
516.791.3544 fax 516.791.5075 1 800 DM STAT-1 www.dmstat1.com
2) Statistical criteria (e.g., R-squared, adjusted R-squared, Mallows’ Cp and MSE [3.1])
3) Statistical stopping rules (e.g., p-value flags for variable entry, deletion, or staying in a model)
The body of unconfirmed thinking that grew up around the newly developed variable selection methods took root in the fertile soil of expertise and adroitness in computer-automated, misguided statistics. The trinity distorts its components’ original theoretical and inferential meanings when they are framed within the newborn methods. The statistician executing the computer-driven trinity of statistical apparatus in a seemingly intuitive, insightful way gave proof – face validity – that the problem of variable selection, aka subset selection, was solved (at least to the uninitiated statistician). The newbie subset selection methods initially enjoyed wide acceptance with extensive use, and presently still do. Statisticians put the accuracy and stability of their models at risk, either unknowingly using these unconfirmed methods or knowingly exercising them because they know not what else to do.

It was not long before these methods’ weaknesses, some contradictory, generated many commentaries in the literature. I itemize nine ever-present weaknesses, below, for two of the traditional variable selection methods, All-subset and Stepwise. I concisely describe the five frequently used variable selection methods in the next section.

1. For All-subset selection with more than 40 variables: [3]
   a. The number of possible subsets can be huge.
   b. Often, there are several good models, although some are unstable.
   c. The best X variables may be no better than random variables if the sample size is small relative to the number of all variables.
   d. The regression statistics and regression coefficients are biased.
2. All-subset selection regression can yield models that are too small. [4]
3. Why the number of candidate variables and not the number in the final model is the number of degrees of freedom to consider. [5]
4. The data analyst knows more than the computer … and failure to use that knowledge
5. Stepwise selection yields confidence limits that are far too narrow. [7]
6. Regarding frequency of obtaining authentic and noise variables … The degree of correlation among the predictor variables affected the frequency with which authentic predictor variables found their way into the final model. The number of candidate predictor variables affected the number of noise variables that gained entry to the model. [8]
7. Stepwise selection will not necessarily produce the best model if there are redundant
8. There are two distinct questions here: (a) When is Stepwise selection appropriate?
9. As to question (b) above … there are two groups that are inclined to favor its usage. One consists of individuals, with little formal training in data analysis, who confuse knowledge of data analysis with knowledge of the syntax of SAS, SPSS, etc. They seem to figure that if it's there in a program, it's gotta be good and better than actually thinking about what my data might look like. They are fairly easy to spot and to condemn in a right-thinking group of well-trained data analysts. However, there is also a second group who is often well trained …. They believe in statistics … given any properly obtained database, a suitable computer program can objectively make substantive inferences without active consideration of the underlying hypotheses. … Stepwise selection is the parent of this line of blind data analysis …. [11]

Currently, there is burgeoning research that continues the original efforts of subset selection by shoring up its pseudo-theoretical foundation. It follows a line of examination that adds assumptions and makes modifications for eliminating the weaknesses. As the traditional methods are being mended, there are innovative approaches with starting points far afield from their traditional counterparts. There are freshly minted methods, like the enhanced variable selection method built into the GenIQ Model, constantly being developed. [12] [13] [14] [15]

### II. Introduction

Variable selection in regression – identifying the best subset among many variables to include in a model – is arguably the hardest part of model building. Many variable selection methods exist because variable selection addresses one of the most important problems in statistics. [16] [17] Many statisticians know these methods, but few know that they produce poorly performing models. The wanting variable selection methods are a miscarriage of statistics because they are developed by debasing sound statistical theory into a misguided pseudo-theoretical foundation. They are executed with computer-intensive search heuristics guided by rules-of-thumb. Each method uses a unique trio of elements, one from each component of the trinity of selection-components. [18] Different sets of elements typically produce different subsets. The number of variables common to the different subsets is small, and the sizes of the subsets can vary considerably.

An alternative view of the problem of variable selection is to examine certain subsets and select
the best subset, which either maximizes or minimizes an appropriate criterion. Two subsets are
obvious – the best single variable and the complete set of variables. The problem lies in selecting
an intermediate subset that is better than both of these extremes. Therefore, the issue is how to find
the necessary variables among the complete set of variables by deleting both irrelevant variables
(variables not affecting the dependent variable), and redundant variables (variables not adding
anything to the dependent variable). [19]
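To give a sense of the size of the search space behind "examine certain subsets," the sketch below counts and enumerates candidate subsets. The function name `n_subsets` and the four variable names are illustrative assumptions, not part of any package discussed here.

```python
from itertools import combinations
from math import comb


def n_subsets(p):
    """Number of non-empty subsets of p candidate variables."""
    return 2 ** p - 1


# Hypothetical candidate variables, for illustration only.
names = ["X1", "X2", "X3", "X4"]

# Enumerate every non-empty subset, grouped by size k = 1, ..., p.
all_subsets = [s for k in range(1, len(names) + 1)
               for s in combinations(names, k)]

# The two "obvious" subsets in the text are the extremes of this list:
# a single variable (k = 1) and the complete set (k = p).
single_variable_subsets = [s for s in all_subsets if len(s) == 1]
complete_set = all_subsets[-1]

# Counting by size agrees with the closed form 2^p - 1.
count_by_size = sum(comb(len(names), k) for k in range(1, len(names) + 1))

print(len(all_subsets), n_subsets(len(names)))
print(n_subsets(40))
```

With only 4 candidates there are 15 subsets to examine; with 40 candidates the count exceeds one trillion, which is why exhaustive examination quickly gives way to search heuristics.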
I review five frequently used variable selection methods. These everyday methods are found in
major statistical software packages. [20] The test-statistic for the first three methods is either the
F statistic for a continuous dependent variable, or the G statistic for a binary dependent variable.
The test-statistic for the fourth method is either R-squared for a continuous dependent variable, or
the Score statistic for a binary dependent variable. The last method uses one of the criteria: R-
1. Forward Selection (FS) - This method adds variables to the model until no remaining variable (outside the model) can add anything significant to the dependent variable. FS begins with no variable in the model. For each variable, the test-statistic (TS), a measure of the variable’s contribution to the model, is calculated. The variable with the largest TS value that is greater than a preset value C is added to the model. Then the test-statistic is calculated again for the variables still remaining, and the evaluation process is repeated. Thus, variables are added to the model one by one until no remaining variable produces a TS value that is greater than C. Once a variable is in the model, it remains there.
2. Backward Elimination (BE) - This method deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. BE begins with a model that includes all variables. Variables are then deleted from the model one by one until all the variables remaining in the model have TS values greater than C. At each step, the variable showing the smallest contribution to the model (i.e., with the smallest TS value that is less than C) is deleted.
3. Stepwise (SW) - This method is a modification of the forward selection approach and differs in that variables already in the model do not necessarily stay. As in Forward Selection, SW adds variables to the model one at a time. Variables that have a TS value greater than C are added to the model. After a variable is added, however, SW looks at all the variables already included to delete any variable that does not have a TS value greater than C.
4. R-squared (R-sq) - This method finds several subsets of different sizes that best predict the dependent variable, based on the appropriate TS. The best subset of size k has the largest TS value. For a continuous dependent variable, TS is the popular measure R-squared, the coefficient of multiple determination, which measures the proportion of the variance in the dependent variable explained by the multiple regression. For a binary dependent variable, TS is the theoretically correct but less-known Score statistic [21]. R-sq finds the best one-variable model, the best two-variable model, and so forth. However, it is unlikely that one subset will stand out as clearly being the best, as TS values are often bunched together. For example, they may be equal in value when rounded to, say, the third decimal place. [22] R-sq generates a number of subsets of each size, which allows the user to select a subset, possibly using nonstatistical conditions.
5. All-possible Subsets - This method builds all one-variable models, all two-variable models, and so on, until the last all-variable model is generated. The method requires a powerful
computer (because a lot of models are produced), and selection of any one of the criteria: R-
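The Forward Selection steps described above can be sketched in a few lines for a continuous dependent variable. This is a minimal illustration under stated assumptions, not the implementation found in any statistical package: the TS is taken to be the partial F statistic for adding one variable, the preset value C defaults to a hypothetical 4.0, the least-squares fit is a plain normal-equations solver written for self-containment, and the data are synthetic.

```python
def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))] for j in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(a[r][k]))
        a[k], a[pivot] = a[pivot], a[k]
        for r in range(k + 1, p):
            factor = a[r][k] / a[k][k]
            for c in range(k, p + 1):
                a[r][c] -= factor * a[k][c]
    # Back substitution for the coefficients.
    beta = [0.0] * p
    for k in range(p - 1, -1, -1):
        tail = sum(a[k][c] * beta[c] for c in range(k + 1, p))
        beta[k] = (a[k][p] - tail) / a[k][k]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_selection(data, y, threshold=4.0):
    """Add variables one by one while the best partial F exceeds threshold (C).

    data maps variable names to lists of values; returns names in entry order.
    """
    n = len(y)
    selected = []
    current_rss = ols_rss([], y)  # intercept-only model
    while len(selected) < len(data):
        df_resid = n - len(selected) - 2  # n minus parameters in candidate model
        if df_resid <= 0:
            break
        best_name, best_f, best_rss = None, threshold, None
        for name in data:
            if name in selected:
                continue
            new_rss = ols_rss([data[m] for m in selected + [name]], y)
            if new_rss <= 1e-12:          # perfect fit: treat TS as infinite
                f_stat = float("inf")
            else:
                f_stat = (current_rss - new_rss) / (new_rss / df_resid)
            if f_stat > best_f:
                best_name, best_f, best_rss = name, f_stat, new_rss
        if best_name is None:             # no remaining variable has TS > C
            break
        selected.append(best_name)        # once in the model, it stays
        current_rss = best_rss
    return selected


# Synthetic data (an assumption for illustration): y depends on x1 only.
x1 = [float(i) for i in range(1, 13)]
x2 = [1.0, -1.0] * 6
x3 = [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0, 6.0, 10.0, 12.0, 11.0]
noise = [0.5, -0.3, 0.1, -0.4, 0.2, 0.3, -0.5, 0.4, -0.1, -0.2, 0.3, -0.3]
y = [3.0 * a + e for a, e in zip(x1, noise)]
data = {"x1": x1, "x2": x2, "x3": x3}

selected = forward_selection(data, y)
print(selected)
```

Backward Elimination and Stepwise differ only in the direction of the search and in whether entered variables are re-tested. In all three, the F values are compared with C as if each test had been specified in advance, which is the root of several weaknesses itemized in this article.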
### III. Weakness in the Stepwise

An ideal variable selection method for regression models would find one or more subsets of variables that produce an optimal model. [22.1] Its objectives are that the resultant models be accurate, stable, parsimonious, and interpretable, and avoid bias in drawing inferences. Needless to say, the above methods do not satisfy most of these goals. Each method has at least one drawback specific to its selection criterion. In addition to the nine weaknesses mentioned above, I itemize a compiled list of weaknesses of the most popular Stepwise method. [23]

1. It yields R-squared values that are badly biased high.
2. The F and chi-squared tests quoted next to each variable on the printout do not have the
3. The method yields confidence intervals for effects and predicted values that are falsely nar-
4. It yields p-values that do not have the proper meaning and the proper correction for them is
5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining
6. It has severe problems in the presence of collinearity.
7. It is based on methods (e.g., F tests) that were intended to be used to test pre-specified hy-
8. Increasing the sample size doesn't help very much.
9. It allows us to not think about the problem.
11. The number of candidate predictor variables affected the number of noise variables that

I add to the tally of weaknesses by stating common weaknesses in regression models, as well as those specifically related to the OLS regression model and the LRM:

The everyday variable selection methods in regression typically result in models with too many variables, an indicator of overfitting. The prediction errors, which are inflated by outliers, are not stable. Thus, model implementation results in unsatisfactory performance. For ordinary least squares regression, it is well known that variable selection methods perform poorly when the normality or linearity assumption fails or when outliers are present in the data. For logistic regression, the computer-automated variable selection models are unstable and not reproducible. The variables selected as predictor variables in the models are sensitive to unaccounted-for sample variation in the data.

Given the litany of weaknesses cited, the lingering question is: Why do statisticians use variable selection methods to build regression models? To paraphrase Mark Twain: “Get your [data] first, and then you can distort them as you please.” [23.1] The author’s answer is: “Modelers use variable selection methods every day because they can.” As a counterpoint to the absurdity of “because they can,” I enliven anew Tukey’s solution, the Natural Seven-step Cycle of Statistical Modeling and Analysis, for defining a substantially performing regression model. I feel that newcomers to Tukey’s EDA need the Seven-step Cycle introduced within the narrative of Tukey’s analytic philosophy. Accordingly, I enfold the solution with front and back matter – The Essence of EDA, and The EDA School of Thought, respectively. I delve into the trinity of Tukey’s masterwork; but first I discuss, below, an enhanced variable selection method. I might be the only exponent of appending this method to the current baseless arsenal of variable selection.

### IV. Enhanced Variable Selection Method

In lay terms, the variable-selection problem in regression can be stated: Find the best combination of the original variables to include in a model. The variable selection method neither states nor implies that it has an attribute to concoct new variables stirred up by mixtures of the original variables. The attribute – data mining – is either overlooked, perhaps because it is reflective of the simple-mindedness of the problem-solution at the onset, or is currently sidestepped because the problem is too difficult to solve. A variable selection method without a data mining attribute obviously hits a wall; getting beyond that wall would increase the predictiveness of the technique. In today’s terms, the variable selection methods are without data mining capability. They cannot dig the data for the mining of potentially important new variables. (This attribute, which has never surfaced during my literature search, is a partial mystery to me.) Accordingly, I put forth a definition of an enhanced variable selection method:

An enhanced variable selection method is one that identifies a subset that consists of the original variables and data-mined variables, whereby the latter are a result of the data-mining attribute of the method itself.

The following five discussion-points clarify the attribute-weakness and illustrate the concept of an enhanced variable selection method.

1. Consider the complete set of variables, X1, X2, ..., X10. Any of the variable selection methods in current use finds the best combination of the original variables (say X1, X3, X7, X10), but it can never automatically transform a variable (say transform X1 to log X1) if that were needed to increase the information content (predictive power) of that variable. Furthermore, none of
the methods can generate a re-expression of the original variables (perhaps X3/X7) if the
constructed variable, structure, were to offer more predictive power than the original
component variables combined. In other words, current variable selection methods cannot
find an enhanced subset, which needs, say, to include transformed and re-expressed
variables (possibly X1, X3, X7, X10, logX1, X3/X7). A subset of variables without the
potential of new structure offering more predictive power clearly limits the modeler in
building the best model.
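To make the idea of an enhanced subset concrete, the sketch below widens a candidate pool with log transforms and pairwise ratios and then scores each candidate on its own. It is only an illustration of the concept, not the GenIQ Model's method: the function names, the choice of transformations, the single-variable squared-correlation score, and the synthetic data are all assumptions.

```python
import math
from itertools import permutations


def pearson_r2(x, y):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def enhanced_candidates(data):
    """Original variables plus log transforms and pairwise ratios."""
    candidates = dict(data)
    for name, col in data.items():
        if all(v > 0 for v in col):              # log needs positive values
            candidates[f"log({name})"] = [math.log(v) for v in col]
    for a, b in permutations(data, 2):           # both a/b and b/a
        if all(v != 0 for v in data[b]):         # avoid division by zero
            candidates[f"{a}/{b}"] = [u / v for u, v in zip(data[a], data[b])]
    return candidates


# Synthetic data (an assumption): the dependent variable is exactly log X1,
# so no subset of the original variables alone can match the transformed one.
data = {
    "X1": [1.0, 2.0, 4.0, 8.0, 16.0, 32.0],
    "X3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    "X7": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0],
}
y = [math.log(v) for v in data["X1"]]

candidates = enhanced_candidates(data)
scores = {name: pearson_r2(col, y) for name, col in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores["X1"], 3))
```

A conventional method given only X1, X3, and X7 would never see `log(X1)` or `X3/X7` as candidates. Here they enter the pool automatically, and the transformed variable outscores its original.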