Linear Regression
From Clariopedia
Overview
The clario® node Linear regression is used to build a model for a continuous dependent attribute. The resulting model equation can be used to create a prediction score, based on one or more predictor attributes. Note that the dependent attribute must have been defined as a numeric attribute in a previous read node.
Usage
Input Stream
The node connector can be connected to a variety of nodes (e.g., Read, Aggregate, Append, Missing), but it requires a valid stream of data.
Configuration
The Linear Regression node has two configuration faces: Configuration and Attribute Selection.
Configuration Faces
The Configuration face has two drop-down boxes: Dependent Attribute and Attribute Selection Method. Both selections must be completed before a workflow can run. Choices for the Attribute Selection Method are None and Stepwise. If Stepwise is chosen, a box called "Stepwise Selection Options" appears below the Attribute Selection Method drop-down box. In this box, you can set the Maximum p to Enter and Maximum p to Remove values for the stepwise regression.
Attribute Selection
On the Attribute Selection face, you choose the attributes for the model by clicking them in the Available Attributes box and dragging them into the appropriate list box.
If the Selection Method is 'None', attributes must be selected for entry into the model: click the attribute(s) in the Available Attributes box and drag and drop them into the Force Entry Attributes box. If the Selection Method is 'Stepwise', click the attribute(s) in the Available Attributes box and drag and drop them into the Candidate Attributes box. If there are any attributes that you wish to force into the model, drag and drop them from the Available Attributes box into the Force Entry Attributes box.
NOTES:
- To find attribute names efficiently, begin typing an attribute name in the text box directly above Available Attributes. You will then be directed to the attribute(s) beginning with the letter(s) you type.
- To select multiple attributes at once, use [Ctrl]+click to select them one at a time, [Shift]+down arrow to select them in order of appearance, or [Shift]+click on the first and the last attribute to select every attribute in between.
- To de-select an attribute, click on it in the Candidate/Force Entry Attributes box and drag and drop it into the Available Attributes box.
- Attributes in the Candidate/Force Entry Attributes list can be re-ordered by clicking and holding on an attribute and dragging it to the desired position within the Candidate/Force Entry Attributes box.
Field Definitions
- Valid Inputs – You must link to a valid data stream (e.g., Read, Append, Filter).
- Attributes – You must provide a dependent attribute.
- Attributes – You must select at least one Candidate/Forced Attribute.
- "Dependent Attribute" must be numeric.
- "Available Attributes" must be numeric.
- "Candidate Attributes" must be numeric.
- "Force Entry Attributes" must be numeric.
- "Dependent Attribute" cannot be <NULL>.
Selection Method - Stepwise
- Attributes in "Dependent Attribute" cannot be in "Available Attributes," "Candidate Attributes", or "Force Entry Attributes."
- At least 1 attribute must be in "Candidate Attributes".
Selection Method - None
- At least 1 attribute must be in "Force Entry Attributes."
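The rules above can be read as a simple checklist. The following is a minimal sketch of that checklist in Python; the function `validate_config` and its parameters are hypothetical names used for illustration only and are not part of the clario product.

```python
def validate_config(method, dependent, candidates, forced, numeric_attributes):
    """Return a list of rule violations (an empty list means the setup is valid).

    All names here are hypothetical; this only mirrors the field definitions.
    """
    errors = []
    # A dependent attribute must be provided and must be numeric.
    if dependent is None:
        errors.append("Dependent Attribute cannot be <NULL>.")
    elif dependent not in numeric_attributes:
        errors.append("Dependent Attribute must be numeric.")
    # Every selected predictor must be numeric and distinct from the dependent.
    for name in list(candidates) + list(forced):
        if name not in numeric_attributes:
            errors.append(f"{name} must be numeric.")
        if name == dependent:
            errors.append(f"{name} cannot also be the Dependent Attribute.")
    # Method-specific minimums.
    if method == "Stepwise":
        if not candidates:
            errors.append("At least 1 attribute must be in Candidate Attributes.")
    elif method == "None":
        if not forced:
            errors.append("At least 1 attribute must be in Force Entry Attributes.")
    else:
        errors.append("Attribute Selection Method must be None or Stepwise.")
    return errors
```

For example, `validate_config("None", "y", [], ["x1"], {"y", "x1"})` returns an empty list, while choosing Stepwise with no candidate attributes returns one violation.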
Results
There is one results face with two different tabs (Detailed Results and Step History) for the Linear Regression node.
- Detailed Results: This pop-up box contains statistics such as R² and Adjusted R², Standard Error of Estimate, and Dependent Mean for each model step. It also contains, for each model attribute: name, regression coefficient, standard error, standardized coefficient, t value, p-value, and tolerance. This pop-up also contains an ANOVA table which displays the F-statistic and corresponding p-value.
- Step History (for stepwise method only): Contains one row of data for each step in the model building process. Each step lists the attribute entered or removed along with the step on which it was entered or removed and the resulting model R² for that step.
Output Stream
The results from the Linear Regression node can be read into Write, Score, and Evaluate nodes. The results tables can also be exported into Excel.
Statistical Algorithm Reference
Because the clario framework does not make any assumptions about the length or width of the raw input data, we do not use any estimator that requires the full design matrix (X) and the vector of values of the dependent variable (Y) to be loaded into memory or written to disk, such as the standard regression estimator:
$$\hat{\beta} = (X'X)^{-1} X'Y$$
nor do we use the standard computation technique of singular value decomposition on the raw data matrices to handle cases of extreme data redundancy.
Clario solves for the vector of regression coefficients β using basic ordinary least squares (OLS) techniques, together with corrective techniques for multicollinearity, but in a way that does not impose a priori conditions on the size of the data stream. A single pass is used to yield three components that are sufficient for producing the desired regression output:
- A correlation matrix of all selected predictor variables, $R_{xx}$
- A vector of correlations between the predictor variables and the dependent variable, $r_{yx}$
- Univariate statistics, including the total sum of squares $SS_{TOT}$, on all independent and dependent variables.
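The single pass described above can be sketched as follows. This is an illustrative Python sketch, not clario's implementation: it assumes each row is a tuple of predictor values followed by the dependent value, and it accumulates raw sums and cross-products, which is simple but less numerically stable than an incremental (Welford-style) update. The function name `single_pass_stats` and the dictionary keys are assumptions made for this example.

```python
import math

def single_pass_stats(rows):
    """Accumulate sufficient statistics in one pass over the data.

    rows: iterable of tuples (x_1, ..., x_k, y); the last value is the
    dependent variable. Only O(k^2) running totals are kept in memory.
    """
    n, sums, cross = 0, None, None
    for row in rows:
        m = len(row)
        if sums is None:
            sums = [0.0] * m
            cross = [[0.0] * m for _ in range(m)]
        n += 1
        for i in range(m):
            sums[i] += row[i]
            for j in range(i, m):
                cross[i][j] += row[i] * row[j]
    m = len(sums)
    means = [s / n for s in sums]
    # Corrected sums of squares and cross-products (SSCP).
    sscp = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            sscp[i][j] = sscp[j][i] = cross[i][j] - n * means[i] * means[j]
    sds = [math.sqrt(sscp[i][i] / (n - 1)) for i in range(m)]
    corr = [[sscp[i][j] / math.sqrt(sscp[i][i] * sscp[j][j]) for j in range(m)]
            for i in range(m)]
    k = m - 1  # number of predictors
    return {
        "n": n,
        "means": means,
        "sds": sds,
        "R_xx": [r[:k] for r in corr[:k]],        # predictor correlations
        "r_yx": [corr[i][k] for i in range(k)],   # predictor-dependent correlations
        "ss_tot": sscp[k][k],                     # total sum of squares of y
    }
```

Because `rows` can be any iterable (for example a generator reading a file line by line), the full design matrix is never held in memory.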
In cases where $R_{xx}$ is not ill-conditioned, the vector of standardized regression coefficients β is solved using:
$$\beta = R_{xx}^{-1} r_{yx}$$
In cases where the $R_{xx}$ matrix is ill-conditioned, clario will keep all of the chosen predictor variables in the model and will automatically try again using a generalized inverse [REFERENCE]. The generalized inverse yields a linear regression solution regardless of the condition of the $R_{xx}$ matrix, but the resulting regression coefficients might not be properly interpretable without first removing or accounting for the extreme redundancy. It is your choice whether to accept that solution or to reject it; you may instead elect to eliminate the problem at its root, perhaps by creating composite scales or by removing some unnecessary variables. In any case, clario notifies you through the results log when the generalized inverse option has been forced into effect.
When β has been solved, the raw coefficients are computed using the formula:
$$b_i = \beta_i \frac{s_y}{s_i}$$
where $s_i$ is the standard deviation of variable i and $s_y$ is the standard deviation of y.
The R² statistic is then computed by: $R^2 = \beta' r_{yx}$
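As an illustration of the well-conditioned case, the sketch below solves for the standardized coefficients from the correlation matrix, rescales them to raw coefficients, and computes R². It is a minimal sketch under stated assumptions: the names `solve` and `fit_from_correlations` are hypothetical, the intercept uses the usual OLS identity (mean of y minus the weighted predictor means), which the text above does not spell out, and the generalized-inverse fallback is not shown; the solver simply raises an error when the matrix is singular.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("R_xx is singular or ill-conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def fit_from_correlations(R_xx, r_yx, sds_x, sd_y, means_x, mean_y):
    """Return (standardized coefs, raw coefs, intercept, R^2)."""
    beta = solve(R_xx, r_yx)                           # beta = R_xx^{-1} r_yx
    raw = [b * sd_y / s for b, s in zip(beta, sds_x)]  # b_i = beta_i * s_y / s_i
    # Assumption: standard OLS intercept, not stated in the text above.
    intercept = mean_y - sum(b * m for b, m in zip(raw, means_x))
    r2 = sum(b * r for b, r in zip(beta, r_yx))        # R^2 = beta' r_yx
    return beta, raw, intercept, r2
```

Note that everything here is computed from the summary statistics alone; the raw data rows are not needed again after the single pass.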
If we let the number of independent variables be called k and the number of rows in the data N, the basic statistics computed above are used to produce all remaining clario results including the ANOVA table:
Computation of the ANOVA Table with k predictors and N rows:
| Source | df | SS | MS | F | p(F) |
| Regression | k | $SS_{REG} = R^2 (SS_{TOT})$ | $MS_{REG} = SS_{REG} / df_{REG}$ | $F = MS_{REG} / MS_{RES}$ | |
| Residual | N − k − 1 | $SS_{RES} = (1 - R^2)(SS_{TOT})$ | $MS_{RES} = SS_{RES} / df_{RES}$ | | |
| TOTAL | N − 1 | $SS_{TOT}$ | $MS_{TOT} = SS_{TOT} / df_{TOT}$ | | |
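The table can be filled in from just R², the total sum of squares, N, and k. The following Python sketch shows the arithmetic; the function name `anova_table` and the dictionary layout are assumptions for illustration. The p-value column is left out because it needs the upper tail of the F distribution, which the Python standard library does not provide (a statistics package such as SciPy's `scipy.stats.f.sf` could supply it).

```python
def anova_table(r2, ss_tot, n, k):
    """Build the ANOVA quantities from R^2, SS_TOT, N rows, and k predictors."""
    df_reg, df_res, df_tot = k, n - k - 1, n - 1
    ss_reg = r2 * ss_tot              # SS_REG = R^2 (SS_TOT)
    ss_res = (1.0 - r2) * ss_tot      # SS_RES = (1 - R^2)(SS_TOT)
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {
        "regression": {"df": df_reg, "ss": ss_reg, "ms": ms_reg,
                       "f": ms_reg / ms_res},
        "residual": {"df": df_res, "ss": ss_res, "ms": ms_res},
        "total": {"df": df_tot, "ss": ss_tot, "ms": ss_tot / df_tot},
    }
```

For example, with R² = 0.75, a total sum of squares of 200, N = 23, and k = 2, the regression and residual sums of squares are 150 and 50, the mean squares are 75 and 2.5, and F is 30.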
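Mallows' Cp Statistic:
$$C_p = \frac{SS_{RES}}{MS_{RES}} - N + 2(p + 1)$$
where $MS_{RES}$ is the mean square residual with all candidate variables entered, $SS_{RES}$ is the residual sum of squares of the model with a specified subset of variables only, and p is the number of predictors in that subset.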
The standard error of the estimate:
$$s_e = \sqrt{MS_{RES}}$$
The standard errors of the coefficients:
$$SE(b_i) = \sqrt{\frac{MS_{RES} \, R^{ii}}{(N - 1) \, s_i^2}}$$
where $s_i^2$ is the variance of the variable i and $R^{ii}$ is the ith diagonal element of the inverse of $R_{xx}$.
or from raw coefficients:
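T-tests for the slopes of coefficients:
$$t_i = \frac{b_i}{SE(b_i)}$$
where $SE(b_i)$ is the standard error of the raw coefficient $b_i$, with N − k − 1 degrees of freedom.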
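Tolerance:
$$Tol_i = \frac{1}{R^{ii}}$$
where $R^{ii}$ is the ith diagonal element of the inverse of $R_{xx}$; this equals one minus the squared multiple correlation of variable i with the other predictors.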
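The per-attribute metrics can all be derived from the diagonal of the inverse correlation matrix. The sketch below is illustrative only: it uses the standard textbook forms of the standard error, t value, and tolerance, which are assumed here to match clario's formulas, and the names `invert` and `coefficient_metrics` are hypothetical. It handles only the well-conditioned case and raises an error otherwise.

```python
import math

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or ill-conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def coefficient_metrics(R_xx, raw_coefs, sds_x, ms_res, n):
    """Standard error, t value, and tolerance for each raw coefficient."""
    inv = invert(R_xx)
    metrics = []
    for i, (b, s) in enumerate(zip(raw_coefs, sds_x)):
        r_ii = inv[i][i]  # ith diagonal of the inverse of R_xx
        se = math.sqrt(ms_res * r_ii / ((n - 1) * s * s))
        metrics.append({"coef": b, "std_error": se,
                        "t": b / se, "tolerance": 1.0 / r_ii})
    return metrics
```

With two predictors correlated at 0.5, each tolerance comes out as 0.75, that is, 1 minus 0.5 squared.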
The linear regression node implements stepwise and forced entry algorithms. Forced entry is performed according to the formulae outlined above. The stepwise algorithm uses the above formulae repeatedly on subsets of variables selected from the master correlation matrix R.
Stepwise Algorithm Computation
Step 1. Stepwise attempts to compute Mean Squared Error (MSE) using all candidate variables in the equation. If the correlation matrix containing all predictor variables is not invertible using ordinary Gaussian methods, the procedure will invoke the Generalized Inverse or Pseudo Inverse for the remaining steps.
Step 2. Enter the required variables in the regression equation and compute metrics.
Step 3. The following steps are executed until variable selection is complete.
3.1 Place all non-selected variables in the candidate pool.
3.2 For each candidate variable, calculate the tolerance of the regression coefficient when it is entered into the existing model along with the other variables. If the tolerance is below 0.0001, skip the variable; otherwise continue.
3.3 Test the strength of the current test attribute's contribution to the model by computing the change-F test (Neter, Wasserman, & Kutner, 1985) of the full model (includes the variable) against the reduced model (includes only the existing variables).
3.4 If none of the variables meet the p-to-enter criterion, variable selection is complete. Otherwise, select the variable with the smallest p-value from the change-F test.
3.5 Next, test all variables in the model to see if any have lost explanatory power in the context of the added variable. This is done by computing, for each model variable, the t-test of the significance of its semi-partial regression coefficient. Remove any variable from the model and place it in the candidate pool if its p-value is higher than the p-to-remove criterion.
3.6 Compute model statistics, the ANOVA table, and the variable metrics for this step.
Step 4. Compute the model statistics, ANOVA table, and the variable metrics for the final model.
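The selection loop in Step 3 can be sketched in Python as shown below. This is a simplified illustration, not clario's code. It assumes the correlation matrix is well-conditioned (Step 1 and the generalized inverse are omitted), that forced variables are never removed, and that the caller supplies `f_sf`, a hypothetical parameter holding an upper-tail F probability function such as `scipy.stats.f.sf`. The removal test uses the change-F with one numerator degree of freedom, which gives the same p-value as the two-sided t-test on that coefficient. Variables are referred to by their index in the correlation matrix.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or ill-conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def subset_r2(R_xx, r, subset):
    """R^2 of the target with correlations r, regressed on the given subset."""
    if not subset:
        return 0.0
    a = [[R_xx[i][j] for j in subset] for i in subset]
    b = [r[i] for i in subset]
    return sum(bi * ri for bi, ri in zip(solve(a, b), b))

def change_p(r2_full, r2_reduced, n, k_full, f_sf):
    """p-value of the change-F test for one added (or removed) variable."""
    df_res = n - k_full - 1
    denom = max(1.0 - r2_full, 1e-12) / df_res  # guard against a perfect fit
    return f_sf((r2_full - r2_reduced) / denom, 1, df_res)

def stepwise(R_xx, r_yx, n, forced, candidates, p_enter, p_remove, f_sf,
             max_steps=100):
    model, history = list(forced), []           # Step 2: forced variables first
    for _ in range(max_steps):                  # guard against cycling
        r2_now = subset_r2(R_xx, r_yx, model)
        best = None
        for c in candidates:                    # 3.1: pool of non-selected variables
            if c in model:
                continue
            col = [R_xx[i][c] for i in range(len(R_xx))]
            if 1.0 - subset_r2(R_xx, col, model) < 1e-4:   # 3.2: tolerance check
                continue
            trial = model + [c]
            p = change_p(subset_r2(R_xx, r_yx, trial), r2_now, n, len(trial), f_sf)  # 3.3
            if p <= p_enter and (best is None or p < best[1]):
                best = (c, p)
        if best is None:                        # 3.4: nothing qualifies, stop
            break
        model.append(best[0])
        history.append(("entered", best[0]))
        for v in [m for m in model if m not in forced]:    # 3.5: removal test
            reduced = [m for m in model if m != v]
            r2_full = subset_r2(R_xx, r_yx, model)
            p = change_p(r2_full, subset_r2(R_xx, r_yx, reduced), n, len(model), f_sf)
            if p > p_remove:
                model.remove(v)
                history.append(("removed", v))
    return model, history
```

The returned `history` plays the role of the Step History tab: one entry per attribute entered or removed. Choosing a p-to-enter value no larger than the p-to-remove value avoids a variable being entered and immediately removed; the `max_steps` limit is only a safety net for other settings.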
References
None.
