Reduce
From Clariopedia
Overview
The clario® Reduce node reduces the number of numeric attributes by eliminating those with high levels of multicollinearity. For groups of attributes that are highly correlated with each other, only the attributes most related to the dependent attribute are retained.
First, factor analysis is used to find the unique dimensions of the data. Ordinary Least Squares (OLS) stepwise regression is then used to determine which of the attributes from each factor are most strongly related to the dependent attribute. The strongest attributes from each factor survive, and there is typically representation from each factor in the resulting dataset.
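The grouping idea behind the two steps can be illustrated with a small sketch. The loadings below are hypothetical, and assigning each attribute to the factor on which it has the largest absolute loading is an assumption made for illustration; this page does not state how Reduce assigns attributes to factors.

```python
# Hypothetical factor loadings: attribute name -> loading on each of two factors.
# These numbers are made up for illustration only.
loadings = {
    "income": [0.82, 0.10],
    "spend": [0.77, -0.05],
    "age": [0.12, 0.69],
    "tenure": [-0.20, -0.74],
}


def assign_factors(loadings):
    """Map each attribute to the 1-based factor with the largest absolute loading.

    Assumption: largest absolute loading decides membership.
    """
    assignment = {}
    for name, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        assignment[name] = best + 1
    return assignment


def group_by_factor(assignment):
    """Collect attribute names per factor, preserving input order."""
    groups = {}
    for name, factor in assignment.items():
        groups.setdefault(factor, []).append(name)
    return groups


factor_of = assign_factors(loadings)
groups = group_by_factor(factor_of)
```

Each group would then be the candidate set for one stepwise regression against the dependent attribute.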
Usage
Input Stream
The node connector can be connected to a variety of nodes (e.g., Read, Aggregate, Append, Missing), but it requires a valid stream of data.
Configuration
The Reduce node has two configuration faces.
Configuration 1
The first configuration face contains three list boxes: Available Attributes, Selected Attributes and Pass-Through Attributes. Drag and drop the attributes you want to process through Reduce into the Selected Attributes box. Drag and drop any attributes you want to keep in the resulting dataset, but don’t want to process in the Reduce node, into the Pass-Through Attributes box. Leave any attributes that you want to drop entirely in the Available Attributes box.
NOTES: To find attribute names efficiently, begin typing an attribute name in the text box directly above Available Attributes. You will be directed to the attribute(s) beginning with the letter(s) you type. To select multiple attributes at once, use [Ctrl]+click to select them one at a time, [Shift]+down arrow to select them in order of appearance, or [Shift]+click on the first and last attribute to select every attribute between them. To de-select an attribute, click on it in either the Selected Attributes box or the Pass-Through Attributes box, then drag and drop it into the Available Attributes box. Attributes in either the Selected Attributes or Pass-Through Attributes list can be re-ordered by clicking and holding on an attribute and dragging it to the desired position.
Configuration 2
The second configuration face contains two separate areas: Factor Analysis Options and Linear Regression Options.
In Factor Analysis Options, the number of factors is determined by either the Automatic or the Manual method. If Automatic is selected, you need to also specify:
• Proportion of Variance %: the percent of variance in the dataset you want to explain with the set of factors. Valid range is from 0 to 100 with one digit of decimal precision.
• Rotation Method: the rotation method you want to use in the factor analysis, choices being none, varimax and equamax
• Write to File: the dataset name (either an existing dataset name you want to write over, or a new dataset name)
If Manual is selected, you need to also specify:
• Number of Factors: the number of factors you want to create to represent the dataset. Valid range is greater than or equal to 1, and less than or equal to the number of attributes in Selected Attributes on Configuration Face 1.
• Rotation Method: the rotation method you want to use in the factor analysis, choices being none, varimax and equamax
• Write to File: the dataset name (either an existing dataset name you want to write over, or a new dataset name)
NOTES: To create more factors using the automatic method, raise the % of variance explained. To create fewer factors using the automatic method, lower the % of variance explained. To create more factors using the manual method, simply increase the number of factors. To create fewer factors using the manual method, simply decrease the number of factors.
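The relationship between the Proportion of Variance % setting and the number of factors can be sketched as follows. This assumes the automatic method keeps the smallest number of factors whose cumulative share of variance reaches the requested percentage, which is a common rule but is not confirmed on this page. The function name and the eigenvalues are hypothetical.

```python
def factors_for_variance(eigenvalues, proportion_pct):
    """Return the smallest number of factors whose cumulative share of
    variance reaches proportion_pct (0 to 100).

    Assumption: each factor's share of variance is its eigenvalue divided
    by the sum of all eigenvalues.
    """
    if not 0 <= proportion_pct <= 100:
        raise ValueError("Proportion of Variance % must be between 0 and 100")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if 100.0 * cumulative / total >= proportion_pct:
            return k
    return len(ordered)


# Hypothetical eigenvalues; cumulative shares are roughly 50%, 80%, 92%, 100%.
example_eigenvalues = [2.5, 1.5, 0.6, 0.4]
few = factors_for_variance(example_eigenvalues, 75.0)
more = factors_for_variance(example_eigenvalues, 90.0)
```

Under these assumptions, raising the percentage from 75 to 90 moves the result from two factors to three, matching the behavior described in the note above.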
In Linear Regression Options, you define the rules for step two of the Reduce node, which builds a stepwise linear model for each of the factors built in step one. First, select your dependent attribute from the drop-down box. Next, select the maximum p-value to enter an attribute into each regression and the minimum p-value to remove an attribute from each regression. Finally, specify the minimum tolerance value to keep an attribute in each regression.
NOTES: To keep more attributes, raise the p to enter and p to remove criteria, and/or lower the minimum tolerance. To keep fewer attributes, lower the p to enter and p to remove criteria, and/or raise the minimum tolerance.
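The effect of these three thresholds can be sketched in a few lines. The helper names are hypothetical, the inclusive comparisons are an assumption, and tolerance is computed here only for the simplest case of one other attribute, where it equals 1 minus the squared Pearson correlation. The general definition is 1 minus the R-squared from regressing the attribute on the other attributes in the model.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Neither list may be constant, or the denominator is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def tolerance_against_one(x, other):
    """Tolerance of x given a single other attribute: 1 - r^2."""
    r = pearson_r(x, other)
    return 1.0 - r * r


def may_enter(p_value, tolerance, p_to_enter, min_tolerance):
    """Assumed entry rule: significant enough and not too collinear."""
    return p_value <= p_to_enter and tolerance >= min_tolerance


def must_remove(p_value, p_to_remove):
    """Assumed removal rule: p-value has risen to the removal threshold."""
    return p_value >= p_to_remove


# A duplicate attribute has tolerance 0; an uncorrelated one has tolerance 1.
tol_duplicate = tolerance_against_one([1, 2, 3, 4], [1, 2, 3, 4])
tol_unrelated = tolerance_against_one([1, 2, 3, 4], [1, -1, -1, 1])
```

With a hypothetical candidate whose p-value is 0.08 and tolerance is 0.5, `may_enter(0.08, 0.5, 0.05, 0.1)` is false, while raising p to enter to 0.10 makes it true.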
Field Definitions
- Available Attributes (and therefore Selected, and Pass-Through Attributes) must be numeric.
- At least 1 attribute must be placed in the "Selected Attributes" list.
- Pass-Through Attributes will not be included in factor analysis or linear regression.
- All spinners are disabled unless at least one attribute is in Selected Attributes.
- The value for "Number of Factors" cannot exceed the number of attributes in the "Selected Attributes" column.
- An attribute that has been selected for the "Dependent Attribute" cannot be in the "Selected" or "Pass-Through" lists.
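The rules above can be expressed as a small validation routine. This is only an illustrative sketch: the function, its parameters and the error messages are hypothetical and are not part of clario.

```python
def validate_reduce_config(selected, pass_through, dependent, mode,
                           number_of_factors=None, proportion_of_variance=None):
    """Return a list of rule violations (empty if the configuration is valid).

    mode is "automatic" or "manual"; all names here are illustrative.
    """
    errors = []
    if len(selected) < 1:
        errors.append("At least 1 attribute must be in Selected Attributes.")
    if dependent in selected or dependent in pass_through:
        errors.append("The Dependent Attribute cannot be in the Selected "
                      "or Pass-Through lists.")
    if mode == "manual":
        if number_of_factors is None or not 1 <= number_of_factors <= len(selected):
            errors.append("Number of Factors must be between 1 and the "
                          "number of Selected Attributes.")
    elif mode == "automatic":
        p = proportion_of_variance
        if p is None or not 0 <= p <= 100:
            errors.append("Proportion of Variance % must be between 0 and 100.")
        elif round(p, 1) != p:
            errors.append("Proportion of Variance % allows one decimal digit.")
    else:
        errors.append("Mode must be 'automatic' or 'manual'.")
    return errors


ok = validate_reduce_config(["income", "spend", "age"], ["customer_id"],
                            "revenue", "manual", number_of_factors=2)
too_many = validate_reduce_config(["income", "spend"], [], "revenue",
                                  "manual", number_of_factors=3)
```

Here `ok` is an empty list, while `too_many` reports that three factors were requested for only two selected attributes.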
Results
There is one results face for the Reduce node containing:
Number of factors: Number of factors as a result of step 1 (the factor analysis step)
Total attributes considered: Number of non-constant attributes processed in the factor analysis step
Total attributes selected: Number of final attributes in the resulting dataset, after step 2 (linear regression).
The results face also contains a table listing all the attributes considered and selected in the Selected Attributes box of Configuration Face 1. The columns in the table include:
Attribute: Name of the attribute
Factor: The factor that the attribute belongs to (represented by a number).
Reason: Reason the attribute was either kept or dropped.
Possible reasons include:
- Selected – this attribute is one of the final selected attributes, written to the final output file
- Nonsignif_Rejected – this attribute was rejected because it was not significant in the factor regression
- Collinear_Rejected – this attribute was rejected because it is highly correlated with another attribute in the same factor
- Constant_Rejected – this attribute was not considered in the factor analysis or regressions, because it has a constant value
- Passthrough_Attribute – this attribute is on the final output dataset, because you specified it as a Pass-Through attribute on Configuration Face 1
You can sort this table on Factor Number, to see which attributes fall into the same factor. You can also sort this table on Reason, to see which attributes are selected, or rejected for various reasons. Finally, you can sort this table on Attribute name, to quickly find a specific attribute in alphabetical order.
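If the results table is copied out of the tool for further inspection, the same three sorts can be reproduced as follows. The rows and the lowercase column keys are hypothetical, and only the Reason values are taken from the list above.

```python
from collections import Counter

# Hypothetical results rows, made up for illustration.
rows = [
    {"attribute": "income", "factor": 1, "reason": "Selected"},
    {"attribute": "spend", "factor": 1, "reason": "Collinear_Rejected"},
    {"attribute": "age", "factor": 2, "reason": "Selected"},
    {"attribute": "tenure", "factor": 2, "reason": "Nonsignif_Rejected"},
]

# Sort on Factor to see which attributes fall into the same factor.
by_factor = sorted(rows, key=lambda r: (r["factor"], r["attribute"]))

# Sort on Reason to see selected and rejected attributes together.
by_reason = sorted(rows, key=lambda r: (r["reason"], r["attribute"]))

# Sort on Attribute name to find a specific attribute alphabetically.
by_name = sorted(rows, key=lambda r: r["attribute"])

# Count how many attributes ended up with each reason.
reason_counts = Counter(r["reason"] for r in rows)
```

The secondary sort on attribute name keeps the order within each factor or reason stable and easy to scan.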
Output Stream
The output of the Reduce node is a dataset, with the name you specified in the Write to File field on Configuration Face 2. You can read in this ‘reduced’ dataset using a Read node, and continue your analysis.
