Welcome to the Variable Selection Accelerator!

Business Needs for a Variable Selection Accelerator:

1.) We want to explain the data in the simplest way; therefore, redundant predictors should be removed.

2.) Unnecessary predictors add noise to the estimation of the other quantities we are interested in, so keeping them reduces the accuracy of those estimates.

3.) If the model is to be used for prediction, we can save time and/or money by not measuring redundant predictors.

Test the App

To test the app, simply select a pre-loaded dataset and then follow the instructions in the next box.

How to use the Variable Selection Accelerator:

1.) Start by uploading your CSV file in the Side Menu Panel. Note that if your data contains missing values, the rows containing those missing values will be omitted (a minimal sketch of this preprocessing appears after these steps).

2.) Select your Independent Variables from the Summary of your data under the `Data Summary` Tab.

3.) Then open the `Variable Selection` Tab and select your target variable (i.e. the dependent variable).

4.) The algorithms will run on the uploaded dataset and provide the variables selected in a summary.

5.) You may adjust the algorithms' controls from the Side Menu Panel and choose which results are visible.
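
A minimal sketch of the preprocessing described in step 1, assuming base R's read.csv and na.omit (the file name is hypothetical, and the app's actual implementation may differ):

    # Read the uploaded CSV and drop any rows containing missing values.
    data <- read.csv("my_data.csv", stringsAsFactors = TRUE)  # hypothetical file
    data <- na.omit(data)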

Summary of Data

Select your independent variables from the table

Numerical Data

Loading...

Categorical Data

Loading...

Correlation Matrix for selected numeric variables

Correlation Matrix

Correlation Plot

You haven't selected a sufficient number of numerical variables for a correlation matrix. Remember that a correlation matrix requires at least 2 variables.
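
For reference, a correlation matrix can be computed from any two or more numeric columns, e.g. in base R:

    # Correlation matrix of three numeric columns of the built-in mtcars data;
    # substitute your own selected variables.
    cor(mtcars[, c("mpg", "wt", "hp")], use = "complete.obs")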

First 5 lines of data

Select a Target Variable to begin

    Comparison of Selected Classification Algorithms

    Loading...

    Comparison of Selected Regression Algorithms

    Loading...

    Step-wise Regression - p Value

    Algorithm Results

    Loading...
    This algorithm uses the p-value in a combination of forward selection and backward elimination.
    You can find out more at: NCSS Statistical Software Step-wise Regression write-up
    The lower the p-value, the higher the variable's rank.
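
    A minimal sketch of the idea in base R, as a hypothetical helper rather than the app's exact implementation: alternate a forward step and a backward step driven by p-values until neither changes the model.

        # Hypothetical base-R sketch of combined forward/backward selection
        # by p-value. Assumes a numeric target and numeric predictors.
        stepwise_p <- function(data, target, enter = 0.05, remove = 0.10) {
          chosen    <- character(0)
          remaining <- setdiff(names(data), target)
          refit <- function(vars) lm(reformulate(if (length(vars)) vars else "1",
                                                 response = target), data = data)
          repeat {
            changed <- FALSE
            # Forward: add the candidate with the lowest p-value below `enter`.
            if (length(remaining)) {
              pv <- sapply(remaining, function(v)
                coef(summary(refit(c(chosen, v))))[v, "Pr(>|t|)"])
              if (min(pv) < enter) {
                best <- names(which.min(pv))
                chosen <- c(chosen, best); remaining <- setdiff(remaining, best)
                changed <- TRUE
              }
            }
            # Backward: drop the variable with the highest p-value above `remove`.
            if (length(chosen) > 1) {
              pv <- coef(summary(refit(chosen)))[chosen, "Pr(>|t|)"]
              if (max(pv) > remove) {
                worst <- names(which.max(pv))
                chosen <- setdiff(chosen, worst); remaining <- c(remaining, worst)
                changed <- TRUE
              }
            }
            if (!changed) return(refit(chosen))
          }
        }

        fit <- stepwise_p(mtcars, "mpg")   # e.g. on the built-in mtcars data
        summary(fit)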

    p-value vs Attribute plot

    Loading...

    Backward Step-wise Regression - p Value

    Algorithm Results

    Loading...
    This algorithm uses the p-value to conduct backward elimination.
    The variable with the highest p-value above the critical alpha value is eliminated from the model during each iteration.
    You can find out more at: NCSS Statistical Software Backward Step-wise Regression write-up
    The lower the p-value, the higher the variable's rank.
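
    A minimal backward-elimination sketch in base R (a hypothetical helper, not the app's exact implementation):

        # Repeatedly drop the predictor with the largest p-value until all
        # remaining p-values are at or below alpha (or one predictor is left).
        backward_p <- function(data, target, alpha = 0.05) {
          preds <- setdiff(names(data), target)
          repeat {
            fit   <- lm(reformulate(preds, response = target), data = data)
            pvals <- coef(summary(fit))[-1, "Pr(>|t|)", drop = FALSE]  # skip intercept
            worst <- which.max(pvals[, 1])
            if (pvals[worst, 1] <= alpha || length(preds) == 1) return(fit)
            preds <- setdiff(preds, rownames(pvals)[worst])
          }
        }

        fit <- backward_p(mtcars, "mpg")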

    p-value vs Attribute plot

    Loading...

    Forward Step-wise Regression - p Value

    Algorithm Results

    Loading...
    This algorithm uses the p-value to conduct forward selection.
    The variable with the lowest p-value below the critical alpha value is added to the model during each iteration.
    You can find out more at: NCSS Statistical Software Forward Step-wise Regression write-up
    The lower the p-value, the higher the variable's rank.
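
    A minimal forward-selection sketch in base R (a hypothetical helper, not the app's exact implementation):

        # Repeatedly add the candidate whose p-value, when added to the current
        # model, is lowest and below alpha; stop when no candidate qualifies.
        forward_p <- function(data, target, alpha = 0.05) {
          remaining <- setdiff(names(data), target)
          chosen    <- character(0)
          repeat {
            if (!length(remaining)) break
            pv <- sapply(remaining, function(v)
              coef(summary(lm(reformulate(c(chosen, v), response = target),
                              data = data)))[v, "Pr(>|t|)"])
            if (min(pv) >= alpha) break
            best      <- names(which.min(pv))
            chosen    <- c(chosen, best)
            remaining <- setdiff(remaining, best)
          }
          lm(reformulate(if (length(chosen)) chosen else "1", response = target),
             data = data)
        }

        fit <- forward_p(mtcars, "mpg")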

    p-value vs Attribute plot

    Loading...

    Step-wise Regression - AIC

    Algorithm Results

    Loading...
    This algorithm uses the Akaike Information Criterion (AIC) to evaluate if the inclusion/exclusion of certain variables would give the best-fit model.
    AIC captures the trade-off between goodness of fit and complexity of a model.
    You can find out more about AIC at: University of Wisconsin - Explanation on AIC
    The lower the AIC value, the higher the variable's ranking.
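
    In R this is typically done with base R's step(); a minimal sketch on the built-in mtcars data (illustrative, not necessarily the app's exact call):

        full <- lm(mpg ~ ., data = mtcars)                # start from the full model
        both <- step(full, direction = "both", trace = 0)
        summary(both)                                     # the AIC-selected model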

    AIC vs Attribute plot

    Loading...

    Backward Step-wise Regression - AIC

    Algorithm Results

    Loading...
    This algorithm uses the Akaike Information Criterion (AIC) to evaluate whether the elimination of certain variables would give the best-fit model.
    AIC captures the trade-off between goodness of fit and complexity of a model.
    You can find out more about AIC at: University of Wisconsin - Explanation on AIC
    The lower the AIC value, the higher the variable's ranking.
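
    A minimal sketch with base R's step() in backward mode (illustrative):

        full <- lm(mpg ~ ., data = mtcars)   # start from the full model
        bwd  <- step(full, direction = "backward", trace = 0)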

    AIC vs Attribute plot

    Loading...

    Forward Step-wise Regression - AIC

    Algorithm Results

    Loading...
    This algorithm uses the Akaike Information Criterion (AIC) to evaluate whether the inclusion of certain variables would give the best-fit model.
    AIC captures the trade-off between goodness of fit and complexity of a model.
    You can find out more about AIC at: University of Wisconsin - Explanation on AIC
    The lower the AIC value, the higher the variable's ranking.
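
    A minimal sketch with base R's step() in forward mode (illustrative):

        null <- lm(mpg ~ 1, data = mtcars)   # intercept-only starting model
        full <- lm(mpg ~ ., data = mtcars)   # defines the upper search scope
        fwd  <- step(null, scope = formula(full), direction = "forward", trace = 0)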

    AIC vs Attribute plot

    Loading...

    cForest

    Algorithm Results

    Loading...
    cForest uses a Conditional Variable Importance Measure for Random Forests and is based on unbiased conditional inference trees.
    cForest follows the principle of 'mean decrease in accuracy' importance.
    The greater the mean decrease in accuracy, the more significant the variable.
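
    A minimal sketch, assuming the party package (which provides cforest and conditional permutation importance); the settings shown are illustrative, not the app's:

        library(party)
        set.seed(1)
        cf <- cforest(mpg ~ ., data = mtcars,
                      controls = cforest_unbiased(ntree = 500, mtry = 3))
        vi <- varimp(cf, conditional = TRUE)   # conditional importance (can be slow)
        sort(vi, decreasing = TRUE)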

    Importance vs Attribute plot

    Loading...

    Random Forest for Regression

    Algorithm Results

    Loading...
    A random forest averages many deep decision trees, each trained on a different part of the same training set, to overcome the over-fitting problem of an individual decision tree, and in doing so provides an importance measure for each variable.
    For regression, the random forest importance is based on the 'mean decrease in node impurity' principle, where impurity is measured by the residual sum of squares.
    You can find out more about random forests at: A comprehensive write-up about random forests
    The larger a variable's total decrease in node impurity, the more significant the variable.
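
    A minimal sketch with the randomForest package (assumed from the terminology above), on the built-in mtcars data:

        library(randomForest)
        set.seed(1)
        rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)
        importance(rf, type = 2)   # IncNodePurity: total RSS reduction per variable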

    Importance vs Attribute plot

    Loading...

    Lasso Procedure

    Algorithm Results

    Loading...
    Lasso, or the Least Absolute Shrinkage and Selection Operator, is a procedure that improves the prediction accuracy and interpretability of regression models by altering the model-fitting process to select only a subset of the provided covariates.
    Lasso is a continuous subset-selection algorithm that can 'shrink' the effects of unimportant predictors all the way to zero.
    Use lambda.1se when you want to select lambda within 1 standard error of the best model.
    Use lambda.min when you want to select the lambda with minimum mean cross-validated error.
    You can find out more about the Lasso Procedure at: A comprehensive write-up about the lasso procedure
    Lambda is the weight given to the regularization term (the L1 norm), so as lambda approaches zero, the loss function of your model approaches the OLS loss function. As you increase the L1 norm, variables will enter the model as their coefficients take non-zero values.
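
    The lambda.1se/lambda.min terminology matches the glmnet package, so a minimal sketch might look like this (data and seed are illustrative):

        library(glmnet)
        x <- as.matrix(mtcars[, -1])          # glmnet expects a numeric matrix
        y <- mtcars$mpg
        set.seed(1)
        cv <- cv.glmnet(x, y, alpha = 1)      # alpha = 1 gives the lasso
        coef(cv, s = "lambda.1se")            # sparser model, within 1 SE of the best
        coef(cv, s = "lambda.min")            # minimum mean cross-validated error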

    Coefficients vs. L1 Norm Plot

    Loading...

    Elastic Net Procedure

    Algorithm Results

    Loading...
    Elastic net is a hybrid of ridge regression and lasso regularization.
    Like lasso, elastic net can generate reduced models by generating zero-valued coefficients.
    Empirical studies have suggested that the elastic net technique can outperform lasso on data with highly correlated predictors.
    Use lambda.1se when you want to select lambda within 1 standard error of the best model (recommended for the elastic net).
    Use lambda.min when you want to select the lambda with minimum mean cross-validated error.
    You can find out more about the Elastic Net Procedure and its difference from Lasso Procedure at: Stanford University - Slides explaining the elastic net and lasso procedure
    Lambda is the weight given to the regularization term (the L1 norm), so as lambda approaches zero, the loss function of your model approaches the OLS loss function. As you increase the L1 norm, variables will enter the model as their coefficients take non-zero values.
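
    A minimal glmnet sketch for the elastic net (alpha between 0 and 1 mixes ridge and lasso; the 0.5 here is illustrative, not the app's setting):

        library(glmnet)
        x <- as.matrix(mtcars[, -1])
        y <- mtcars$mpg
        set.seed(1)
        cv <- cv.glmnet(x, y, alpha = 0.5)    # 0 = ridge, 1 = lasso
        coef(cv, s = "lambda.1se")            # the recommended choice above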

    Coefficients vs. L1 Norm Plot

    Loading...

    Boruta Algorithm

    Algorithm Results

    Loading...
    The Boruta algorithm uses the variable importance obtained from a random forest, together with randomisation, to identify the truly important and statistically valid variables for a model.
    As Boruta is based on random forests, it too uses the 'mean decrease Gini' to calculate the importance of each variable.
    You can find out more about the Boruta Algorithm at: A comprehensive write-up about the boruta algorithm
    The higher the importance the more significant the variable.
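
    A minimal sketch with the Boruta package (assumed), on the built-in iris data:

        library(Boruta)
        set.seed(1)
        b <- Boruta(Species ~ ., data = iris)
        print(b)                                         # confirmed / tentative / rejected
        getSelectedAttributes(b, withTentative = FALSE)  # the confirmed variables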

    Importance vs. Attributes plot

    Loading...

    Random Forest for Classification

    Algorithm Results

    Loading...
    A random forest averages many deep decision trees, each trained on a different part of the same training set, to overcome the over-fitting problem of an individual decision tree, and in doing so provides an importance measure for each variable.
    For classification, the random forest importance is based on the 'mean decrease in node impurity', where impurity is measured by the Gini index.
    You can find out more about random forests at: A comprehensive write-up about random forests
    The larger a variable's total decrease in node impurity (Gini index), the more significant the variable.
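
    A minimal sketch with the randomForest package on a classification target (the built-in iris data):

        library(randomForest)
        set.seed(1)
        rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
        importance(rf, type = 2)   # MeanDecreaseGini per variable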

    Importance vs. Attributes plot

    Loading...