Linear Regression Variable Selection

One use case for genetic optimization (GO) is to find a good set of variables for a multiple linear regression model. Once the number of variables in a dataset grows, it becomes increasingly harder to find good combinations of variables and GO can help with that.

One way to measure how good a linear regression model explains some dataset is to look at the Bayesian Information Critereon (BIC) that decreases when the likelihood of the model increases and increases when more variables are added. A regression model with low BIC means we have found a good fit using few variables, which makes sure the model has not overfit the data.

Code Walkthough

We start by importing necessary packages. We use statsmodels for the regression model because it automatically computes the BIC value of each model.

from functools import partial
import statsmodels.api as sm
from sklearn import datasets
import numpy as np
import pandas as pd
from genopt import GeneticOptimizer

Next we need to write the function to optmize. In our case we want to fit a linear regression model on a set of variables that are defined by the chromosome. For this we can use a binary chromosome where 1 means that a variable is included and 0 means that it is excluded. We write a small helper function to fit a regression model on these variables

# Helper function to fit a regression model after a chromosome
def linear_regression_fit_chromosome(chromosome, x_data, targets):
    x_subset = x_data.iloc[:, chromosome.astype(bool)]
    regression_model = sm.OLS(targets, x_subset)
    return regression_model.fit()

Because the GeneticOptimizer will try to maximize the fitness of its population, and we want to minimize the BIC, we define our fitness as -BIC. If all chromosomes are 0 we cannot train a model, so we create an edge case for that, that gives very low fitness.

# Objective function to maximize
def linear_regression_minimize_bic(chromosome, x_data, targets):
    if np.all(chromosome == 0):
        return -1e9

    results = linear_regression_fit_chromosome(chromosome, x_data, targets)
    return -results.bic

In our main function we load the data that we want to use, in this case the Boston housing dataset, which has a small dimensionality for GO, but big enough for the purposes of this example.

We initialize a GeneticOptimizer with a population size of 100 and binary genes. We use the partial function from functools to set default values, so that chromosome is the only argument of the objective function. Then we can start the optimization. Because the dimensionality of the Boston dataset is small we can expect to find the global optimum within a few generations. We can then print the results of the best model.

def main():
    # Load Boston housing dataset
    data = datasets.load_boston()
    x_data = pd.DataFrame(data=data["data"], columns=data["feature_names"])
    targets = data["target"]

    # Setup GeneticOptimizer
    optimizer = GeneticOptimizer(
        n_vars=x_data.shape[1],
        popsize=100,
        objective_function=partial(
            linear_regression_minimize_bic, x_data=x_data, targets=targets
        ),
        encoding="discrete",
        var_range=(0, 1),
        var_size=1,
    )

    best_chromosome = optimizer.optimize(10)

    # Show results of optimization
    results = linear_regression_fit_chromosome(best_chromosome, x_data, targets)
    print(results.summary())


if __name__ == "__main__":
    main()

If we run it we get this output:

2021-04-20 22:29:32,169 - INFO - genopt.go - Generation: 0 - Max fitness: -3118.9878994365486
2021-04-20 22:29:32,237 - INFO - genopt.go - Generation: 1 - Max fitness: -3118.9878994365486
2021-04-20 22:29:32,304 - INFO - genopt.go - Generation: 2 - Max fitness: -3118.9878994365486
2021-04-20 22:29:32,371 - INFO - genopt.go - Generation: 3 - Max fitness: -3118.9878994365486
2021-04-20 22:29:32,437 - INFO - genopt.go - Generation: 4 - Max fitness: -3116.0042335230205
2021-04-20 22:29:32,503 - INFO - genopt.go - Generation: 5 - Max fitness: -3112.7405121573247
2021-04-20 22:29:32,570 - INFO - genopt.go - Generation: 6 - Max fitness: -3112.7405121573247
2021-04-20 22:29:32,638 - INFO - genopt.go - Generation: 7 - Max fitness: -3112.7405121573247
2021-04-20 22:29:32,705 - INFO - genopt.go - Generation: 8 - Max fitness: -3110.0331325837083
2021-04-20 22:29:32,775 - INFO - genopt.go - Generation: 9 - Max fitness: -3110.0331325837083
                                OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.958
Model:                            OLS   Adj. R-squared (uncentered):              0.957
Method:                 Least Squares   F-statistic:                              1425.
Date:                Tue, 20 Apr 2021   Prob (F-statistic):                        0.00
Time:                        22:29:32   Log-Likelihood:                         -1530.1
No. Observations:                 506   AIC:                                      3076.
Df Residuals:                     498   BIC:                                      3110.
Df Model:                           8
Covariance Type:            nonrobust
==============================================================================
                coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
CRIM          -0.0799      0.031     -2.545      0.011      -0.142      -0.018
ZN             0.0437      0.014      3.138      0.002       0.016       0.071
CHAS           2.8766      0.898      3.205      0.001       1.113       4.640
RM             5.5964      0.244     22.926      0.000       5.117       6.076
DIS           -0.7761      0.158     -4.904      0.000      -1.087      -0.465
PTRATIO       -0.4881      0.098     -5.000      0.000      -0.680      -0.296
B              0.0140      0.003      5.404      0.000       0.009       0.019
LSTAT         -0.4853      0.041    -11.761      0.000      -0.566      -0.404
==============================================================================
Omnibus:                      189.204   Durbin-Watson:                   1.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1085.081
Skew:                           1.526   Prob(JB):                    2.39e-236
Kurtosis:                       9.492   Cond. No.                     1.49e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Full Code

from functools import partial
import statsmodels.api as sm
from sklearn import datasets
import numpy as np
import pandas as pd
from genopt import GeneticOptimizer


# Helper function to fit a regression model after a chromosome
def linear_regression_fit_chromosome(chromosome, x_data, targets):
    x_subset = x_data.iloc[:, chromosome.astype(bool)]
    regression_model = sm.OLS(targets, x_subset)
    return regression_model.fit()


# Objective function to maximize
def linear_regression_minimize_bic(chromosome, x_data, targets):
    if np.all(chromosome == 0):
        return -1e9

    results = linear_regression_fit_chromosome(chromosome, x_data, targets)
    return -results.bic


def main():
    # Load Boston housing dataset
    data = datasets.load_boston()
    x_data = pd.DataFrame(data=data["data"], columns=data["feature_names"])
    targets = data["target"]

    # Setup GeneticOptimizer
    optimizer = GeneticOptimizer(
        n_vars=x_data.shape[1],
        popsize=100,
        objective_function=partial(
            linear_regression_minimize_bic, x_data=x_data, targets=targets
        ),
        encoding="discrete",
        var_range=(0, 1),
        var_size=1,
    )

    best_chromosome = optimizer.optimize(10)

    # Show results of optimization
    results = linear_regression_fit_chromosome(best_chromosome, x_data, targets)
    print(results.summary())


if __name__ == "__main__":
    main()