LASSO Regression & RIDGE Regression
--- by gzYou

Introduction

First, we randomly generate a set of data \begin{equation}\label{eq:1} y = \sin(x) + \epsilon, \quad x \in [\pi/3,\ 5\pi/3] \end{equation} where $\epsilon$ is a random noise term.

In [1]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt

%matplotlib inline
from matplotlib.pylab import rcParams

rcParams["figure.figsize"] = 12, 10

x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])
np.random.seed(10)
y = np.sin(x) + np.random.normal(0, 0.15, len(x))
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.')
Out[1]:
[<matplotlib.lines.Line2D at 0x16bb1ea1c18>]

Because of the noise, the data do not follow the $\sin$ curve exactly. Now we fit the data with polynomial regression, adding powers of $x$ from order 1 up to order 15 to the dataset: \begin{equation}\label{eq:2} y = \theta_0 + \theta_1 x^1 + \dots + \theta_n x^n, \quad n = 1, \dots, 15 \end{equation}

In [2]:
for i in range(2, 16):
    colname = "x_%d" % i
    data[colname] = data["x"] ** i
print(data.head())
          x         y       x_2       x_3       x_4       x_5       x_6  \
0  1.047198  1.065763  1.096623  1.148381  1.202581  1.259340  1.318778   
1  1.117011  1.006086  1.247713  1.393709  1.556788  1.738948  1.942424   
2  1.186824  0.695374  1.408551  1.671702  1.984016  2.354677  2.794587   
3  1.256637  0.949799  1.579137  1.984402  2.493673  3.133642  3.937850   
4  1.326450  1.063496  1.759470  2.333850  3.095735  4.106339  5.446854   

        x_7       x_8        x_9       x_10       x_11       x_12       x_13  \
0  1.381021  1.446202   1.514459   1.585938   1.660790   1.739176   1.821260   
1  2.169709  2.423588   2.707173   3.023942   3.377775   3.773011   4.214494   
2  3.316683  3.936319   4.671717   5.544505   6.580351   7.809718   9.268760   
3  4.948448  6.218404   7.814277   9.819710  12.339811  15.506664  19.486248   
4  7.224981  9.583578  12.712139  16.862020  22.366630  29.668222  39.353420   

        x_14       x_15  
0   1.907219   1.997235  
1   4.707635   5.258479  
2  11.000386  13.055521  
3  24.487142  30.771450  
4  52.200353  69.241170  

We build 15 different linear regression models; the $i$-th model contains the variables from $x^1$ through $x^i$. For example, the 8th model includes $\{x^1, x^2, x^3, \dots, x^8\}$.

In [4]:
from sklearn.linear_model import LinearRegression


def linear_regression(data, power, models_to_plot):
    r"""returns a list containing – [ model RSS, intercept, coef_x, coef_x2, … upto entered power ]. 
    ----------
    power : int
        Maximum power of x
    
    RSS :float
        Sum of square of errors between the predicted and actual values in the training data set
        
    models_to_plot: dict
        A dict containg the model to be ploted
    
    Returns
    -------
    list
         [ model RSS, intercept, coef_x, coef_x2, … upto entered power ]
    
    """
    # initialize predictors:
    predictors = ["x"]
    if power >= 2:
        predictors.extend(["x_%d" % i for i in range(2, power + 1)])

    # Fit the model
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors], data["y"])
    y_pred = linreg.predict(data[predictors])

    # Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data["x"], y_pred)
        plt.plot(data["x"], data["y"], ".")
        plt.title("Plot for power: %d" % power)

    # Return the result in pre-defined format
    rss = sum((y_pred - data["y"]) ** 2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret
In [5]:
# Initialize a dataframe to store the results:
col = ["rss", "intercept"] + ["coef_x_%d" % i for i in range(1, 16)]
ind = ["model_pow_%d" % i for i in range(1, 16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

# Define the powers for which a plot is required
models_to_plot = {1: 231, 3: 232, 6: 233, 9: 234, 12: 235, 15: 236}

# Iterate through all powers and assimilate results
for i in range(1, 16):
    coef_matrix_simple.iloc[i - 1, 0 : i + 2] = linear_regression(
        data, power=i, models_to_plot=models_to_plot
    )

We find that as the maximum power of $x$ increases, the model tries harder and harder to pass through every data point. Although this makes the fit on the training data ever more accurate, experience tells us that the model has overfit.
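To make the overfitting concrete, here is a minimal sketch that is not part of the original notebook: it holds out every third observation as a test set (an arbitrary split of my own) and refits the polynomial models; typically the training RSS keeps shrinking while the held-out RSS eventually gets worse.

from sklearn.linear_model import LinearRegression

# Hold out every 3rd observation as a rough test set (arbitrary choice for illustration)
test_mask = np.arange(len(data)) % 3 == 0
train, test = data[~test_mask], data[test_mask]

for power in [1, 3, 6, 9, 12, 15]:
    predictors = ["x"] + ["x_%d" % i for i in range(2, power + 1)]
    model = LinearRegression(normalize=True).fit(train[predictors], train["y"])
    train_rss = sum((model.predict(train[predictors]) - train["y"]) ** 2)
    test_rss = sum((model.predict(test[predictors]) - test["y"]) ** 2)
    print("power %2d | train RSS %.3f | test RSS %.3f" % (power, train_rss, test_rss))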

Why is that? Let's look at the coefficients of the models we just fit:

In [6]:
# Set the display format to be scientific for ease of analysis
pd.options.display.float_format = "{:,.2g}".format
coef_matrix_simple
Out[6]:
rss intercept coef_x_1 coef_x_2 coef_x_3 coef_x_4 coef_x_5 coef_x_6 coef_x_7 coef_x_8 coef_x_9 coef_x_10 coef_x_11 coef_x_12 coef_x_13 coef_x_14 coef_x_15
model_pow_1 3.3 2 -0.62 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_2 3.3 1.9 -0.58 -0.006 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_3 1.1 -1.1 3 -1.3 0.14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_4 1.1 -0.27 1.7 -0.53 -0.036 0.014 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_5 1 3 -5.1 4.7 -1.9 0.33 -0.021 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_6 0.99 -2.8 9.5 -9.7 5.2 -1.6 0.23 -0.014 NaN NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_7 0.93 19 -56 69 -45 17 -3.5 0.4 -0.019 NaN NaN NaN NaN NaN NaN NaN NaN
model_pow_8 0.92 43 -1.4e+02 1.8e+02 -1.3e+02 58 -15 2.4 -0.21 0.0077 NaN NaN NaN NaN NaN NaN NaN
model_pow_9 0.87 1.7e+02 -6.1e+02 9.6e+02 -8.5e+02 4.6e+02 -1.6e+02 37 -5.2 0.42 -0.015 NaN NaN NaN NaN NaN NaN
model_pow_10 0.87 1.4e+02 -4.9e+02 7.3e+02 -6e+02 2.9e+02 -87 15 -0.81 -0.14 0.026 -0.0013 NaN NaN NaN NaN NaN
model_pow_11 0.87 -75 5.1e+02 -1.3e+03 1.9e+03 -1.6e+03 9.1e+02 -3.5e+02 91 -16 1.8 -0.12 0.0034 NaN NaN NaN NaN
model_pow_12 0.87 -3.4e+02 1.9e+03 -4.4e+03 6e+03 -5.2e+03 3.1e+03 -1.3e+03 3.8e+02 -80 12 -1.1 0.062 -0.0016 NaN NaN NaN
model_pow_13 0.86 3.2e+03 -1.8e+04 4.5e+04 -6.7e+04 6.6e+04 -4.6e+04 2.3e+04 -8.5e+03 2.3e+03 -4.5e+02 62 -5.7 0.31 -0.0078 NaN NaN
model_pow_14 0.79 2.4e+04 -1.4e+05 3.8e+05 -6.1e+05 6.6e+05 -5e+05 2.8e+05 -1.2e+05 3.7e+04 -8.5e+03 1.5e+03 -1.8e+02 15 -0.73 0.017 NaN
model_pow_15 0.7 -3.6e+04 2.4e+05 -7.5e+05 1.4e+06 -1.7e+06 1.5e+06 -1e+06 5e+05 -1.9e+05 5.4e+04 -1.2e+04 1.9e+03 -2.2e+02 17 -0.81 0.018

A clear trend is that the magnitude of the coefficients grows from single digits up to the hundreds of thousands and even millions. Intuitively, large coefficients can amplify tiny changes in $x$, and by letting large positive and negative terms cancel, the model manages to pass through almost every data point, noise included.
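A quick way to see this trend without scanning the whole table is to look at the largest absolute coefficient of each model; the short sketch below is my own addition and simply summarizes `coef_matrix_simple`.

# Largest absolute value among intercept and coefficients, per model
coef_only = coef_matrix_simple.drop("rss", axis=1).astype(float)
print(coef_only.abs().max(axis=1))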

How can this overfitting be addressed? An intuitive remedy is to introduce a penalty, such as $L1$ or $L2$ regularization. LASSO and RIDGE do exactly that.

LASSO Regression

LASSO stands for Least Absolute Shrinkage and Selection Operator; it uses $L1$ regularization.

It adds a penalty on the absolute values of the coefficients to the previous objective function: \begin{equation}\label{eq:3} \operatorname*{argmin}_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \end{equation} The goal is to strike a balance: first, the fitting error should be small; second, the absolute values of the coefficients should not be too large, so the model does not become too complex.
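To make the two competing terms of this objective tangible, here is a tiny worked example of my own: a made-up one-dimensional dataset where the true slope is roughly 2, evaluating the data-fit term and the $L1$ penalty for a few candidate slopes $w$.

import numpy as np

# Toy 1-D data: y is roughly 2*x plus a little noise
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2.0 * x_toy + np.array([0.1, -0.2, 0.15, -0.05])
alpha = 1.0

for w in [0.0, 1.0, 2.0, 3.0]:
    fit_term = np.sum((y_toy - w * x_toy) ** 2)  # squared-error term
    penalty = alpha * abs(w)                     # L1 penalty on the single coefficient
    print("w=%.1f  fit=%.3f  penalty=%.3f  objective=%.3f"
          % (w, fit_term, penalty, fit_term + penalty))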

$\alpha$ can take different values:

  • $\alpha = 0$: the coefficients are identical to those of plain linear regression.
  • $\alpha = \infty$: all coefficients shrink to 0.
  • $0 < \alpha < \infty$: the coefficients lie between the two cases above.

Now we use LASSO regression to fit the same data:

In [9]:
from sklearn.linear_model import Lasso


def lasso_regression(data, predictors, alpha, models_to_plot={}):
    # Fit the model
    lassoreg = Lasso(alpha=alpha, normalize=True, max_iter=int(1e5))
    lassoreg.fit(data[predictors], data["y"])
    y_pred = lassoreg.predict(data[predictors])

    # Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data["x"], y_pred)
        plt.plot(data["x"], data["y"], ".")
        plt.title("Plot for alpha: %.3g" % alpha)

    # Return the result in pre-defined format
    rss = sum((y_pred - data["y"]) ** 2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret
In [10]:
predictors = ["x"]
predictors.extend(["x_%d" % i for i in range(2, 16)])

# Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]

# Initialize the dataframe to store coefficients
col = ["rss", "intercept"] + ["coef_x_%d" % i for i in range(1, 16)]
ind = ["alpha_%.2g" % alpha_lasso[i] for i in range(0, 10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)
# Define the models to plot
models_to_plot = {1e-10: 231, 1e-5: 232, 1e-4: 233, 1e-3: 234, 1e-2: 235, 1: 236}

# Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(
        data, predictors, alpha_lasso[i], models_to_plot
    )
c:\users\you\anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
In [11]:
coef_matrix_lasso
Out[11]:
rss intercept coef_x_1 coef_x_2 coef_x_3 coef_x_4 coef_x_5 coef_x_6 coef_x_7 coef_x_8 coef_x_9 coef_x_10 coef_x_11 coef_x_12 coef_x_13 coef_x_14 coef_x_15
alpha_1e-15 0.96 0.22 1.1 -0.37 0.00089 0.0016 -0.00012 -6.4e-05 -6.3e-06 1.4e-06 7.8e-07 2.1e-07 4e-08 5.4e-09 1.8e-10 -2e-10 -9.2e-11
alpha_1e-10 0.96 0.22 1.1 -0.37 0.00088 0.0016 -0.00012 -6.4e-05 -6.3e-06 1.4e-06 7.8e-07 2.1e-07 4e-08 5.4e-09 1.8e-10 -2e-10 -9.2e-11
alpha_1e-08 0.96 0.22 1.1 -0.37 0.00077 0.0016 -0.00011 -6.4e-05 -6.3e-06 1.4e-06 7.8e-07 2.1e-07 4e-08 5.3e-09 2e-10 -1.9e-10 -9.3e-11
alpha_1e-05 0.96 0.5 0.6 -0.13 -0.038 -0 0 0 0 7.7e-06 1e-06 7.7e-08 0 0 0 -0 -7e-11
alpha_0.0001 1 0.9 0.17 -0 -0.048 -0 -0 0 0 9.5e-06 5.1e-07 0 0 0 -0 -0 -4.4e-11
alpha_0.001 1.7 1.3 -0 -0.13 -0 -0 -0 0 0 0 0 0 1.5e-08 7.5e-10 0 0 0
alpha_0.01 3.6 1.8 -0.55 -0.00056 -0 -0 -0 -0 -0 -0 -0 0 0 0 0 0 0
alpha_1 37 0.038 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0
alpha_5 37 0.038 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0
alpha_10 37 0.038 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0 -0

We observe several characteristics:

  • As the penalty coefficient $\alpha$ increases, more and more coefficients become exactly 0 (counted explicitly in the sketch after this list).
  • The coefficients stay within a reasonable range.
  • As $\alpha$ keeps increasing, the RSS rises, and eventually the model underfits.
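The sparsity noted in the first bullet can be checked directly by counting the entries that are exactly 0 in each row of `coef_matrix_lasso`; this small check is my own addition.

# Number of coefficients shrunk exactly to zero for each alpha
print(coef_matrix_lasso.apply(lambda row: sum(row.values == 0), axis=1))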

RIDGE Regression

A quick word about RIDGE: it uses $L2$ regularization, and its objective function is

\begin{equation}\label{eq:4} \operatorname*{argmin}_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \end{equation}
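Unlike LASSO, this $L2$-penalized objective has a closed-form solution, $w = (X^{T}X + \alpha I)^{-1}X^{T}y$. The sketch below is my own check of that formula against scikit-learn on a small random design matrix, assuming no intercept and no normalization (which differs from the `Ridge(normalize=True)` calls used further down).

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_toy = rng.randn(50, 3)                                        # small random design matrix
y_toy = X_toy.dot(np.array([1.0, -2.0, 0.5])) + 0.1 * rng.randn(50)
alpha = 0.5

# Closed-form ridge solution: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X_toy.T.dot(X_toy) + alpha * np.eye(3), X_toy.T.dot(y_toy))

# scikit-learn minimizes the same objective when fit_intercept=False
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy).coef_
print(np.allclose(w_closed, w_sklearn))  # expected: True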

We fit the same data with RIDGE; the procedure is similar to LASSO:

In [12]:
from sklearn.linear_model import Ridge


def ridge_regression(data, predictors, alpha, models_to_plot={}):
    # Fit the model
    ridgereg = Ridge(alpha=alpha, normalize=True)
    ridgereg.fit(data[predictors], data["y"])
    y_pred = ridgereg.predict(data[predictors])

    # Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data["x"], y_pred)
        plt.plot(data["x"], data["y"], ".")
        plt.title("Plot for alpha: %.3g" % alpha)

    # Return the result in pre-defined format
    rss = sum((y_pred - data["y"]) ** 2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret


# Initialize predictors to be set of 15 powers of x
predictors = ["x"]
predictors.extend(["x_%d" % i for i in range(2, 16)])

# Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

# Initialize the dataframe for storing coefficients.
col = ["rss", "intercept"] + ["coef_x_%d" % i for i in range(1, 16)]
ind = ["alpha_%.2g" % alpha_ridge[i] for i in range(0, 10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15: 231, 1e-10: 232, 1e-4: 233, 1e-3: 234, 1e-2: 235, 5: 236}
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(
        data, predictors, alpha_ridge[i], models_to_plot
    )
c:\users\you\anaconda3\lib\site-packages\scipy\linalg\basic.py:40: RuntimeWarning: scipy.linalg.solve
Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number/precision: 3.493840061759174e-17 / 1.1102230246251565e-16
  RuntimeWarning)
In [13]:
coef_matrix_ridge
Out[13]:
rss intercept coef_x_1 coef_x_2 coef_x_3 coef_x_4 coef_x_5 coef_x_6 coef_x_7 coef_x_8 coef_x_9 coef_x_10 coef_x_11 coef_x_12 coef_x_13 coef_x_14 coef_x_15
alpha_1e-15 0.87 94 -3e+02 3.8e+02 -2.3e+02 65 0.56 -4.3 0.39 0.2 -0.028 -0.0069 0.0012 0.00019 -5.6e-05 4.1e-06 -7.8e-08
alpha_1e-10 0.92 11 -29 31 -15 2.9 0.17 -0.091 -0.011 0.002 0.00064 2.4e-05 -2e-05 -4.2e-06 2.2e-07 2.3e-07 -2.3e-08
alpha_1e-08 0.95 1.3 -1.5 1.7 -0.68 0.039 0.016 0.00016 -0.00036 -5.4e-05 -2.9e-07 1.1e-06 1.9e-07 2e-08 3.9e-09 8.2e-10 -4.6e-10
alpha_0.0001 0.96 0.56 0.55 -0.13 -0.026 -0.0028 -0.00011 4.1e-05 1.5e-05 3.7e-06 7.4e-07 1.3e-07 1.9e-08 1.9e-09 -1.3e-10 -1.5e-10 -6.2e-11
alpha_0.001 1 0.82 0.31 -0.087 -0.02 -0.0028 -0.00022 1.8e-05 1.2e-05 3.4e-06 7.3e-07 1.3e-07 1.9e-08 1.7e-09 -1.5e-10 -1.4e-10 -5.2e-11
alpha_0.01 1.4 1.3 -0.088 -0.052 -0.01 -0.0014 -0.00013 7.2e-07 4.1e-06 1.3e-06 3e-07 5.6e-08 9e-09 1.1e-09 4.3e-11 -3.1e-11 -1.5e-11
alpha_1 5.6 0.97 -0.14 -0.019 -0.003 -0.00047 -7e-05 -9.9e-06 -1.3e-06 -1.4e-07 -9.3e-09 1.3e-09 7.8e-10 2.4e-10 6.2e-11 1.4e-11 3.2e-12
alpha_5 14 0.55 -0.059 -0.0085 -0.0014 -0.00024 -4.1e-05 -6.9e-06 -1.1e-06 -1.9e-07 -3.1e-08 -5.1e-09 -8.2e-10 -1.3e-10 -2e-11 -3e-12 -4.2e-13
alpha_10 18 0.4 -0.037 -0.0055 -0.00095 -0.00017 -3e-05 -5.2e-06 -9.2e-07 -1.6e-07 -2.9e-08 -5.1e-09 -9.1e-10 -1.6e-10 -2.9e-11 -5.1e-12 -9.1e-13
alpha_20 23 0.28 -0.022 -0.0034 -0.0006 -0.00011 -2e-05 -3.6e-06 -6.6e-07 -1.2e-07 -2.2e-08 -4e-09 -7.5e-10 -1.4e-10 -2.5e-11 -4.7e-12 -8.7e-13

Its characteristics are:

  • As the penalty coefficient $\alpha$ increases, more and more coefficients become very small, but none of them reach exactly 0.
  • As $\alpha$ keeps increasing, underfitting eventually appears as well, but compared with LASSO at the same penalty level, RIDGE underfits less severely.
  • Because the coefficients never reach 0, RIDGE cannot perform variable selection (see the quick check after this list).
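As a quick check of the last point, counting exact zeros in `coef_matrix_ridge` the same way we did for LASSO should give 0 for every $\alpha$; again, a small addition of mine.

# RIDGE shrinks coefficients but never sets them exactly to zero
print(coef_matrix_ridge.apply(lambda row: sum(row.values == 0), axis=1))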

How do we choose a suitable $\alpha$?

For LASSO and RIDGE, the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) is commonly used to select $\alpha$. AIC is roughly a U-shaped function of the penalty strength; with everything else equal, a smaller AIC is better, so we simply pick the $\alpha$ at the lowest point of the AIC curve.
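For the LASSO side, scikit-learn provides `LassoLarsIC`, which fits the model along the regularization path and picks the $\alpha$ that minimizes AIC or BIC. The sketch below is my own illustration on the polynomial features; note it uses the LARS solver rather than the coordinate-descent `Lasso` above, so the selected $\alpha$ only indicates what this AIC/BIC procedure would choose.

from sklearn.linear_model import LassoLarsIC

predictors = ["x"] + ["x_%d" % i for i in range(2, 16)]

for criterion in ["aic", "bic"]:
    model = LassoLarsIC(criterion=criterion, normalize=True, max_iter=10000)
    model.fit(data[predictors], data["y"])
    print("%s-selected alpha: %.3g" % (criterion, model.alpha_))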

Summary

  • Key Difference
    • LASSO
      - LASSO performs feature selection: as we observed, many coefficients are driven to exactly 0.
    • RIDGE
      - RIDGE keeps all features.
  • Typical Use Cases
    • RIDGE
      - Mainly used to prevent overfitting. Because it keeps every feature, it is not very practical in very high dimensions (say, millions of features), where the computation becomes too expensive.
    • LASSO
      - Because it yields sparse solutions, LASSO is often used to model high-dimensional data. In that setting a sparse solution brings a large computational advantage, since features with zero coefficients can simply be ignored (see the sketch after this list).
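As a closing illustration of that last point, here is a sketch of my own using `make_regression` with invented parameters: 100 samples, 200 features, only 5 of which truly matter. `LassoCV` (choosing $\alpha$ by cross-validation, just for convenience here) typically keeps only a small fraction of the coefficients non-zero, while RIDGE keeps all 200.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, Ridge

# 100 samples, 200 features, only 5 informative ones (all parameters made up for illustration)
X_hd, y_hd = make_regression(n_samples=100, n_features=200, n_informative=5,
                             noise=1.0, random_state=0)

lasso_hd = LassoCV(cv=5, max_iter=10000).fit(X_hd, y_hd)  # alpha chosen by cross-validation
ridge_hd = Ridge(alpha=1.0).fit(X_hd, y_hd)

print("non-zero LASSO coefficients:", int(np.sum(lasso_hd.coef_ != 0)))
print("non-zero RIDGE coefficients:", int(np.sum(ridge_hd.coef_ != 0)))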