First, we randomly generate a set of data \begin{equation}\label{eq:1} y = \sin(x) + \epsilon, \qquad x \in [\pi/3, 5\pi/3] \end{equation} where $\epsilon$ is a random perturbation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams["figure.figsize"] = 12, 10

# x runs from 60 to 296 degrees in steps of 4, converted to radians
x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])
np.random.seed(10)
# y = sin(x) plus Gaussian noise with standard deviation 0.15
y = np.sin(x) + np.random.normal(0, 0.15, len(x))
data = pd.DataFrame(np.column_stack([x, y]), columns=["x", "y"])
plt.plot(data["x"], data["y"], ".")
Because of the noise, the data do not follow the $\sin$ curve exactly. We now fit the data with polynomial regression, adding powers of $x$ from 1 to 15 as features: \begin{equation}\label{eq:2} y = \theta_0 + \theta_1 x^1 + \dots + \theta_n x^n, \qquad n = 1, \dots, 15 \end{equation}
for i in range(2, 16):
    colname = "x_%d" % i
    data[colname] = data["x"] ** i
print(data.head())
We build 15 separate linear regression models; the $i$-th model uses the features $x^1$ through $x^i$. For example, the 8th model uses $\{x^1, x^2, x^3, \dots, x^8\}$.
from sklearn.linear_model import LinearRegression


def linear_regression(data, power, models_to_plot):
    """Fit a linear regression on powers of x up to `power`.

    Parameters
    ----------
    data : pandas.DataFrame
        DataFrame containing the columns 'x', 'x_2', ... and 'y'.
    power : int
        Maximum power of x to include as a predictor.
    models_to_plot : dict
        Maps a power to a subplot position; a plot is drawn only for
        powers present in this dict.

    Returns
    -------
    list
        [RSS, intercept, coef_x, coef_x_2, ..., coef_x_power], where RSS is
        the sum of squared errors between predicted and actual values on
        the training data.
    """
    # Initialize predictors
    predictors = ["x"]
    if power >= 2:
        predictors.extend(["x_%d" % i for i in range(2, power + 1)])

    # Fit the model (note: normalize=True was removed in recent scikit-learn
    # releases; standardize the features beforehand if you use a newer version)
    linreg = LinearRegression(normalize=True)
    linreg.fit(data[predictors], data["y"])
    y_pred = linreg.predict(data[predictors])

    # Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data["x"], y_pred)
        plt.plot(data["x"], data["y"], ".")
        plt.title("Plot for power: %d" % power)

    # Return the result in the pre-defined format
    rss = sum((y_pred - data["y"]) ** 2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret
# Initialize a dataframe to store the results
col = ["rss", "intercept"] + ["coef_x_%d" % i for i in range(1, 16)]
ind = ["model_pow_%d" % i for i in range(1, 16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

# Define the powers for which a plot is required
models_to_plot = {1: 231, 3: 232, 6: 233, 9: 234, 12: 235, 15: 236}

# Iterate through all powers and collect the results
for i in range(1, 16):
    coef_matrix_simple.iloc[i - 1, 0 : i + 2] = linear_regression(
        data, power=i, models_to_plot=models_to_plot
    )
We can see that, as the power of $x$ increases, the model tries harder and harder to pass through every data point. The fit on the training data keeps improving, yet experience tells us the model is overfitting.
Why does this happen? Let us look at the coefficients of the models we just fitted:
# Set the display format to be scientific for ease of analysis
pd.options.display.float_format = "{:,.2g}".format
coef_matrix_simple
An obvious trend is that the magnitude of the coefficients grows from single digits to the order of hundreds of thousands or even millions. Intuitively, large coefficients amplify tiny changes in $x$, and by letting large positive and negative terms cancel each other the model manages to pass through every data point, noise included.
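To make this concrete, here is a minimal sketch that reuses the `coef_matrix_simple` table built above and evaluates the power-15 model term by term at a single point; the individual terms should come out far larger in magnitude than the final prediction, and mostly cancel each other:

```python
# Evaluate the power-15 model term by term at the first data point
row = coef_matrix_simple.loc["model_pow_15"]
x0 = data["x"].iloc[0]
terms = [row["coef_x_%d" % i] * x0 ** i for i in range(1, 16)]
print("largest |term|   : %.3g" % max(abs(t) for t in terms))
print("prediction at x0 : %.3g" % (row["intercept"] + sum(terms)))
print("actual y at x0   : %.3g" % data["y"].iloc[0])
```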
How do we deal with this problem (overfitting)? An intuitive remedy is to introduce a penalty, for example $L1$ or $L2$ regularization. LASSO and RIDGE are exactly such methods.
LASSO stands for Least Absolute Shrinkage and Selection Operator; it uses $L1$ regularization.
On top of the earlier objective function, it adds a penalty on the absolute values of the coefficients: \begin{equation}\label{eq:3} \hat{w} = \arg\min_w \; \|y - Xw\|_2^2 + \alpha \|w\|_1 \end{equation} The goal is to strike a balance: on the one hand the fitting error should be small; on the other hand the coefficients should not be too large, i.e. the model should not be too complex.
$\alpha$ controls the strength of the penalty: $\alpha = 0$ recovers ordinary least squares, while a very large $\alpha$ drives all coefficients toward zero.
Let us now fit the same data with LASSO regression:
from sklearn.linear_model import Lasso


def lasso_regression(data, predictors, alpha, models_to_plot={}):
    # Fit the model (note: normalize=True was removed in recent scikit-learn
    # releases; standardize the features beforehand if you use a newer version)
    lassoreg = Lasso(alpha=alpha, normalize=True, max_iter=100000)
    lassoreg.fit(data[predictors], data["y"])
    y_pred = lassoreg.predict(data[predictors])

    # Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data["x"], y_pred)
        plt.plot(data["x"], data["y"], ".")
        plt.title("Plot for alpha: %.3g" % alpha)

    # Return the result in the pre-defined format
    rss = sum((y_pred - data["y"]) ** 2)
    ret = [rss]
    ret.extend([lassoreg.intercept_])
    ret.extend(lassoreg.coef_)
    return ret
# Initialize predictors to all 15 powers of x
predictors = ["x"]
predictors.extend(["x_%d" % i for i in range(2, 16)])

# Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5, 1e-4, 1e-3, 1e-2, 1, 5, 10]

# Initialize the dataframe to store coefficients
col = ["rss", "intercept"] + ["coef_x_%d" % i for i in range(1, 16)]
ind = ["alpha_%.2g" % alpha_lasso[i] for i in range(0, 10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

# Define the models to plot
models_to_plot = {1e-10: 231, 1e-5: 232, 1e-4: 233, 1e-3: 234, 1e-2: 235, 1: 236}

# Iterate over the 10 alpha values
for i in range(10):
    coef_matrix_lasso.iloc[i] = lasso_regression(
        data, predictors, alpha_lasso[i], models_to_plot
    )
coef_matrix_lasso
Looking at the table, a few things stand out: as $\alpha$ increases, the RSS on the training data grows, the coefficients shrink, and more and more of them become exactly zero; for the largest values of $\alpha$ essentially all of them are zero.
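As a quick check (a minimal sketch using the `coef_matrix_lasso` table built above), we can count how many coefficients are exactly zero for each $\alpha$:

```python
# Count, for each alpha, how many of the 15 coefficients are exactly zero
coef_cols = ["coef_x_%d" % i for i in range(1, 16)]
coef_matrix_lasso[coef_cols].apply(lambda row: sum(row.values == 0), axis=1)
```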
A word about RIDGE: it uses $L2$ regularization, and its objective function is
\begin{equation}\label{eq:4} \hat{w} = \arg\min_w \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2 \end{equation}
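To get a feel for the difference between the two penalties, here is a tiny sketch on a made-up coefficient vector (the numbers are purely illustrative): the squared $L2$ penalty punishes a single very large coefficient far more heavily than the absolute-value $L1$ penalty does.

```python
# Compare the two penalty terms on the same (illustrative) coefficient vector
w = np.array([0.5, -2.0, 100.0])
alpha = 0.1
print("L1 penalty:", alpha * np.abs(w).sum())   # 0.1 * 102.5    = 10.25
print("L2 penalty:", alpha * (w ** 2).sum())    # 0.1 * 10004.25 = 1000.425
```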
We fit the same data with RIDGE; the procedure is very similar to LASSO:
from sklearn.linear_model import Ridge


def ridge_regression(data, predictors, alpha, models_to_plot={}):
    # Fit the model (note: normalize=True was removed in recent scikit-learn
    # releases; standardize the features beforehand if you use a newer version)
    ridgereg = Ridge(alpha=alpha, normalize=True)
    ridgereg.fit(data[predictors], data["y"])
    y_pred = ridgereg.predict(data[predictors])

    # Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data["x"], y_pred)
        plt.plot(data["x"], data["y"], ".")
        plt.title("Plot for alpha: %.3g" % alpha)

    # Return the result in the pre-defined format
    rss = sum((y_pred - data["y"]) ** 2)
    ret = [rss]
    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)
    return ret
# Initialize predictors to the set of 15 powers of x
predictors = ["x"]
predictors.extend(["x_%d" % i for i in range(2, 16)])

# Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3, 1e-2, 1, 5, 10, 20]

# Initialize the dataframe for storing coefficients
col = ["rss", "intercept"] + ["coef_x_%d" % i for i in range(1, 16)]
ind = ["alpha_%.2g" % alpha_ridge[i] for i in range(0, 10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

# Define the models to plot
models_to_plot = {1e-15: 231, 1e-10: 232, 1e-4: 233, 1e-3: 234, 1e-2: 235, 5: 236}

# Iterate over the 10 alpha values
for i in range(10):
    coef_matrix_ridge.iloc[i] = ridge_regression(
        data, predictors, alpha_ridge[i], models_to_plot
    )
coef_matrix_ridge
RIDGE shows a different pattern: as $\alpha$ increases the coefficients shrink smoothly toward zero, but none of them become exactly zero.
How do we choose a suitable $\alpha$?
For LASSO and RIDGE, $\alpha$ is usually chosen with the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). AIC is roughly a U-shaped function of the penalty strength; other things being equal, a smaller AIC is better, so we simply pick the $\alpha$ at the bottom of the AIC curve.
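As a minimal sketch of this idea, scikit-learn's `LassoLarsIC` fits the LASSO path and picks the $\alpha$ that minimizes AIC or BIC (the features are standardized first, since the models above relied on `normalize=True`):

```python
from sklearn.linear_model import LassoLarsIC
from sklearn.preprocessing import scale

# Standardize the predictors, then let AIC/BIC pick alpha along the LASSO path
X_scaled = scale(data[predictors])
for criterion in ("aic", "bic"):
    ic_model = LassoLarsIC(criterion=criterion)
    ic_model.fit(X_scaled, data["y"])
    print("%s-selected alpha: %.3g" % (criterion.upper(), ic_model.alpha_))
```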
- LASSO performs feature selection: as we just observed, many coefficients are driven exactly to zero.
- RIDGE keeps all the features; the coefficients shrink toward zero but do not become exactly zero.
- RIDGE is mainly used to prevent overfitting. Because it keeps every feature, it is not very practical when the number of features is extremely large (say, in the millions), since the computation becomes too expensive.
- Because LASSO produces sparse solutions, it is often the choice for high-dimensional modelling; a sparse solution is a big computational advantage, since features with zero coefficients can simply be ignored (see the sketch after this list).
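A minimal sketch of that last point, reusing the `data` and `predictors` defined above (the value `alpha=0.01` is just an illustrative choice, not a tuned one):

```python
from sklearn.linear_model import Lasso
from sklearn.preprocessing import scale

# Standardize the features, fit a LASSO model, and read off the features
# whose coefficients survive (i.e. are non-zero)
X_scaled = scale(data[predictors])
lassoreg = Lasso(alpha=0.01, max_iter=100000)
lassoreg.fit(X_scaled, data["y"])
selected = [p for p, c in zip(predictors, lassoreg.coef_) if c != 0]
print("features kept by LASSO (alpha=0.01):", selected)
```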