Linear Regression
Linear regression assumes that the features and the outcome satisfy a linear relationship. Despite its simplicity, a linear model is quite expressive: the influence of each feature on the result is reflected directly by its weight.
Moreover, each feature can first be mapped through a (basis) function and then enter the linear combination; in this way the model can express nonlinear relationships between the original features and the result.
The linear model can be expressed as \(f(X)=W^{T}X+b\)
\(b\) (bias parameter): compensates for the difference between the mean of the target values and the weighted mean of the basis function values
The goal is for the model's predictions to be as close as possible to the true values: \(f(x_{i}) \approx y_{i}\)
The reason we ask for "as close as possible" rather than exact equality is that we can only learn from a sample drawn from the full population of a certain type of event; the sample cannot cover every possibility, so the best we can do is learn the overall pattern.
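As a quick illustration, here is a minimal NumPy sketch of computing \(f(X)=W^{T}X+b\) for a batch of samples (the toy values of `X`, `W`, and `b` below are made up for the example):

```python
import numpy as np

def predict(X, W, b):
    """Linear model f(X) = X @ W + b, applied row-wise to a batch of samples."""
    return X @ W + b

# toy example: 3 samples with 2 features each (values chosen arbitrarily)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W = np.array([0.5, -1.0])   # one weight per feature
b = 2.0                     # bias parameter
print(predict(X, W, b))     # -> [ 0.5 -0.5 -1.5]
```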
Here we use the mean squared error (MSE) to measure the difference between \(\hat{y}\) and \(y\): \(\mathrm{MSE}=\frac{1}{n} \sum_{i=1}^{n}\left(f(x_{i})-y_{i}\right)^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}\)
The difference can be positive or negative, so squaring removes the sign and keeps individual errors from cancelling out. Squaring also amplifies errors whose absolute value exceeds 1 and shrinks those below 1. In addition, MSE cannot cope well with features whose scales differ widely across dimensions, so the data should be normalized before modeling.
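A short sketch of the MSE computation (the toy arrays are just for illustration):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: (1/n) * sum((y_hat - y)^2)."""
    return np.mean((y_hat - y) ** 2)

y_true = np.array([3.0, -0.5, 2.0])   # made-up ground truth
y_pred = np.array([2.5,  0.0, 2.0])   # made-up predictions
print(mse(y_pred, y_true))            # (0.25 + 0.25 + 0) / 3 ≈ 0.1667
```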
Derivation of the parameter estimates in the linear regression model (single-feature case):
\(\left(w^{*}, b^{*}\right) =\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(f\left(x_{i}\right)-y_{i}\right)^{2} =\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}\)
\(\frac{\partial E_{(w, b)}}{\partial w}=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right), \quad\) \(\frac{\partial E_{(w, b)}}{\partial b}=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)\)
Setting both derivatives to zero gives the closed-form solution:
\(w=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}, \quad\) \(b=\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\)
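These closed-form expressions translate directly into code. A minimal sketch, using noiseless toy data generated from y = 2x + 1 purely to check the formulas:

```python
import numpy as np

def fit_simple(x, y):
    """Closed-form w and b for one-feature linear regression (formulas above)."""
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    b = np.mean(y - w * x)
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1                 # noiseless data: should recover w = 2, b = 1
print(fit_simple(x, y))       # -> (2.0, 1.0)
```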
In matrix form (each \(x_{i}\) is augmented with a constant 1 so that \(b\) is absorbed into \(w\)), setting the gradient of \(E(w)=(Y-Xw)^{T}(Y-Xw)\) with respect to \(w\) to zero:
\(\frac{\partial E}{\partial w} = -2X^{T}(Y-Xw) = 2X^{T}Xw - 2X^{T}Y = 0\)
\(X^{T}Xw = X^{T}Y \rightarrow \hat{w} = (X^{T}X)^{-1}X^{T}Y\)
In practice \(X^{T}X\) is often not a full-rank matrix (for example, when there are more features than samples or when features are collinear), so it cannot be inverted directly; a common remedy is to add a regularization term:
\(\hat{w} = (X^{T}X+ \lambda I)^{-1}X^{T}Y\)
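A minimal sketch of the (regularized) normal equation. For simplicity this version also penalizes the bias column, and it calls `np.linalg.solve` rather than forming the inverse explicitly; the toy data follows y = 2x + 1:

```python
import numpy as np

def fit_normal_equation(X, y, lam=0.0):
    """Solve (X^T X + lam*I) w = X^T y; lam = 0 gives the plain normal equation."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # solving the linear system is more stable than computing the inverse
    return np.linalg.solve(A, X.T @ y)

X = np.array([[1.0], [2.0], [3.0]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb b into w via a ones column
y = np.array([3.0, 5.0, 7.0])                     # y = 2x + 1
print(fit_normal_equation(X_aug, y))              # ≈ [2., 1.]
print(fit_normal_equation(X_aug, y, lam=0.1))     # slightly shrunk by the ridge term
```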
Alternatively, the parameters can be updated by gradient descent: multiple iterations are needed to converge to the global minimum (the MSE loss is convex) and a proper learning rate has to be chosen, but this method does not require computing \((X^{T}X)^{-1}\).
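A rough sketch of batch gradient descent on the MSE loss; the learning rate and iteration count are arbitrary choices that happen to work for the toy data:

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on MSE; no matrix inverse is needed."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        y_hat = X @ w + b
        error = y_hat - y
        grad_w = (2.0 / m) * (X.T @ error)   # d(MSE)/dw
        grad_b = (2.0 / m) * np.sum(error)   # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])        # y = 2x + 1
print(fit_gradient_descent(X, y))    # ≈ (array([2.]), 1.0)
```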