Linear regression from scratch

Joseph El Kettaneh
8 min read · Jan 10, 2023


Linear regression is a statistical method used to model the linear relationship between a dependent variable and one or more independent variables. It is a way to predict the value of the dependent variable based on the values of the independent variables.

In linear regression, the relationship between the independent and dependent variables is represented by a linear equation of the form:

y = b + w1*x1 + w2*x2 + … + wn*xn

where y is the dependent variable, b is the bias term, and x1, x2, …, xn are the independent variables. The values w1, w2, …, wn are the weights or coefficients that represent the strength of the effect of each independent variable on the dependent variable.
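As a quick illustration, here is a minimal sketch (with made-up weights and feature values) of how this equation is evaluated with NumPy as a dot product plus the bias:

import numpy as np

b = 2.0                          # bias term (made up)
w = np.array([0.5, -1.2, 3.0])   # weights w1, w2, w3 (made up)
x = np.array([4.0, 1.0, 0.5])    # feature values x1, x2, x3 (made up)

# y = b + w1*x1 + w2*x2 + w3*x3
y = b + w @ x
print(y)  # 4.3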

Linear regression can be used to model continuous variables as well as categorical variables that have been encoded as numerical values. It is a widely used method for modeling and predicting quantitative outcomes, and it is relatively simple to implement and interpret.

Real Life example

Here’s a simple example of linear regression:

Imagine you are a real estate agent, and you want to create a linear regression model to predict the sale price of houses in your area based on their square footage. You collect data on the sale prices and square footage of several houses that have recently sold, and you want to use that data to train a model that can predict the sale price of a house based on its square footage.

For each house you record two numbers: its square footage (the input) and its sale price (the output).

The goal of linear regression is to find a line that best fits this data, where the “best” line is the one that minimizes the difference between the predicted values (y-hat) and the actual values (y) for each data point.

Using this data, we can create a linear regression model to predict the sale price of a house based on its square footage. We can represent this model mathematically using the following equation:

y = mx + b

Where:

  • y is the sale price of the house
  • x is the square footage of the house
  • m is the slope of the line (the “coefficient” on x)
  • b is the y-intercept of the line (the point where the line crosses the y-axis)

You can use optimization techniques like gradient descent to find the value of m and b that minimizes the difference between the predicted values and the actual values in the data.
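As a rough sketch of how this could look for the single-feature case (gradient_descent here is a hypothetical helper, not the method we implement below), gradient descent repeatedly nudges m and b in the direction that reduces the mean squared error:

import numpy as np

def gradient_descent(x, y, lr=1e-7, n_iters=1000):
    # Hypothetical sketch: fit y ≈ m*x + b by minimizing the MSE.
    # The learning rate lr must be tuned to the scale of the features.
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        y_hat = m * x + b
        # Partial derivatives of the MSE with respect to m and b
        grad_m = (-2.0 / n) * np.sum(x * (y - y_hat))
        grad_b = (-2.0 / n) * np.sum(y - y_hat)
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b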

Once you have the value of m and b, you can use your model to predict the sale price of a house based on its square footage. For example, if you wanted to predict the sale price of a house with 2000 square feet, you would plug that value into the equation:

y = (m*2000) + b

The result of that equation would be the predicted sale price of the house.

Note that you might need to preprocess the data before applying a linear regression model. Linear regression also assumes that the relationship between the independent and dependent variables is linear, that the errors have constant variance (homoscedasticity), and that the errors are independent; if these assumptions do not hold, another model may be more suitable.
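One common way to eyeball these assumptions is to plot the residuals against the predicted values. A minimal sketch, assuming you already have arrays y of true values and y_pred of predictions:

import matplotlib.pyplot as plt

residuals = y - y_pred
plt.scatter(y_pred, residuals, s=15)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted value')
plt.ylabel('Residual (y - y_pred)')
plt.title('Residuals vs. predicted values')
plt.show()
# Residuals scattered evenly around zero, with no curve or funnel shape,
# are consistent with linearity and constant variance.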

Let’s implement it from scratch

import numpy as np

def fit_linear_regression(X, y):
    # Add a column of ones to X so the first weight acts as the bias term
    X = np.hstack([np.ones((X.shape[0], 1)), X])

    # Solve for the weights using the normal equation
    weights = np.linalg.inv(X.T @ X) @ X.T @ y
    b = weights[0]  # the bias (intercept) term, just for the info

    return weights

def predict(X, weights):
    # Add a column of ones to X, matching what was done during fitting
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    h = X @ weights

    return h

The function fit_linear_regression takes as input a dataset X and a one-dimensional output y, and fits a linear model to the given data by finding the optimal weights w (including the bias term b) using the normal equation. The normal equation is a closed-form solution to linear regression that finds the weights that minimize the sum of the squared errors between the predicted and true outputs.
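Written out, the normal equation computed in the code above is:

w = (Xᵀ X)⁻¹ Xᵀ y

where X is the input matrix with the column of ones prepended, y is the vector of true outputs, and w is the vector of weights (its first entry being the bias b).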

The predict function takes as input a matrix X and the weights of a linear model (which already include the bias term), and returns the predictions of the linear model on X. It also adds a column of ones to X, so the matrix product of X and the weights amounts to the dot product of each row with the weights plus the bias term.

Evaluation of the model

To evaluate the performance of a linear regression model, we can use two metrics:

1- The mean squared error (MSE):

The mean squared error is a measure of the average squared difference between the predicted values and the true values.
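In symbols, for n data points it is:

MSE = (1/n) * Σᵢ (yᵢ − ŷᵢ)²

where ŷᵢ is the model's prediction for the i-th point.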

def mse(y, y_pred):
    return np.square(y - y_pred).mean()

2- The coefficient of determination (R²):

The coefficient of determination (R²) is a measure of how well the linear model fits the data. It is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. It is at most 1, with values closer to 1 indicating a better fit (on held-out data it can even be negative for a very poor model).
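In symbols:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

where ȳ is the mean of the true values, which is exactly what the function below computes: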

def r2(y, y_pred):
    return 1 - np.square(y - y_pred).sum() / np.square(y - np.mean(y)).sum()

Let’s try our model and see how it behaves on a Dataset

The code below loads the Boston Housing dataset from scikit-learn’s datasets module and then splits it into training and test sets using the train_test_split() function from scikit-learn's model_selection module.

The Boston Housing dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It includes 506 observations on 13 different variables. The 'data' key of the returned object holds the input features, and 'target' holds the associated target variable, which is the median value of owner-occupied homes in $1,000s.

The train_test_split() function is used to split the data into a training set and a test set. The test_size argument specifies the proportion of the data that should be set aside for testing, and the random_state argument ensures that the data is split in the same way each time the code is run. So, in this case, the code splits the data into a 70/30 train/test split with random_state=42.

from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_boston()

X, y = data['data'], data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
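One caveat: load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so on newer versions the call above raises an error. If that happens, one option is to swap in another regression dataset such as California housing (the exact numbers printed later in this article will of course differ):

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data['data'], data['target']   # target: median house value per district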

Now, with this code, we’re using the fit_linear_regression() function to fit a linear regression model to the training data; it returns the weights (which include the bias term) for the line of best fit. Then we pass the test data and the weights to the predict function to make predictions on the test data.

# Fit the model to the data
weights = fit_linear_regression(X_train, y_train)
y_pred = predict(X_test, weights)

To evaluate our predictions:

1- By using the Mean Squared Error metric:

mse_ = mse(y_test, y_pred)
print("Mean squared error= ",mse_)

This code calculates the mean squared error (MSE) between the true values of the test set (y_test) and the predicted values produced by the model (y_pred), using the mse() function implemented above.

The Mean Squared Error (MSE) is one of the most commonly used metrics for evaluating the performance of a model. It’s calculated by taking the average of the squared differences between the predicted values and the true values. It’s a measure of how well the model fits the data, and a lower MSE value indicates a better fit.

It’s important to note that MSE is sensitive to the scale of the target variable. If the scale of the target variable is large, the MSE will also be large, which can make it difficult to interpret. Additionally, MSE is sensitive to outliers: a few points with large errors can dominate the average, so it should be interpreted with care when outliers are present.
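A tiny made-up example of this sensitivity, using the mse() function defined above:

import numpy as np

y_small = np.array([10.0, 12.0, 11.0, 13.0])
y_small_pred = np.array([10.5, 11.5, 11.0, 13.5])
print(mse(y_small, y_small_pred))         # 0.1875: small errors, small MSE

y_with_outlier = np.array([10.0, 12.0, 11.0, 50.0])  # one outlying true value
print(mse(y_with_outlier, y_small_pred))  # ~333.2: a single point dominates the average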

2- By using the R-squared metric:

r2_ = r2(y_test, y_pred)
print("R Squared= ", r2_)

This code is calculating the R-squared value between the true values of the test set (y_test) and the predicted values produced by the model (y_pred) using the r2() function.

R-squared is a statistical measure of how close the data are to the fitted regression line. A value of 1 indicates a perfect fit, meaning all of the variation in the dependent variable is explained by the independent variable(s); values near 0 (or below 0 on held-out data) indicate a poor fit.

The full code

import numpy as np

def fit_linear_regression(X, y):
    # Add a column of ones to X so the first weight acts as the bias term
    X = np.hstack([np.ones((X.shape[0], 1)), X])

    # Solve for the weights using the normal equation
    weights = np.linalg.inv(X.T @ X) @ X.T @ y
    b = weights[0]  # the bias (intercept) term

    return weights

def predict(X, weights):
    # Add a column of ones to X, matching what was done during fitting
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    h = X @ weights

    return h

def mse(y, y_pred):
    return np.square(y - y_pred).mean()

def r2(y, y_pred):
    return 1 - np.square(y - y_pred).sum() / np.square(y - np.mean(y)).sum()


from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_boston()  # removed in scikit-learn 1.2; see the note above about alternatives

X, y = data['data'],data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the model to the data
weights= fit_linear_regression(X_train, y_train)
y_pred = predict(X_test, weights)

mse_ = mse(y_test, y_pred)
print("Mean squared error= ",mse_)

r2_ = r2(y_test, y_pred)
print("R Squared= ", r2_)

# Check how close the predicted values are to the real ones
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(y_pred, 'b.', marker='*')
ax.plot(y_test, 'r.')
plt.legend(['Predictions','Real Values'])

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, color='#087E8B', edgecolors=(1, 1, 1), s=30)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2,label='Best fit line')
ax.plot([y_pred.min(), y_pred.max()], [y_pred.min(), y_pred.max()], 'b--', lw=2,label='Prediction line')
plt.title('Best fit line', size=20)
ax.set_xlabel('Observed data',size=12)
ax.set_ylabel('Predicted data',size=12)
plt.legend()
plt.show()


The results if you run the code

Mean squared error=  21.517444231176263
R Squared= 0.7112260057485059

We can see that the model predicted some values quite well, while for others the predictions were further from the real ones. This is partly because we kept the model basic and did not do any tuning or feature preprocessing, but all in all it did well.

Thank you!
