
Link to Project

As part of my roadmap to becoming a data scientist, I knew I needed to tackle tough projects that would help me build a strong foundation in ML, something I could show to others to demonstrate my skills. When I came across a video by Infinite Codes called 22 Machine Learning Projects That Will Make You A God At Data Science, I found both the inspiration and a plan for building foundational projects. One of the suggested foundational projects was building Linear Regression from scratch.

While I was refreshing my statistical concepts with Khan Academy, I realized that not much time was spent on regression lines and their associated equations. So when I moved on to the Introduction to Statistical Learning in Python course offered for free by Stanford, I was blown away by the sheer amount of math behind the statistical concepts I already knew. To be fair, I would have been blown away regardless, but a stronger foundation in and familiarity with the math wouldn’t have hurt. So when I came across the idea for this project, I knew I should attempt it to build up a strong ML and statistical foundation. As Infinite Codes says in their video: “if you can’t understand the basics of how machine learning works, good luck explaining to your boss why your deep learning model is making weird predictions.”

For this project, I chose the dataset that scikit-learn uses to demonstrate regression: the diabetes dataset. It is clean, well-structured, and contains only numeric data, making it perfect for regression.

Ordinary Least Squares (OLS) Linear Regression

This part wasn’t so terrible. The math is fairly straightforward:

\[m = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}\] \[b = \bar{y} - m\bar{x}\]

Where the formula for the regression line is:

\[y=mx+b\]

While I don’t yet have a deep intuition for why these formulas work mathematically, I understand that they minimize the sum of squared residuals, ensuring the best possible fit for the data.
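
To make these formulas concrete, here is a minimal sketch of how the slope and intercept could be computed with NumPy. The function name and variables are illustrative, not the exact code from my project:

import numpy as np

def ols_fit(x, y):
    """Fit y = m*x + b by ordinary least squares for a single feature."""
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations over sum of squared deviations (S_xy / S_xx)
    m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: makes the line pass through the point (x_bar, y_bar)
    b = y_bar - m * x_bar
    return m, b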

Using this approach, I was able to perform linear regression with one feature. But I knew that in the real world, multiple features influence the target variable, so I needed to learn how to perform multiple linear regression.

Gradient Descent for Multiple Linear Regression

This was the part I dreaded the most. Conceptually, though, it’s not that difficult. Here are the steps for Gradient Descent (a toy single-parameter sketch follows the list):

  1. Take the derivative of the Loss Function for each parameter in it. (Take the Gradient of the Loss Function)
    • The Loss Function will be Mean Squared Error (MSE)
  2. Pick random values for the parameters
  3. Plug the parameter values into the derivatives (Gradient)
  4. Calculate the step size (step size = slope * learning rate)
  5. Calculate new parameters (new parameter = old parameter - step size)
  6. Repeat 3-5 until stopping condition met (number of iterations or convergence threshold)
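
As a toy illustration of steps 2 through 6, here is a sketch that uses gradient descent to fit a single parameter b in the model y = b (which should converge to the mean of y). The data, learning rate, and iteration count are arbitrary values chosen for the example, not settings from my project:

import numpy as np

y = np.array([2.0, 4.0, 6.0])    # tiny example dataset
b = 0.0                          # step 2: start from an arbitrary parameter value
learning_rate = 0.1

for _ in range(100):             # step 6: stop after a fixed number of iterations
    # Step 3: plug b into the derivative of MSE = (1/n) * sum((b - y_i)^2),
    # which works out to 2 * mean(b - y)
    slope = 2 * np.mean(b - y)
    # Step 4: step size = slope * learning rate
    step_size = slope * learning_rate
    # Step 5: new parameter = old parameter - step size
    b = b - step_size

print(b)  # approaches y.mean() == 4.0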

Mathematical Formulation

The cost function (Mean Squared Error, MSE) is:

\[J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2\]

Where:

  • \(h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n\) (the prediction function)
  • \(m\) is the number of training examples

The gradient update rule for each parameter \(\theta_j\) is:

\[\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}\]

And in vectorized form:

\[\theta := \theta - \alpha \frac{1}{m} X^T (X\theta - y)\]

This formulation allowed me to efficiently compute updates for all parameters at once using NumPy.
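
Here is a minimal sketch of what that vectorized update can look like in NumPy. It assumes X already includes a leading column of ones for the intercept, and the default learning rate and iteration count are placeholders rather than the values I actually used:

import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Vectorized gradient descent for linear regression.

    X is an (m, n) design matrix whose first column is all ones (intercept),
    y is an (m,) vector of targets.
    """
    m, n = X.shape
    theta = np.zeros(n)                        # start all parameters at zero
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m   # (1/m) * X^T (X theta - y)
        theta = theta - alpha * gradient       # theta := theta - alpha * gradient
    return theta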

Evaluating the Regression Lines

To evaluate the models trained using OLS and Gradient Descent, I implemented a regression summary function. This function calculated Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-Squared, and Adjusted R-Squared as follows:

n = len(X)                # number of observations
preds = predict(X, m, b)  # model predictions for each observation
residuals = [y_n - y_hat for y_n, y_hat in zip(y, preds)]

# R-squared: proportion of the variance in y explained by the model
y_mean = y.mean()
ss_total = sum((y_n - y_mean)**2 for y_n in y)
ss_residual = sum(r**2 for r in residuals)
r_squared = 1 - (ss_residual / ss_total)

# Adjusted R-squared: penalizes R-squared for the number of predictors
k = X.shape[1]  # Number of predictors
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# MSE: average squared residual; RMSE: the same error in the units of y
mse = ss_residual / n
rmse = mse ** 0.5
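
For context, here is a rough end-to-end sketch of computing these metrics on the scikit-learn diabetes dataset, reusing the single-feature OLS sketch from earlier. Everything beyond load_diabetes itself is illustrative rather than my project’s exact code:

import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)   # 442 samples, 10 numeric features
bmi = X[:, 2]                           # use BMI as the single predictor
m, b = ols_fit(bmi, y)                  # slope and intercept from the OLS sketch above
preds = m * bmi + b
residuals = y - preds

ss_residual = np.sum(residuals ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
print("R-squared:", 1 - ss_residual / ss_total)
print("RMSE:", np.sqrt(ss_residual / len(y)))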

OLS vs. Gradient Descent: A Comparison

| Method | Pros | Cons |
| --- | --- | --- |
| Ordinary Least Squares | Exact solution, interpretable | Slow for large datasets |
| Gradient Descent | Scalable, works with large datasets | Requires tuning learning rate, iterative |

Challenges and Learning

By far the biggest challenge in this project was Gradient Descent. As mentioned previously, I have rarely, if ever, worked with matrix math at this depth, so figuring out how transposing and multiplying matrices actually work seemed daunting at first.

But after spending time researching and reading through online college material, I gained a better understanding and was able to implement the equations I found in my own work. Luckily, implementing OLS and the regression summary function involved similar research: looking up equations and applying them.

While my grasp of these equations is not yet where I’d like it to be, I’m proud that I was able to implement them and gain a better understanding in the process.

Conclusion

This project reinforced the importance of understanding fundamental machine learning math. Linear regression is one of the most widely used models in statistics and ML, and being able to implement it from scratch has given me deeper insight into how models work under the hood.

In the future, I will continue focusing on the foundational math behind statistical learning and revisit this project as I learn more related concepts. For any aspiring data scientist, I highly recommend trying to implement linear regression from scratch—it’s an invaluable learning experience!