Ch 3: Linear Algebra & Calculus - Advanced¶
Read online or run locally
You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-03-linear-algebra/notebooks/03_advanced.ipynb in Jupyter.
Notebook 03 - Advanced¶
Calculus powers optimization in machine learning. Every gradient descent step, every backpropagation pass relies on derivatives.
What you'll learn:
- Derivatives and partial derivatives
- Gradients and the gradient descent algorithm
- Chain rule and backpropagation intuition
- Capstone: implement gradient descent for linear regression from scratch
Time estimate: 3.5 hours
Generated by Berta AI | Created by Luigi Pascal Rondanini
1. Why Calculus Matters for AI¶
Training a model = minimizing a loss function. We adjust parameters to make loss smaller.
- Derivative: how fast does f(x) change as x changes? Slope of the tangent line.
- Gradient: vector of partial derivatives — points in direction of steepest ascent.
- Gradient descent: step opposite to gradient to minimize loss.
2. Derivatives: Numerical Approximation¶
The derivative tells you how fast something is changing. Think of a speedometer: it doesn't measure where you are, it measures how quickly your position changes. The derivative f'(x) is the slope of the tangent line at x—how steep the function is at that point. If f'(x) > 0, f is increasing; if f'(x) < 0, f is decreasing. In ML, we minimize loss: we want to know "if I tweak this weight slightly, does the loss go up or down?" The derivative answers that.
Definition: \(f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}\)
We can approximate with small h (finite differences). Useful for checking hand-derived gradients.
def derivative_numerical(f, x, h=1e-5):
"""Approximate f'(x) using finite differences."""
return (f(x + h) - f(x - h)) / (2 * h)
# f(x) = x^2 => f'(x) = 2x
f = lambda x: x ** 2
x_val = 3
approx = derivative_numerical(f, x_val)
exact = 2 * x_val
print("f(x) = x², f'(3) = 2*3 = 6")
print(f" Numerical approx: {approx:.6f}")
print(f" Exact: {exact}")
3. Partial Derivatives¶
For f(x,y), partial derivative ∂f/∂x treats y as constant.
Example: L(w,b) = (wx + b - y)² (squared error for one sample)
- ∂L/∂w = 2(wx + b - y) · x
- ∂L/∂b = 2(wx + b - y) · 1
What just happened¶
We approximated f'(3) for f(x)=x² using (f(x+h) - f(x-h))/(2h) with h=1e-5. The result matches the exact derivative 2x=6. This "central difference" method is more accurate than one-sided differences. When you implement custom gradients, comparing to numerical approximations helps catch bugs.
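To see why the central form is preferred, you can compare both formulas against the exact derivative. A minimal sketch (for f(x) = x², where f'(3) = 6):

```python
def f(x):
    return x ** 2

x0, h = 3.0, 1e-5
forward = (f(x0 + h) - f(x0)) / h            # one-sided (forward) difference
central = (f(x0 + h) - f(x0 - h)) / (2 * h)  # central difference
exact = 2 * x0

print(f"forward error: {abs(forward - exact):.2e}")
print(f"central error: {abs(central - exact):.2e}")
```

For a quadratic, the central difference is exact up to floating-point roundoff, while the forward difference has error on the order of h itself.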
Partial derivatives: like asking how steep a hill is if you only walk north. With multiple variables, ∂f/∂x treats y as constant—you're slicing the landscape along one axis. Each partial tells you how to adjust one weight. In ML, we need all of them: the gradient is the vector of partial derivatives.
def partial_numerical(f, idx, point, h=1e-5):
"""Approximate partial derivative of f at point w.r.t. dimension idx."""
point_plus = point[:idx] + [point[idx] + h] + point[idx+1:]
point_minus = point[:idx] + [point[idx] - h] + point[idx+1:]
return (f(*point_plus) - f(*point_minus)) / (2 * h)
# f(x,y) = x² + 2xy + y² => ∂f/∂x = 2x+2y, ∂f/∂y = 2x+2y
def f_xy(x, y):
return x**2 + 2*x*y + y**2
point = [1.0, 2.0]
df_dx = partial_numerical(f_xy, 0, point)
df_dy = partial_numerical(f_xy, 1, point)
print("f(x,y) = x² + 2xy + y² at (1,2)")
print(f" ∂f/∂x ≈ {df_dx:.4f} (exact: 2*1+2*2=6)")
print(f" ∂f/∂y ≈ {df_dy:.4f} (exact: 2*1+2*2=6)")
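The squared-error partials derived above (∂L/∂w and ∂L/∂b) can be checked the same way. A self-contained sketch, with illustrative sample values w=1, b=0, x=2, y=5 (the helper is redefined so the cell runs on its own):

```python
def partial_numerical(f, idx, point, h=1e-5):
    """Central-difference partial derivative of f at point w.r.t. dimension idx."""
    plus = point[:idx] + [point[idx] + h] + point[idx+1:]
    minus = point[:idx] + [point[idx] - h] + point[idx+1:]
    return (f(*plus) - f(*minus)) / (2 * h)

x, y = 2.0, 5.0  # one (x, y) training sample

def loss(w, b):
    return (w * x + b - y) ** 2

w, b = 1.0, 0.0
err = w * x + b - y                     # = -3
num_dw = partial_numerical(loss, 0, [w, b])
num_db = partial_numerical(loss, 1, [w, b])
print(f"dL/dw: analytic={2 * err * x}, numeric={num_dw:.6f}")
print(f"dL/db: analytic={2 * err}, numeric={num_db:.6f}")
```

Both pairs should agree: the analytic values are -12 and -6. This kind of check is exactly how you catch sign errors in hand-derived gradients.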
4. The Gradient¶
Gradient ∇f = [∂f/∂x₁, ∂f/∂x₂, ...] — a vector pointing in direction of steepest ascent.
To minimize f, we step in the opposite direction: x ← x - α ∇f
def gradient_numerical(f, point, h=1e-5):
"""Compute gradient (vector of partial derivatives) at point."""
grad = []
for i in range(len(point)):
g = partial_numerical(f, i, point, h)
grad.append(g)
return grad
grad = gradient_numerical(f_xy, [1.0, 2.0])
print(f"∇f(1,2) = {grad}")
5. Gradient Descent Algorithm¶
initialize weights w
for epoch in 1..max_epochs:
compute gradient g = ∇L(w)
w = w - learning_rate * g
The gradient points uphill; gradient descent goes downhill. ∇f points in the direction of steepest ascent, so to minimize we step in the opposite direction: x ← x - α ∇f. The learning rate α controls step size: too small and convergence is slow; too large and steps overshoot the minimum, possibly diverging.
What just happened¶
We computed ∇f(1,2) = [6, 6] for f(x,y)=x²+2xy+y². At (1,2), the steepest ascent is in the (1,1) direction (both partials equal). To minimize, we'd step in (-1,-1).
def gradient_descent_simple(f, grad_f, w0, learning_rate=0.1, epochs=100):
"""Minimize f using gradient descent. grad_f(w) returns gradient at w."""
w = w0[:] # copy
history = [f(*w)]
for _ in range(epochs - 1):
g = grad_f(*w)
w = [w[i] - learning_rate * g[i] for i in range(len(w))]
history.append(f(*w))
return w, history
# Minimize f(x,y) = x² + y² (min at (0,0))
# ∇f = (2x, 2y)
def f_simple(x, y):
return x**2 + y**2
def grad_f_simple(x, y):
return [2*x, 2*y]
w_final, history = gradient_descent_simple(f_simple, grad_f_simple, [3.0, 4.0],
learning_rate=0.1, epochs=50)
print(f"Minimize f(x,y)=x²+y² starting at (3,4)")
print(f" Final w: ({w_final[0]:.6f}, {w_final[1]:.6f})")
print(f" Final loss: {history[-1]:.8f}")
6. Chain Rule¶
If L = f(g(x)), then \(\frac{dL}{dx} = \frac{dL}{df} \cdot \frac{df}{dg} \cdot \frac{dg}{dx}\)
Backpropagation = chain rule applied layer by layer through the network. Each layer multiplies its local gradient by the upstream gradient.
Learning rate: Too small = many tiny steps, slow convergence. Too large = overshoot the minimum, loss may oscillate or blow up. We used 0.1 here; for x²+y² from (3,4), 50 steps gets us close to (0,0). Try it yourself: Set learning_rate=0.5 or 0.01 and see how the final result and convergence change.
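The suggested experiment can be scripted directly. A self-contained sketch on the same f(x,y) = x² + y² objective (the learning-rate values are illustrative choices):

```python
def run_gd(lr, w0=(3.0, 4.0), epochs=50):
    """Gradient descent on f(x, y) = x² + y² with step size lr; returns final loss."""
    x, y = w0
    for _ in range(epochs):
        x, y = x - lr * 2 * x, y - lr * 2 * y  # ∇f = (2x, 2y)
    return x**2 + y**2

for lr in (0.01, 0.1, 0.9, 1.1):
    print(f"lr={lr}: final loss = {run_gd(lr):.3e}")
```

With lr=0.01 the loss shrinks slowly; lr=0.1 converges quickly; lr=0.9 oscillates but still converges; lr=1.1 diverges, with the loss growing every step.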
# Chain rule example: L = (wx + b - y)²
# Let u = wx + b - y, then L = u²
# dL/du = 2u, du/dw = x, du/db = 1, du/dx = w
# So: dL/dw = 2u * x, dL/db = 2u * 1
def mse_loss_gradients(w, b, x, y):
"""Gradients for L = (wx + b - y)² (one sample MSE)."""
pred = w * x + b
error = pred - y
dL_dpred = 2 * error # dL/du where u = pred - y
dL_dw = dL_dpred * x # chain: dL/dw = dL/dpred * dpred/dw
dL_db = dL_dpred * 1 # chain: dL/db = dL/dpred * dpred/db
return dL_dw, dL_db
# Check: w=1, b=0, x=2, y=5 => pred=2, error=-3
# dL_dw = 2*(-3)*2 = -12, dL_db = 2*(-3) = -6
print("Chain rule: L=(wx+b-y)²")
dL_dw, dL_db = mse_loss_gradients(1.0, 0.0, 2.0, 5.0)
print(f" At w=1,b=0,x=2,y=5: dL/dw={dL_dw}, dL/db={dL_db}")
Chain rule: like multiplying percentages. If sales drop 10% and prices rise 5%, the combined effect multiplies: (0.9)×(1.05). For derivatives: dL/dw = (dL/du)×(du/dw). Each link multiplies. Backprop applies this layer by layer through the network.
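The same multiplication of links works through a nonlinearity, which is the step neural networks add. A sketch with a sigmoid activation (the sigmoid and the sample values are illustrative, not from the notebook), verified numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# L = (sigmoid(w*x) - y)²: three chain-rule links
w, x, y = 0.5, 2.0, 1.0
z = w * x
a = sigmoid(z)
dL_da = 2 * (a - y)    # outer link: d/da of (a - y)²
da_dz = a * (1 - a)    # sigmoid derivative
dz_dw = x              # inner link
dL_dw = dL_da * da_dz * dz_dw

# Numerical check via central difference
h = 1e-6
num = ((sigmoid((w + h) * x) - y)**2 - (sigmoid((w - h) * x) - y)**2) / (2 * h)
print(f"chain rule: {dL_dw:.6f}, numerical: {num:.6f}")
```

Backpropagation is this pattern repeated: each layer computes its local derivative and multiplies it into the product flowing back from the loss.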
7. Capstone: Linear Regression with Gradient Descent¶
Model: ŷ = wx + b (one feature)
Loss: MSE = (1/n) Σ (ŷᵢ - yᵢ)²
Gradients (averaged over batch):
- ∂L/∂w = (2/n) Σ (ŷᵢ - yᵢ) · xᵢ
- ∂L/∂b = (2/n) Σ (ŷᵢ - yᵢ)
Linear regression in plain English. You have points (x, y) on a scatter plot. You want to fit a line ŷ = wx + b that best predicts y from x. "Best" means minimizing the average squared error—the MSE. You start with random w and b, compute how wrong the predictions are (the loss), compute the gradient (which way to adjust w and b to reduce the loss), take a small step, and repeat. Gradient descent finds the w and b that minimize MSE. It's the simplest form of "learning"—and the same idea scales to neural networks.
import numpy as np
def linear_regression_gd(X, y, learning_rate=0.01, epochs=1000):
"""Linear regression y ≈ w*X + b using gradient descent."""
X = np.array(X).flatten()
y = np.array(y).flatten()
n = len(X)
w, b = 0.0, 0.0
history = []
for _ in range(epochs):
pred = w * X + b
error = pred - y
# Gradients (MSE)
dw = (2 / n) * np.sum(error * X)
db = (2 / n) * np.sum(error)
w -= learning_rate * dw
b -= learning_rate * db
mse = np.mean(error ** 2)
history.append(mse)
return w, b, history
# Simple synthetic data: y = 2x + 1 + noise
np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 2 * X + 1 + np.random.randn(50) * 0.5
w, b, history = linear_regression_gd(X, y, learning_rate=0.01, epochs=500)
print("Linear regression: y ≈ wx + b")
print(f" True: y = 2x + 1")
print(f" Learned: y = {w:.3f}x + {b:.3f}")
print(f" Final MSE: {history[-1]:.6f}")
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X, y, alpha=0.6, label="Data")
x_line = np.linspace(0, 10, 100)
ax1.plot(x_line, w * x_line + b, "r-", lw=2, label=f"Fit: y={w:.2f}x+{b:.2f}")
ax1.set_xlabel("X")
ax1.set_ylabel("y")
ax1.set_title("Linear Regression Fit")
ax1.legend()
ax2.plot(history, color="green", alpha=0.8)
ax2.set_xlabel("Epoch")
ax2.set_ylabel("MSE")
ax2.set_title("Loss (MSE) over Training")
plt.tight_layout()
plt.show()
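One sanity check worth knowing (an addition here, not from the notebook): single-feature least squares also has a closed-form solution, so we can compare gradient descent against `np.polyfit`. A self-contained sketch regenerating the same data:

```python
import numpy as np

# Same synthetic data as above (seed 42)
np.random.seed(42)
X = np.linspace(0, 10, 50)
y = 2 * X + 1 + np.random.randn(50) * 0.5

# Gradient descent with the same update rule as linear_regression_gd
w, b = 0.0, 0.0
n = len(X)
for _ in range(500):
    error = w * X + b - y
    w -= 0.01 * (2 / n) * np.sum(error * X)
    b -= 0.01 * (2 / n) * np.sum(error)

# Closed-form least-squares fit (degree-1 polynomial)
w_cf, b_cf = np.polyfit(X, y, 1)
print(f"Gradient descent: w={w:.3f}, b={b:.3f}")
print(f"Closed form:      w={w_cf:.3f}, b={b_cf:.3f}")
```

The two should nearly agree; any remaining gap is convergence error, which shrinks with more epochs. For one feature the closed form is simpler, but gradient descent is what scales to models with no closed-form solution.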
8. Multi-Feature Linear Regression¶
Model: ŷ = Xw + b (vectorized). Each row of X is a sample, w is weight vector.
Gradients: ∂L/∂w = (2/n) Xᵀ(ŷ - y), ∂L/∂b = (2/n) Σ(ŷ - y)
What just happened¶
We fit ŷ = wx + b to synthetic data (y = 2x + 1 + noise). Gradient descent converged: the loss (MSE) decreased over epochs and the learned w and b are close to the true 2 and 1. The left plot shows the fitted line through the scatter; the right plot shows the loss decreasing—convergence. If the loss kept bouncing or increased, the learning rate would be too high. If it barely moved, it would be too low.
def linear_regression_multi(X, y, learning_rate=0.01, epochs=1000):
"""Multi-feature linear regression: y = Xw + b."""
X = np.array(X)
if X.ndim == 1:
X = X.reshape(-1, 1)
y = np.array(y).reshape(-1, 1)
n, d = X.shape
w = np.zeros((d, 1))
b = 0.0
for _ in range(epochs):
pred = X @ w + b
error = pred - y
dw = (2 / n) * (X.T @ error)
db = (2 / n) * np.sum(error)
w -= learning_rate * dw
b -= learning_rate * db
return w.flatten(), b
# y = 1 + 2*x1 + 3*x2 + noise
np.random.seed(42)
X_multi = np.random.randn(100, 2)
y_multi = 1 + 2 * X_multi[:, 0] + 3 * X_multi[:, 1] + np.random.randn(100) * 0.3
w_multi, b_multi = linear_regression_multi(X_multi, y_multi, epochs=2000)
print("Multi-feature regression: y = b + w1*x1 + w2*x2")
print(f" True: b=1, w1=2, w2=3")
print(f" Learned: b={b_multi:.3f}, w1={w_multi[0]:.3f}, w2={w_multi[1]:.3f}")
Chapter 3 Complete!¶
You've learned:
- Notebook 01: Vectors, dot product, norms, cosine similarity (pure Python)
- Notebook 02: Matrices, transpose, multiply, NumPy, images as tensors
- Notebook 03: Derivatives, gradients, chain rule, gradient descent, linear regression from scratch
This math underpins every neural network. Next: Chapter 4 — Probability & Statistics for ML.