Ch 7: Supervised Learning - Introduction

Track: Practitioner | Try code in Playground | Back to chapter overview

Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-07-supervised-learning/notebooks/01_introduction.ipynb in Jupyter.


Chapter 7: Supervised Learning - Regression & Classification

Notebook 01 - Introduction: Regression

Regression predicts continuous values. We start with classic linear regression and build up to polynomial and regularized variants.

What you'll learn:

- Linear regression from scratch with NumPy
- Multiple and polynomial regression
- Overfitting and regularization (Ridge, Lasso)
- Scikit-learn interface: fit, predict, score

Time estimate: 3 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

1. Linear Regression: Theory

Linear regression is finding the line of best fit. Given points (x, y), we want the line that minimizes the sum of squared errors. Real example: bigger house = higher price. The line captures how much price increases per extra sqft.

Goal: Predict y from X using \(y = X\beta + \varepsilon\)

Closed-form (normal equation): \(\hat{\beta} = (X^\top X)^{-1} X^\top y\)

Gradient descent: Minimize the mean squared error (MSE) by repeatedly stepping the parameters opposite its gradient.
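The normal equation is implemented in the next cell; gradient descent is not shown elsewhere in this notebook, so here is a minimal sketch on the same kind of synthetic data (the learning rate and iteration count are untuned choices, not recommended values):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 80)
y = 2.5 * X + 1.5 + rng.normal(size=80) * 2

w, b = 0.0, 0.0   # slope and intercept, both starting at zero
lr = 0.01         # learning rate (untuned)
for _ in range(5000):
    err = (w * X + b) - y
    # Gradients of MSE = mean((w*x + b - y)^2) w.r.t. w and b
    grad_w = 2 * np.mean(err * X)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(f'w ≈ {w:.2f}, b ≈ {b:.2f}')  # should approach the true 2.5 and 1.5
```

For this small problem the closed form is simpler; gradient descent becomes the practical choice when the data or feature count is too large to form \(X^\top X\).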

Implementing linear regression: We generate synthetic data and fit using the normal equation. The plot shows the data and fitted line.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 80
X = np.random.uniform(0, 10, n)
y = 2.5 * X + 1.5 + np.random.randn(n) * 2

# Linear regression from scratch: solve the normal equations (via lstsq for numerical stability)
X_b = np.c_[np.ones((n, 1)), X]
beta = np.linalg.lstsq(X_b.T @ X_b, X_b.T @ y, rcond=None)[0]
b, w = beta[0], beta[1]
y_pred = X_b @ beta

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X, y, alpha=0.7, label='Data')
ax.plot(X, y_pred, 'r-', lw=2, label=f'ŷ = {w:.2f}x + {b:.2f}')
ax.set_xlabel('Feature X')
ax.set_ylabel('Target y')
ax.set_title('Linear Regression: Data + Fitted Line')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

What just happened: We solved the normal equations for the intercept and slope, then plotted the fit. Among all straight lines, the red one minimizes the sum of squared vertical distances to the points.

2. Residual Plot

Residuals, \(y - \hat{y}\), are the vertical distances from each point to the line. Good fits have residuals centered around 0 with no pattern. A curve in the residual plot suggests the relationship is nonlinear; variance that grows with the predicted value (heteroscedasticity) suggests transforming the target or using weighted least squares.

residuals = y - y_pred

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(y_pred, residuals, alpha=0.7)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Predicted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Predicted')
axes[1].hist(residuals, bins=15, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.show()

What just happened: Residuals vs predicted (left) and residual distribution (right). Random scatter around 0 indicates a good fit.

3. Multiple Linear Regression - Housing Prices

Now the price depends on size AND location AND age... Multiple regression adds more features. Each feature gets a coefficient, and the model learns them from data. Real-world: Predict house price from sqft, bedrooms, bathrooms, age, location_score.

Loading housing data: Below we load the dataset, split it into train and test sets, scale the features, and fit with sklearn's LinearRegression. Each coefficient is the predicted price change per unit of that feature (holding the others constant).

import pandas as pd
from pathlib import Path

df = pd.read_csv(Path('..') / 'datasets' / 'housing.csv')
X = df[['sqft', 'bedrooms', 'bathrooms', 'age', 'location_score']].values
y = df['price'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

lr = LinearRegression()
lr.fit(X_train_s, y_train)
r2 = lr.score(X_test_s, y_test)
print(f'R² on test set: {r2:.4f}')
feat_names = ['sqft', 'bedrooms', 'bathrooms', 'age', 'location_score']
print('Coefficients:', dict(zip(feat_names, lr.coef_.round(2))))

Interactive: Predict a house price

Try different values below to see the predicted price.

# Prediction prompt - customize these values
sqft = 2000
bedrooms = 3
bathrooms = 2.0
age = 10
location_score = 7.5

new_house = scaler.transform([[sqft, bedrooms, bathrooms, age, location_score]])
pred_price = lr.predict(new_house)[0]
print(f'Predicted price for {sqft} sqft, {bedrooms} bed, {bathrooms} bath, age {age}: ${pred_price:,.0f}')

4. Polynomial Regression and Overfitting

Sometimes the relationship is not a straight line. Polynomial features (x, x^2, x^3, ...) let a linear model capture curves. But too high a degree overfits: the curve passes near every training point yet wobbles wildly between them, so it cannot predict new points.

Visualizing overfitting: Below we fit degree-1, degree-3, and degree-12 polynomials to noisy sine data. The degree-12 curve chases the noise - classic overfitting.

from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
X = np.linspace(0, 4*np.pi, 60)
y = np.sin(X) + np.random.randn(60) * 0.3
X = X.reshape(-1, 1)

degrees = [1, 3, 12]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
X_plot = np.linspace(0, 4*np.pi, 200).reshape(-1, 1)

for i, deg in enumerate(degrees):
    poly = PolynomialFeatures(degree=deg)
    X_poly = poly.fit_transform(X)
    lr = LinearRegression().fit(X_poly, y)
    X_plot_poly = poly.transform(X_plot)
    y_plot = lr.predict(X_plot_poly)
    axes[i].scatter(X, y, alpha=0.6)
    axes[i].plot(X_plot, y_plot, 'r-', lw=2)
    axes[i].set_title(f'Degree {deg}' + (' (overfit!)' if deg == 12 else ''))
    axes[i].set_xlabel('X')
plt.tight_layout()
plt.show()
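The plots show overfitting visually; to put numbers on it, hold out a test set and compare train versus test MSE per degree. A sketch (the 70/30 split, seeds, and rescaling of X to [0, 1] for numerical conditioning are our choices, not from the notebook): train error can only fall as degree grows, while held-out error reveals whether the extra flexibility is fitting signal or memorizing noise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.linspace(0, 4 * np.pi, 60).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(size=60) * 0.3
X = X / X.max()  # rescale to [0, 1] so high-degree powers stay well conditioned

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

train_mse, test_mse = {}, {}
for deg in (1, 3, 12):
    poly = PolynomialFeatures(degree=deg)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    train_mse[deg] = mean_squared_error(y_tr, model.predict(poly.transform(X_tr)))
    test_mse[deg] = mean_squared_error(y_te, model.predict(poly.transform(X_te)))
    print(f'degree {deg:2d}: train MSE = {train_mse[deg]:.3f}, test MSE = {test_mse[deg]:.3f}')
```

The gap between the two columns, not the train MSE alone, is what diagnoses overfitting.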

5. Regularization: Ridge (L2) and Lasso (L1)

Ridge (L2): Penalizes large weights - shrinks them toward zero. Use when many features matter.

Lasso (L1): Can set weights to exactly zero - automatic feature selection. Use when you want a sparse model.

When to use which: Need sparse features? Lasso. Many correlated features? Ridge.

What the plots below show: Ridge shrinks all coefficients smoothly toward zero as α grows, while Lasso drives some exactly to zero - automatic feature selection. Ridge and Lasso often generalize better than OLS when there are many or correlated features. Try different alpha values to see the effect.

Choosing a regression approach (Mermaid flowchart):

flowchart TD
    A[Need regression?] --> B{Target continuous?}
    B -->|Yes| C[Regression]
    B -->|No| D[Classification - see notebook 02]
    C --> E{Linear relationship?}
    E -->|Yes| F[Linear Regression]
    E -->|No| G[Polynomial Regression]
    F --> H{Many features / overfitting?}
    G --> H
    H -->|Yes| I{Ridge or Lasso?}
    H -->|No| J[Use OLS]
    I -->|Shrink all| K[Ridge L2]
    I -->|Feature selection| L[Lasso L1]

from sklearn.linear_model import Ridge, Lasso

alphas = np.logspace(-4, 2, 50)
coefs_ridge = []
coefs_lasso = []
for a in alphas:
    ridge = Ridge(alpha=a).fit(X_train_s, y_train)
    lasso = Lasso(alpha=a, max_iter=10_000).fit(X_train_s, y_train)  # small alphas need more iterations to converge
    coefs_ridge.append(ridge.coef_)
    coefs_lasso.append(lasso.coef_)

coefs_ridge = np.array(coefs_ridge)
coefs_lasso = np.array(coefs_lasso)
feat_names = ['sqft', 'bedrooms', 'bathrooms', 'age', 'location_score']

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for i, name in enumerate(feat_names):
    axes[0].plot(alphas, coefs_ridge[:, i], label=name, lw=2)
    axes[1].plot(alphas, coefs_lasso[:, i], label=name, lw=2)
axes[0].set_xscale('log')
axes[0].set_xlabel('α (regularization strength)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Ridge (L2): Coefficient Shrinkage')
axes[0].legend()
axes[1].set_xscale('log')
axes[1].set_xlabel('α (regularization strength)')
axes[1].set_ylabel('Coefficient')
axes[1].set_title('Lasso (L1): Feature Selection (→0)')
axes[1].legend()
plt.tight_layout()
plt.show()
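Rather than eyeballing the paths above, α is normally chosen by cross-validation; scikit-learn's RidgeCV and LassoCV run this search internally. A sketch on synthetic data (the data shape, true coefficients, and alpha grid are our assumptions, made so the block runs standalone):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(42)
n, p = 200, 8
X = rng.normal(size=(n, p))
# Only the first three features actually matter; the rest are noise
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(size=n) * 0.5

alphas = np.logspace(-4, 2, 50)
ridge = RidgeCV(alphas=alphas).fit(X, y)          # leave-one-out CV by default
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)    # 5-fold CV

print(f'Ridge chose alpha = {ridge.alpha_:.4g}')
print(f'Lasso chose alpha = {lasso.alpha_:.4g}')
print('Features Lasso zeroed:', np.where(np.isclose(lasso.coef_, 0)).__getitem__(0).tolist())
```

With a CV-chosen α, the sparsity Lasso reports comes from the data rather than from a hand-picked penalty.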

6. Scikit-learn Interface

Scikit-learn is the most popular ML library in Python. Every estimator follows the same pattern: model.fit(X, y) to train, model.predict(X) to predict, and model.score(X, y) to evaluate (R² for regressors, accuracy for classifiers). Once you learn this pattern, you can use hundreds of models.

# Compare OLS, Ridge, Lasso on housing
from sklearn.linear_model import LinearRegression, Ridge, Lasso

models = [
    ('OLS', LinearRegression()),
    ('Ridge', Ridge(alpha=1.0)),
    ('Lasso', Lasso(alpha=0.1))
]
for name, model in models:
    model.fit(X_train_s, y_train)
    print(f'{name:8} R² = {model.score(X_test_s, y_test):.4f}')
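Because every estimator shares fit/predict/score, preprocessing and model can also be chained into one object with a Pipeline, which scales inside fit and predict so the test set never leaks into the scaler. A sketch (synthetic data stands in for the housing set so the block runs standalone):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(loc=5, scale=2, size=(150, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=150) * 0.3

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One object: scaling happens automatically inside fit, predict, and score
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)
print(f'Pipeline R² on test set: {model.score(X_te, y_te):.4f}')
```

The pipeline itself follows the same fit/predict/score interface, so it drops into any code written against a plain estimator.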


