Ch 7: Supervised Learning - Intermediate¶
Read online or run locally
You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-07-supervised-learning/notebooks/02_intermediate.ipynb in Jupyter.
Chapter 7: Supervised Learning - Regression & Classification¶
Notebook 02 - Intermediate: Classification¶
Classification predicts discrete labels. We cover logistic regression, decision trees, SVMs, and evaluation with ROC curves.
What you'll learn:
- Logistic regression from scratch
- Decision boundaries and tree structure
- Support Vector Machines and the kernel trick
- ROC curves, AUC, precision-recall
- Handling imbalanced classes
Time estimate: 3.5 hours
Try it yourself: Change the decision threshold from 0.5 to 0.3 or 0.7. See how precision and recall trade off.
Common mistakes: Using accuracy for imbalanced data, ignoring the precision-recall tradeoff, or not scaling features for SVM.
Generated by Berta AI | Created by Luigi Pascal Rondanini
In this notebook we move from regression to classification: predicting discrete labels (e.g., default vs. no default) instead of continuous values. We start with logistic regression—the workhorse of binary classification—then explore decision trees, SVMs, and evaluation tools like ROC curves. Imbalanced classes require special handling, which we demonstrate on a credit default dataset.
1. Logistic Regression From Scratch¶
Despite the name, logistic regression is a classification model, not a regression one. It passes a linear score through the sigmoid function, which squashes any real number into the interval (0, 1), so the output can be read as P(y=1|X): the chance the example belongs to the positive class. We then threshold at 0.5 to make a class decision: above 0.5 predict class 1, otherwise class 0.
Sigmoid: σ(z) = 1/(1+e^{-z}) — outputs a probability in (0, 1)
Update rule: gradient descent on cross-entropy loss
The weights are learned via gradient descent on the cross-entropy loss, which heavily penalizes confident wrong predictions.
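For reference, the loss being minimized and its gradient can be written out explicitly; they match the `err`-based update in the code below:

```latex
L(w, b) = -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big],
\qquad p_i = \sigma(x_i^\top w + b)
```

Differentiating gives \(\nabla_w L = \frac{1}{n} X^\top (p - y)\) and \(\partial L / \partial b = \frac{1}{n}\sum_i (p_i - y_i)\): the familiar "prediction minus target" form, which is exactly what the fit loop computes as `err`.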
Implementing logistic regression from scratch: We define the sigmoid, fit with gradient descent on cross-entropy loss, and evaluate on synthetic data.
import numpy as np
import matplotlib.pyplot as plt
class LogisticRegressionScratch:
def __init__(self, lr=0.1, epochs=1000):
self.lr = lr
self.epochs = epochs
self.w = None
self.b = None
def sigmoid(self, z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def fit(self, X, y):
X = np.asarray(X)
y = np.asarray(y).ravel()
if X.ndim == 1:
X = X.reshape(-1, 1)
n, p = X.shape
self.w = np.zeros(p)
self.b = 0.0
for _ in range(self.epochs):
logits = X @ self.w + self.b
probs = self.sigmoid(logits)
err = probs - y
self.w -= self.lr * (X.T @ err) / n
self.b -= self.lr * np.mean(err)
return self
def predict_proba(self, X):
X = np.asarray(X)
if X.ndim == 1:
X = X.reshape(-1, 1)
return self.sigmoid(X @ self.w + self.b)
def predict(self, X, thresh=0.5):
return (self.predict_proba(X) >= thresh).astype(int)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=2, n_redundant=0, n_informative=2, random_state=42)
logr = LogisticRegressionScratch(epochs=500)
logr.fit(X, y)
acc = np.mean(logr.predict(X) == y)
print(f'Accuracy (from scratch): {acc:.4f}')
What to observe: The LogisticRegressionScratch class implements the core logic: sigmoid for probabilities, gradient descent on cross-entropy for learning. The printed accuracy shows how well the model fits the training data. The learned weights define a linear decision boundary (a straight line in 2D), so logistic regression assumes the classes are roughly linearly separable; for more complex shapes we need other methods. We visualize the learned boundary next.
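As a quick sanity check (an addition, not part of the original notebook), we can compare the from-scratch model against sklearn's LogisticRegression on the same synthetic data; the two should reach similar training accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# same synthetic data as above
X, y = make_classification(n_samples=150, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# minimal re-run of the from-scratch gradient descent (lr=0.1, 500 epochs)
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    err = sigmoid(X @ w + b) - y
    w -= 0.1 * (X.T @ err) / len(y)
    b -= 0.1 * err.mean()

# sigmoid(z) >= 0.5 exactly when z >= 0, so we can threshold the logits directly
acc_scratch = np.mean(((X @ w + b) >= 0).astype(int) == y)
acc_sklearn = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
print(f'scratch: {acc_scratch:.3f}  sklearn: {acc_sklearn:.3f}')
```

sklearn adds L2 regularization by default, so the numbers need not match exactly, but they should be close.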
2. Decision Boundary Visualization¶
The decision boundary is the surface that separates the predicted classes in feature space: a line or curve in 2D, a hyperplane or manifold in higher dimensions. Imagine two neighborhoods divided by a road, one side blue, the other red. For logistic regression the boundary is where P(y=1|X) = 0.5, the set of points where the model is equally unsure; in 2D it is a straight line. To plot it, we create a dense grid of points, predict P(y=1) at each, and draw contours: the 0.5 contour is the decision boundary, and the color gradient shows confidence. This visualization reappears for each classifier, so compare how logistic regression, trees, and SVM draw different boundaries on the same data.
Run the cell below, then observe: The plot shows the decision boundary (black line) and a color gradient for P(y=1). Blue regions are confident class 0; red regions are confident class 1. The boundary is linear for logistic regression, and points near it are where the model is most uncertain.
def plot_decision_boundary(model, X, y, title='Decision Boundary'):
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu_r', levels=20)
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu_r', edgecolors='black', s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(title)
plt.colorbar(label='P(y=1)')
plt.tight_layout()
plt.show()
plot_decision_boundary(logr, X, y, 'Logistic Regression: Decision Boundary')
What to observe: The plot_decision_boundary function generalizes to any model with predict_proba. We reuse it later for other classifiers. The 0.5 contour is where the model is indifferent between the two classes.
3. Decision Trees¶
A decision tree is a flowchart for making predictions, like the game "20 Questions": it asks yes/no questions about the features (Is income > 50k? Is credit score < 600?) and follows branches until it reaches a leaf with a class prediction. Each internal node splits on one feature, so you can trace the exact path behind any prediction, which makes trees highly interpretable. The splits near the root show which features matter most; limiting depth (e.g., max_depth=3) prevents overfitting.
Trees split on features to create regions. Each leaf predicts a class. See assets/diagrams/decision_tree.svg for a loan approval example.
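Before plotting, it can help to read the tree's questions directly as text. A small sketch (not in the original notebook) using sklearn's export_text prints the learned rules:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# same synthetic data and depth-3 tree as the cell below
X, y = make_classification(n_samples=150, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# each indented line is one yes/no question; leaves end in "class: ..."
rules = export_text(dt, feature_names=['F1', 'F2'])
print(rules)
```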
Run the cell below, then observe: The left plot shows the tree structure—splits on F1 and F2 with the decision threshold at each node. The right plot shows the decision boundary as colored regions; trees produce step-like (axis-aligned) boundaries. Compare this to the smooth logistic regression boundary; trees can capture more complex, non-linear separations.
from sklearn.tree import DecisionTreeClassifier, plot_tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X, y)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plot_tree(dt, ax=axes[0], filled=True, feature_names=['F1', 'F2'], class_names=['0', '1'])
axes[0].set_title('Decision Tree Structure')
# Decision boundary
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = dt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
axes[1].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu_r')
axes[1].scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu_r', edgecolors='black', s=50)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Decision Tree: Colored Regions')
plt.tight_layout()
plt.show()
4. Support Vector Machines and the Kernel Trick¶
SVM finds the widest possible gap between classes, like drawing a road between two neighborhoods with maximum margin: the distance from the boundary to the nearest points of each class. Those nearest points are the support vectors. For data that is not linearly separable, the kernel trick implicitly maps features to a higher dimension where a linear boundary works. A linear kernel keeps the boundary straight and is faster; an RBF kernel allows curved boundaries but can overfit if gamma is too large. SVMs are also sensitive to feature magnitudes, so scale your features before fitting.
Run the cell below, then observe: The linear SVM (left) draws a straight boundary; the RBF SVM (right) can bend to fit the data. When the classes are not linearly separable, the RBF kernel often gives a better fit. In both cases the margin, the gap between the boundary and the nearest points, is maximized.
from sklearn.svm import SVC
# Linear SVM
svm_linear = SVC(kernel='linear', C=1.0).fit(X, y)
# RBF kernel - nonlinear boundary
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, model, title in [(axes[0], svm_linear, 'SVM Linear'), (axes[1], svm_rbf, 'SVM RBF Kernel')]:
h = 0.02
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu_r')
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu_r', edgecolors='black', s=50)
ax.set_title(title)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
plt.tight_layout()
plt.show()
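A fitted SVC exposes the support vectors it found; inspecting them (a small addition for illustration) shows how few training points actually pin down the boundary:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)

# n_support_ counts support vectors per class; support_vectors_ holds the points
frac = svm_rbf.support_vectors_.shape[0] / len(X)
print('support vectors per class:', svm_rbf.n_support_)
print(f'fraction of training points used as support vectors: {frac:.2f}')
```

Only these points determine the boundary; the rest of the training set could move (without crossing the margin) and the fit would not change.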
5. ROC Curves and AUC¶
ROC asks: how well does the model distinguish the classes? The ROC curve plots True Positive Rate (y-axis: the fraction of actual positives we catch) against False Positive Rate (x-axis: the fraction of negatives we wrongly flag) as the decision threshold varies, tracing a curve from (0,0) to (1,1). A curve that hugs the top-left corner means the model separates the classes well. AUC (Area Under the Curve) summarizes this across all thresholds: 1.0 is perfect, 0.5 is random guessing, and because it is threshold-independent it measures pure ranking quality. The precision-recall tradeoff means you cannot have it all: lowering the threshold catches more true positives but also raises false alarms, while raising it does the opposite.
ROC: True Positive Rate vs False Positive Rate at different thresholds. AUC: Area under ROC curve.
from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
logr_sk = LogisticRegression(max_iter=1000)
logr_sk.fit(X_train, y_train)
y_proba = logr_sk.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(fpr, tpr, 'darkorange', lw=2, label=f'AUC = {roc_auc:.3f}')
axes[0].plot([0, 1], [0, 1], 'navy', lw=2, linestyle='--')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve')
axes[0].legend()
axes[0].grid(alpha=0.3)
prec, rec, _ = precision_recall_curve(y_test, y_proba)
axes[1].plot(rec, prec, 'darkgreen', lw=2)
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
What to observe: The ROC curve (left) shows how TPR vs FPR changes as we vary the threshold; the dashed line is random. AUC close to 1 means strong discrimination. The precision-recall curve (right) is often more informative for imbalanced data—precision drops as we lower the threshold to catch more positives.
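The Try-it-yourself prompt at the top of the notebook suggests moving the threshold away from 0.5; here is a self-contained sketch of that experiment (regenerating the synthetic split rather than reusing the variables above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=150, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
proba = LogisticRegression(max_iter=1000).fit(X_train, y_train) \
                                         .predict_proba(X_test)[:, 1]

# sweep the decision threshold and record (precision, recall) at each
results = {}
for thresh in (0.3, 0.5, 0.7):
    pred = (proba >= thresh).astype(int)
    results[thresh] = (precision_score(y_test, pred, zero_division=0),
                       recall_score(y_test, pred))
    print(f'thresh={thresh}: precision={results[thresh][0]:.2f}, '
          f'recall={results[thresh][1]:.2f}')
```

Lowering the threshold can only keep or increase recall (every point predicted positive at 0.7 is also predicted positive at 0.3), typically at the cost of precision.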
When most examples are one class (e.g., 99% no default), a model that always predicts the majority class gets 99% accuracy—but catches zero defaults. That is the "99% accuracy trap": accuracy is misleading with imbalanced data. We care about the rare class (defaults, fraud, disease), so we need metrics that focus on it: recall, precision, F1, and ROC-AUC. Always inspect class balance and use stratified splits so the rare class appears in train and test.
6. Imbalanced Classes: Credit Default¶
When 99% of emails are not spam, a model that says "not spam" every time is 99% accurate but useless: imbalanced classes break accuracy. Use class_weight='balanced', oversampling, or focus on recall/AUC. For credit default, we care about catching defaults without too many false declines.
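The accuracy trap is easy to demonstrate on synthetic data before loading the credit dataset (a sketch; the class weights and real dataset below are the actual exercise):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score

# roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=0)
always_majority = np.zeros_like(y)  # predict "no default" for everyone

acc = accuracy_score(y, always_majority)
rec = recall_score(y, always_majority)
print(f'accuracy: {acc:.2%}, recall on the rare class: {rec:.0%}')
```

High accuracy, zero recall: the "model" never catches a single positive, which is exactly the failure mode accuracy hides.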
import pandas as pd
from pathlib import Path
df = pd.read_csv(Path('..') / 'datasets' / 'credit.csv')
X = df.drop(columns=['default']).values
y = df['default'].values
print(f'Default rate: {y.mean():.2%}')
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Without class weight
lr_default = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
pred_default = lr_default.predict(X_test_s)
# With class_weight='balanced'
lr_balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_train_s, y_train)
pred_balanced = lr_balanced.predict(X_test_s)
print('\nDefault (no weights):')
print(classification_report(y_test, pred_default, target_names=['No default', 'Default']))
print('Balanced class weights:')
print(classification_report(y_test, pred_balanced, target_names=['No default', 'Default']))
What to observe: Compare the two classification reports. Without class_weight='balanced', the model often predicts "No default" for nearly everyone—high accuracy but zero or low recall on defaults. With balanced weights, recall on the Default class improves; we catch more actual defaults at the cost of some extra false positives. The default rate printed shows how imbalanced the data is.
Common mistake: Relying on accuracy with imbalanced data. A model that always predicts the majority class can achieve 99% accuracy while being useless for the task. Always use precision, recall, F1, and ROC-AUC when the positive class is rare.
Interactive: Predict default risk¶
Enter applicant features and get probability of default.
# Prediction prompt - customer classification
age = 35
income = 50000
debt_ratio = 0.25
credit_score = 720
employment_years = 8
applicant = scaler.transform([[age, income, debt_ratio, credit_score, employment_years]])
prob_default = lr_balanced.predict_proba(applicant)[0, 1]
print(f'Applicant: age={age}, income=${income}, debt_ratio={debt_ratio}, credit_score={credit_score}')
print(f'Predicted default probability: {prob_default:.1%}')
print(f'Recommendation: {"Approve" if prob_default < 0.3 else "Review" if prob_default < 0.6 else "Decline"}')
Try it yourself: Modify the applicant features above (age, income, debt_ratio, credit_score, employment_years) and re-run the cell. Try a low credit score and high debt ratio—see how the default probability and recommendation change. Experiment with different thresholds in the recommendation logic.
7. Learning Curves¶
Plot train vs. validation score against training set size to diagnose fit: if the validation score is still improving as data grows, more data will help; if it plateaus while the train score stays high, the model may need more capacity; a large, persistent gap between the curves signals overfitting.
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
LogisticRegression(max_iter=1000), X_train_s, y_train, cv=5,
train_sizes=np.linspace(0.2, 1.0, 10)
)
plt.figure(figsize=(8, 5))
plt.plot(train_sizes, train_scores.mean(axis=1), 'b-o', label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), 'r-o', label='Validation')
plt.fill_between(train_sizes, train_scores.mean(axis=1) - train_scores.std(axis=1),
train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.2)
plt.fill_between(train_sizes, val_scores.mean(axis=1) - val_scores.std(axis=1),
val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.2)
plt.xlabel('Training set size')
plt.ylabel('Accuracy')
plt.title('Learning Curves: Credit Default (Logistic Regression)')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
What to observe: Training accuracy (blue) is typically higher than validation (red). If they converge and stay close, the model has enough data and is not overfitting. A large gap suggests overfitting; a flat validation curve with low scores suggests underfitting or need for more data. The shaded regions show cross-validation variance.