
Ch 6: Introduction to Machine Learning - Intermediate


Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-06-intro-machine-learning/notebooks/02_intermediate.ipynb in Jupyter.


Chapter 6: Introduction to Machine Learning

Notebook 02 - Intermediate

Build on the basics: feature engineering, proper data splits, cross-validation, and robust evaluation.

What you'll learn:

  • Feature engineering: creating useful features from raw data
  • Training, validation, and test sets
  • Cross-validation with visualization
  • Evaluation metrics: accuracy, precision, recall, F1, confusion matrix
  • Bias-variance tradeoff, overfitting, and regularization

Time estimate: 3 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

1. Feature Engineering

Feature engineering is like choosing the right ingredients for your recipe. The model can only "see" what you give it—the features are the model's view of the world. If you feed it raw sqft, it learns from sqft. If you also add sqft², it can capture curved relationships. Bad features = bad predictions, no matter how fancy your algorithm.

Common techniques:

  • Binning continuous variables (e.g., age into age groups)
  • Polynomial features (sqft, sqft²) for curved relationships
  • Encoding categorical variables (Male/Female → 0/1)
  • Normalization/scaling so features are on similar scales

What do you think will happen? If we use raw sqft and also sqft², will the model fit curved relationships better?

Creating curved data: We generate house data where price has a slight curve—bigger houses gain more value per sqft. This lets us compare linear vs polynomial features.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Generate data with a slight curve: price increases more at higher sqft
np.random.seed(42)
sqft = np.random.uniform(800, 3500, 120)
noise = np.random.normal(0, 25000, 120)
price = 150 * sqft + 0.02 * (sqft ** 2) + 40000 + noise

X = sqft.reshape(-1, 1)
y = price

What just happened: We created 120 houses with a curved price relationship. The data is ready for linear vs polynomial comparison.

Comparing linear vs polynomial: Below we add sqft² as a feature and fit both models. The polynomial model can capture the curve.

# Feature engineering: add polynomial terms
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Compare: linear vs polynomial features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_poly_train, X_poly_test = poly.fit_transform(X_train), poly.transform(X_test)

lr_linear = LinearRegression().fit(X_train, y_train)
lr_poly = LinearRegression().fit(X_poly_train, y_train)

mse_linear = mean_squared_error(y_test, lr_linear.predict(X_test))
mse_poly = mean_squared_error(y_test, lr_poly.predict(X_poly_test))

print(f"Linear features - Test MSE: ${mse_linear:,.0f}")
print(f"Polynomial features - Test MSE: ${mse_poly:,.0f}")

What just happened: The polynomial model typically achieves lower test MSE because our data has a curved relationship. Adding the right feature made a real difference.

Visualizing the fits: Side-by-side: linear (straight line) vs polynomial (curve).

# Plot 1: Linear vs polynomial fit
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
x_plot = np.linspace(sqft.min(), sqft.max(), 200).reshape(-1, 1)

axes[0].scatter(X_train, y_train, alpha=0.6, c='steelblue', label='Train')
axes[0].scatter(X_test, y_test, alpha=0.8, c='coral', marker='s', label='Test')
axes[0].plot(x_plot, lr_linear.predict(x_plot), 'r-', lw=2, label='Linear fit')
axes[0].set_xlabel('Square Feet')
axes[0].set_ylabel('Price ($)')
axes[0].set_title('Linear Model')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

x_poly_plot = poly.transform(x_plot)
axes[1].scatter(X_train, y_train, alpha=0.6, c='steelblue', label='Train')
axes[1].scatter(X_test, y_test, alpha=0.8, c='coral', marker='s', label='Test')
axes[1].plot(x_plot, lr_poly.predict(x_poly_plot), 'g-', lw=2, label='Polynomial fit')
axes[1].set_xlabel('Square Feet')
axes[1].set_ylabel('Price ($)')
axes[1].set_title('Polynomial (degree=2) Model')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What just happened: The polynomial curve bends to follow the data; the linear line oversimplifies. The plot shows why feature engineering matters.

2. Train / Validation / Test Sets

Why three sets instead of two?

  • Train: Fit the model. This is where learning happens.
  • Validation: Tune hyperparameters, choose between models, try different settings. You can "peek" at validation performance many times during development.
  • Test: Final evaluation only—used once at the very end. Never tune on the test set.

Using a separate validation set prevents "leaking" test information into model selection. If you tune on the test set, you're effectively teaching to the test—and your reported performance will be overly optimistic.
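A common way to get all three sets is two chained `train_test_split` calls (a sketch on dummy data; the 60/20/20 ratio is one reasonable choice, not the only one):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split off the test set (20%), then carve a validation set
# out of the remainder (25% of the remaining 80% = 20% of the total).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```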

3. Cross-Validation

Instead of one exam, take five different exams and average your score. That's K-fold cross-validation. We split the data into K chunks (e.g., 5). Each chunk takes a turn being the "validation" set while we train on the other K-1. We get K different scores and average them. This gives a more reliable estimate than a single random split—we use all the data for both training and evaluation, just not at the same time.

K-fold CV: Split data into K folds. Train on K-1, validate on 1. Rotate. Average scores.
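The rotation can be made explicit by looping over the folds ourselves (a sketch on synthetic data; `cross_val_score`, used later in this section, does the same thing in one call):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (50, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 50)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Each fold: train on K-1 chunks, validate on the held-out chunk
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print(len(fold_mse))  # 5 scores, one per fold
print(round(float(np.mean(fold_mse)), 2))
```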

What do you think will happen? With 5-fold CV, will each fold give exactly the same score? Why or why not?

Running 5-fold CV: Below we train 5 different models (each on 80% of the data) and get 5 validation scores.

from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_poly_train, y_train, cv=kf, scoring='neg_mean_squared_error')
scores = -scores  # neg MSE -> MSE

print("Cross-validation MSE per fold:")
for i, s in enumerate(scores):
    print(f"  Fold {i+1}: ${s:,.0f}")
print(f"Mean: ${scores.mean():,.0f}, Std: ${scores.std():,.0f}")

What just happened: Each fold had a different train/validation split, so scores vary slightly. The mean and standard deviation tell us how stable the model is across different data subsets.

Visualizing fold performance: The bar chart shows MSE for each of the 5 folds.

Plotting CV results: Below we visualize each fold's MSE.

# Plot 2: Cross-validation fold performance
plt.figure(figsize=(8, 4))
plt.bar(range(1, 6), scores, color='steelblue', edgecolor='navy')
plt.axhline(y=scores.mean(), color='red', linestyle='--', label=f'Mean: ${scores.mean():,.0f}')
plt.xlabel('Fold')
plt.ylabel('MSE ($²)')
plt.title('5-Fold Cross-Validation: MSE per Fold')
plt.legend()
plt.xticks(range(1, 6))
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

What just happened: Bars show each fold's MSE. The red line is the average. Variability across folds reflects how sensitive the model is to which data it sees.

4. Evaluation Metrics for Classification

For classification (e.g., churn yes/no, spam/not spam), one number isn't enough. Here's each metric with a real scenario:

  • Accuracy: What percentage did the model get right? (TP + TN) / total. Simple, but misleading when classes are imbalanced—a spam filter that says "not spam" to everything is 99% accurate if 99% of emails aren't spam, but it's useless.

  • Precision: When the model says YES, how often is it right? TP / (TP + FP). For a spam filter: of the emails we put in the spam folder, how many were actually spam? High precision = few false alarms.

  • Recall: Of all the actual positives, how many did the model find? TP / (TP + FN). For cancer screening: of all people who have cancer, how many did we catch? High recall = we don't miss the important cases.

  • F1: The balance between precision and recall. Harmonic mean. Use when you need both—catching all defaults (recall) without too many false alarms (precision).

See assets/diagrams/model_evaluation.svg for the confusion matrix.
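The four formulas can be checked by hand on hypothetical confusion-matrix counts (the numbers below are made up for illustration):

```python
# Hypothetical counts: true positives, false positives, false negatives, true negatives
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)     # fraction correct overall
precision = TP / (TP + FP)                      # of predicted YES, how many were right
recall    = TP / (TP + FN)                      # of actual YES, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy)        # 0.85
print(precision)       # 0.8
print(round(recall, 3))  # 0.889
print(round(f1, 3))      # 0.842
```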

Classification example: We turn our regression task into a binary one (price above/below median) and compute all metrics.

# Simulate classification: predict if price > median
from sklearn.linear_model import LogisticRegression

y_class = (y > np.median(y)).astype(int)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_poly, y_class, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train_c, y_train_c)
y_pred = clf.predict(X_test_c)

cm = confusion_matrix(y_test_c, y_pred)
print("Confusion Matrix:")
print("              Predicted")
print("              Neg    Pos")
print(f"Actual Neg   {cm[0,0]:4d}  {cm[0,1]:4d}")
print(f"Actual Pos   {cm[1,0]:4d}  {cm[1,1]:4d}")
print()
print(f"Accuracy:  {accuracy_score(y_test_c, y_pred):.3f}")
print(f"Precision: {precision_score(y_test_c, y_pred, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_test_c, y_pred, zero_division=0):.3f}")
print(f"F1:        {f1_score(y_test_c, y_pred, zero_division=0):.3f}")

What just happened: The confusion matrix shows Actual vs Predicted. Diagonal = correct. Off-diagonal = errors. The metrics summarize this in different ways.

Confusion matrix heatmap: Visualizing where the model succeeds and fails.

# Plot 3: Confusion matrix heatmap
fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(cm, cmap='Blues')
ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xticklabels(['Pred Neg', 'Pred Pos'])
ax.set_yticklabels(['Act Neg', 'Act Pos'])
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', fontsize=16)
plt.colorbar(im, ax=ax, label='Count')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

What just happened: Darker cells = more samples. A good model has strong diagonal; a bad one has scattered off-diagonal counts.

5. Bias-Variance Tradeoff

Underfitting is like memorizing nothing—you can't answer any question. Overfitting is like memorizing the textbook including the typos—you ace the practice problems but fail on new ones.

  • Underfitting (high bias): Model too simple. Misses real patterns. Train and test error both high.
  • Good fit: Balanced. Captures the signal, ignores the noise.
  • Overfitting (high variance): Model too complex. Fits the noise. Train error low, test error high.

See assets/diagrams/bias_variance.svg for a visual.

What do you think will happen? As we increase polynomial degree, will test error decrease indefinitely or eventually increase?

Sweeping polynomial degree: We fit models from degree 1 to 11 and plot train vs test MSE.

# Plot 4: Train vs Test error vs polynomial degree (bias-variance)
degrees = range(1, 12)
train_errors, test_errors = [], []

for d in degrees:
    poly_d = PolynomialFeatures(degree=d, include_bias=False)
    X_tr = poly_d.fit_transform(X_train)
    X_te = poly_d.transform(X_test)
    model = LinearRegression().fit(X_tr, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_tr)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_te)))

plt.figure(figsize=(8, 5))
plt.plot(degrees, train_errors, 'b-o', label='Train MSE')
plt.plot(degrees, test_errors, 'r-s', label='Test MSE')
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.title('Bias-Variance: Train vs Test Error')
plt.legend()
plt.xticks(degrees)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

What just happened: Train error keeps dropping (model fits data better) but test error eventually rises (overfitting). The gap between train and test tells the story.

6. Overfitting and Regularization

Regularization is putting the model on a diet. It penalizes complex models so they don't memorize the training data. Ridge (L2) adds \(\lambda \sum w^2\) to the loss—it shrinks all weights toward zero. Lasso (L1) can set some weights to exactly zero, doing automatic feature selection.

With Ridge, we can use high-degree polynomials without overfitting. The penalty keeps the curve smooth.
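The difference between L1 and L2 penalties shows up directly in the learned weights. A sketch on synthetic data (the `alpha` values are illustrative; features are standardized so the penalty treats them evenly):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, (80, 1))
y = 1.5 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 0.3, 80)

# Degree-8 polynomial features, standardized before penalization
X = StandardScaler().fit_transform(
    PolynomialFeatures(degree=8, include_bias=False).fit_transform(x))

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

# Ridge shrinks all weights toward zero; Lasso can zero some out entirely
print(np.sum(np.abs(ridge.coef_) < 1e-6))  # exact zeros under Ridge (typically 0)
print(np.sum(np.abs(lasso.coef_) < 1e-6))  # exact zeros under Lasso (typically several)
```

Because Lasso zeroes out the higher-order terms the data doesn't need, it doubles as automatic feature selection.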

Comparing raw vs regularized: Degree-10 polynomial without regularization (wiggly, overfits) vs with Ridge (smooth, generalizes).

# Plot 5: Regularization effect - fit comparison
poly_high = PolynomialFeatures(degree=10, include_bias=False)
X_high_train = poly_high.fit_transform(X_train)
X_high_test = poly_high.transform(X_test)

lr_raw = LinearRegression().fit(X_high_train, y_train)
ridge = Ridge(alpha=1e5).fit(X_high_train, y_train)

x_plot = np.linspace(sqft.min(), sqft.max(), 300).reshape(-1, 1)
x_plot_poly = poly_high.transform(x_plot)

plt.figure(figsize=(10, 5))
plt.scatter(X_train, y_train, alpha=0.6, c='steelblue', label='Train')
plt.scatter(X_test, y_test, alpha=0.8, c='coral', marker='s', label='Test')
plt.plot(x_plot, lr_raw.predict(x_plot_poly), 'r-', lw=2, label='Linear (deg=10, overfitting)')
plt.plot(x_plot, ridge.predict(x_plot_poly), 'g-', lw=2, label='Ridge (deg=10, regularized)')
plt.xlabel('Square Feet')
plt.ylabel('Price ($)')
plt.title('Overfitting vs Regularization (Ridge)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Linear (deg=10) Test MSE: ${mean_squared_error(y_test, lr_raw.predict(X_high_test)):,.0f}")
print(f"Ridge (deg=10) Test MSE:  ${mean_squared_error(y_test, ridge.predict(X_high_test)):,.0f}")

What just happened: The red curve wiggles through every training point; the green curve captures the trend. Ridge's lower test MSE shows it generalizes better.

Summary

  • Feature engineering creates better inputs (polynomials, encodings, scaling)
  • Train/Val/Test and cross-validation ensure robust evaluation
  • Metrics: Accuracy, Precision, Recall, F1, Confusion Matrix
  • Bias-variance: Simple models underfit, complex models overfit
  • Regularization (Ridge, Lasso) reduces overfitting

Next: End-to-end capstone—customer churn prediction.



