Ch 6: Introduction to Machine Learning - Introduction¶
Read online or run locally
You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-06-intro-machine-learning/notebooks/01_introduction.ipynb in Jupyter.
Chapter 6: Introduction to Machine Learning¶
Notebook 01 - Introduction¶
Machine learning is the art of teaching computers to learn from data—without being explicitly programmed for every rule.
What you'll learn: - What is machine learning? (learning from data vs explicit programming) - Types: supervised, unsupervised, reinforcement learning - The ML pipeline end-to-end - Your first ML model: predict house prices with linear regression - Train/test split: why it matters
Time estimate: 2.5 hours
Generated by Berta AI | Created by Luigi Pascal Rondanini
1. What is Machine Learning?¶
Machine learning is teaching computers to learn from examples instead of explicit rules. Think of it this way: you don't program a child to recognize a cat by writing thousands of "if pixel 23 is orange and pixel 24 is black then maybe ear" rules. Instead, you show the child many pictures and say "this is a cat, this isn't." The child learns the pattern. ML does the same for computers.
Traditional programming: You write rules → Computer applies them → Output. You must anticipate every case.
Machine learning: You provide data + desired output → Computer learns rules → Predictions on new data. The algorithm discovers patterns from examples.
Why does this matter? Many real-world problems—recognizing faces, predicting stock moves, detecting fraud—are too complex for hand-crafted rules. The space of possibilities is enormous. But if we have enough examples, the computer can find the underlying structure. That's the power of ML.
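To make the contrast concrete, here is a minimal sketch (all names and data invented for illustration): a hand-written spam rule next to a "learned" rule, where the computer picks the best threshold from labeled examples instead of us guessing it.

```python
import numpy as np

# Traditional programming: a hand-written rule for "is this email spam?"
def spam_rule(num_links, has_free_word):
    # We must anticipate every case ourselves
    return num_links > 3 or has_free_word

# Machine learning: learn the rule from labeled examples instead
num_links = np.array([0, 1, 2, 5, 7, 8, 1, 6])   # feature
is_spam   = np.array([0, 0, 0, 1, 1, 1, 0, 1])   # label (the "answer key")

# Learn the simplest possible rule: the threshold that best separates the classes
candidates = np.arange(num_links.min(), num_links.max() + 1)
accuracies = [np.mean((num_links > t) == is_spam) for t in candidates]
best_t = candidates[int(np.argmax(accuracies))]
print(f"Learned rule: spam if num_links > {best_t}")
```

Same idea, different author: in the first function *we* wrote the threshold; in the second, the data did.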
2. Types of Machine Learning¶
The three main branches of ML differ in what the model learns from. Supervised vs unsupervised is the key split—here's a simple analogy: Studying with an answer key (supervised) vs studying without one (unsupervised). With the answer key, you can check your work and improve. Without it, you can only look for patterns in the questions themselves. Supervised learning has labeled data (the answers); unsupervised doesn't.
```mermaid
flowchart TB
    subgraph ML[Machine Learning]
        S[Supervised Learning]
        U[Unsupervised Learning]
        R[Reinforcement Learning]
    end
    S --> |Labeled data| S1[Classification: spam/not spam]
    S --> |Labeled data| S2[Regression: house price]
    U --> |No labels| U1[Clustering: group similar customers]
    U --> |No labels| U2[Dimensionality reduction]
    R --> |Reward signal| R1[Game AI, robotics]
```

| Type | Input | Output | Example |
|---|---|---|---|
| Supervised | Features + Labels | Predict label for new data | Spam filter, house price |
| Unsupervised | Features only | Discover structure | Customer segmentation |
| Reinforcement | Actions + Rewards | Optimal policy | Game playing, robots |
Interactive: Is this supervised or unsupervised?¶
Before we proceed, think about each scenario:
- Predicting whether a customer will churn (leave) based on their usage data — supervised or unsupervised?
- Grouping news articles into topics without predefined categories — supervised or unsupervised?
- Estimating the sale price of a house given its features — supervised or unsupervised?
Answers: 1. Supervised (classification), 2. Unsupervised (clustering), 3. Supervised (regression)
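The distinction also shows up directly in code: a supervised model is fit on features *and* labels, while an unsupervised one sees features alone. A minimal sketch using scikit-learn (assumed installed; it is not otherwise used in this notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]])  # features
y = np.array([0, 0, 0, 1, 1, 1])                          # labels (the answer key)

# Supervised: the model is given both X and y
clf = LogisticRegression().fit(X, y)
print("Supervised prediction for x=5.1:", clf.predict([[5.1]])[0])

# Unsupervised: the model is given only X and must find structure itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_)
```

Notice the fit calls: `fit(X, y)` for the supervised model, `fit(X)` for the unsupervised one. That single argument is the whole difference.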
3. The ML Pipeline¶
Every ML project follows a flow. See the diagram in assets/diagrams/ml_pipeline.svg:
Data Collection → Cleaning → Feature Engineering → Train/Val/Test Split → Model Training → Evaluation → Deployment
Displaying the pipeline: The cell below loads and shows the ML pipeline diagram. This gives you a visual roadmap of every ML project—from raw data to deployed model.
# Display the ML pipeline diagram
from IPython.display import display, SVG
import os
pipeline_path = os.path.join("..", "..", "assets", "diagrams", "ml_pipeline.svg")
if os.path.exists(pipeline_path):
    display(SVG(filename=pipeline_path))
else:
    print("Pipeline: Data Collection → Cleaning → Feature Engineering → Split → Training → Evaluation → Deployment")
What just happened: The code displayed the ML pipeline diagram (or a text summary if the file wasn't found). This pipeline is the backbone of every ML project—you'll follow it again and again.
4. Your First ML Model: House Price Prediction¶
Linear regression is like drawing the best straight line through messy data. Imagine you have 50 houses with sizes and prices plotted on a graph—a cloud of points where bigger houses tend to cost more. Linear regression finds the single line that fits that cloud best. In most markets, adding 100 sqft adds roughly $15,000–$25,000 to the price; the line captures that. We'll predict house prices using linear regression—the simplest supervised learning model.
What do you think will happen? If we plot house size (sqft) vs price, do you expect a roughly linear relationship? Why or why not?
Before we fit a model, we need data. The cell below creates synthetic house data so we can see the full pipeline. In a real project, you'd load this from a CSV or database.
import numpy as np
import matplotlib.pyplot as plt
# Realistic house data: sqft -> price (with some noise)
# Based on typical market: ~$180/sqft baseline + noise
np.random.seed(42)
sqft = np.random.uniform(800, 3500, 50)
noise = np.random.normal(0, 30000, 50)
price = 180 * sqft + 50000 + noise
# Visualize the data
plt.figure(figsize=(8, 5))
plt.scatter(sqft, price, alpha=0.7, c='steelblue', edgecolors='navy')
plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('House Size vs Price (Raw Data)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Data shape:", sqft.shape)
What just happened: We created 50 synthetic houses with random sizes (800–3500 sqft) and prices following a linear trend plus noise. The scatter plot shows the relationship—bigger houses cost more, with natural variation. In real projects, this would come from your actual dataset.
5. Linear Regression from Scratch (NumPy)¶
The linear model is simple: \(y = w \cdot x + b\). Here \(w\) is the slope (how much price increases per sqft) and \(b\) is the intercept (base price). We find \(w\) and \(b\) by minimizing the average squared error between predictions and actual prices—this is called Mean Squared Error (MSE). The closed-form solution (normal equation) gives us the optimal values directly, no iteration needed.
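Written out, the objective and the closed-form solution the next cell implements are:

```latex
% Objective: mean squared error over n houses
\text{MSE}(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(w\,x_i + b - y_i\bigr)^2

% Stack a column of ones with x into X_b; the normal equation then
% gives both parameters at once (intercept first, slope second):
\hat{\theta} = (X_b^\top X_b)^{-1} X_b^\top y,
\qquad
\hat{\theta} = \begin{pmatrix} b \\ w \end{pmatrix},
\quad
X_b = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
```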
Implementing the fit: The cell below uses the normal equation to find optimal \(w\) and \(b\) in one shot—no loops, no gradient descent, just matrix algebra. Afterwards we'll plot the fitted line over the data to see how well it captures the relationship.
def linear_regression_fit(X, y):
"""Fit y = w*x + b using closed-form solution (normal equation)."""
X = np.array(X).reshape(-1, 1)
y = np.array(y)
# Add column of ones for intercept
X_b = np.c_[np.ones(len(X)), X]
# Solve the normal equation (X^T X) theta = X^T y
# (np.linalg.solve is more numerically stable than explicitly inverting X^T X)
params = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
b, w = params[0], params[1]
return w, b
w, b = linear_regression_fit(sqft, price)
print(f"Fitted model: price = {w:.1f} * sqft + {b:.0f}")
What just happened: We found optimal \(w\) and \(b\). The printed formula is your model—you can now predict any house price by plugging in sqft.
# Visualize: scatter plot + fitted line
plt.figure(figsize=(8, 5))
plt.scatter(sqft, price, alpha=0.7, c='steelblue', edgecolors='navy', label='Data')
x_line = np.linspace(sqft.min(), sqft.max(), 100)
y_pred = w * x_line + b
plt.plot(x_line, y_pred, 'r-', linewidth=2, label=f'Fitted: y = {w:.0f}x + {b:.0f}')
plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Linear Regression: House Price Prediction', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
What just happened: We plotted the fitted line (red) over the data. The line approximates the trend—some points are above, some below. You just built your first ML model! Let's understand what happened: we learned a relationship from 50 examples and can now predict prices for houses we've never seen.
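With \(w\) and \(b\) in hand, prediction is just plugging in. A self-contained sketch that regenerates the same seeded data, refits, and prices a hypothetical 2000 sqft house (the exact number depends on the random seed):

```python
import numpy as np

# Regenerate the same synthetic data as above (seed 42)
np.random.seed(42)
sqft = np.random.uniform(800, 3500, 50)
price = 180 * sqft + 50000 + np.random.normal(0, 30000, 50)

# Fit y = w*x + b via the normal equation (intercept first, slope second)
X_b = np.c_[np.ones(len(sqft)), sqft]
b, w = np.linalg.solve(X_b.T @ X_b, X_b.T @ price)

# Predict the price of a 2000 sqft house
new_sqft = 2000
predicted = w * new_sqft + b
print(f"Predicted price for {new_sqft} sqft: ${predicted:,.0f}")
```

The true generating process puts a 2000 sqft house at $180 × 2000 + $50,000 = $410,000, so the fitted prediction should land close to that.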
6. Train/Test Split: Why It Matters¶
You wouldn't let a student see the exam before taking it. If we evaluate on the same data we trained on, we get overly optimistic results (the model "memorized" the data).
Solution: Split data into train (fit the model) and test (evaluate on held-out data). The model never sees the test set until the final evaluation.
What do you think will happen? If we use ALL data for training, will our error on "new" houses be higher or lower than if we had held out a test set?
Splitting the data: The cell below randomly assigns 80% of houses to training and 20% to testing. We fit only on the training set, then evaluate on both.
# Train/test split (80/20)
np.random.seed(42)
indices = np.random.permutation(len(sqft))
split = int(0.8 * len(sqft))
train_idx, test_idx = indices[:split], indices[split:]
sqft_train, sqft_test = sqft[train_idx], sqft[test_idx]
price_train, price_test = price[train_idx], price[test_idx]
# Fit on train only
w, b = linear_regression_fit(sqft_train, price_train)
# Predict on train and test
pred_train = w * sqft_train + b
pred_test = w * sqft_test + b
mse_train = np.mean((pred_train - price_train) ** 2)
mse_test = np.mean((pred_test - price_test) ** 2)
print(f"Train MSE: {mse_train:,.0f} (dollars squared)")
print(f"Test MSE: {mse_test:,.0f} (dollars squared)")
print(f"Train RMSE: ${np.sqrt(mse_train):,.0f}")
print(f"Test RMSE: ${np.sqrt(mse_test):,.0f}")
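For reference, the same split-fit-evaluate loop is usually written with scikit-learn's helpers (assumed installed; `train_test_split` shuffles and splits in one call):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same synthetic data as above (seed 42)
np.random.seed(42)
sqft = np.random.uniform(800, 3500, 50)
price = 180 * sqft + 50000 + np.random.normal(0, 30000, 50)

# scikit-learn expects 2-D features: (n_samples, n_features)
X = sqft.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42)

# Fit on train only, evaluate on the held-out test set
model = LinearRegression().fit(X_train, y_train)
mse_test = mean_squared_error(y_test, model.predict(X_test))
print(f"Test RMSE: ${np.sqrt(mse_test):,.0f}")
print(f"Slope: {model.coef_[0]:.1f} $/sqft")
```

The numbers won't match the manual split exactly (the shuffles differ), but both estimates answer the same question: how far off are we, in dollars, on houses the model never saw?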
Visualizing the split: The cell below plots train points (blue) and test points (coral squares) so you can see which houses the model learned from vs. which it's being evaluated on.
# Visualize train vs test points
plt.figure(figsize=(8, 5))
plt.scatter(sqft_train, price_train, alpha=0.7, c='steelblue', label='Train', s=60)
plt.scatter(sqft_test, price_test, alpha=0.9, c='coral', marker='s', label='Test (held out)', s=80)
x_line = np.linspace(sqft.min(), sqft.max(), 100)
plt.plot(x_line, w * x_line + b, 'r-', linewidth=2, label='Model')
plt.xlabel('Square Feet', fontsize=12)
plt.ylabel('Price ($)', fontsize=12)
plt.title('Train/Test Split: Model Evaluated on Unseen Data', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
What just happened: We split the data, fit on 80%, and computed MSE on both sets. Test MSE is the honest measure—it tells us how well the model will perform on new houses. If train MSE is much lower than test MSE, the model may be overfitting. The plot shows train (blue) vs test (coral) points—the model never saw the coral points during training.
Summary¶
- Machine learning = learning from data instead of explicit rules
- Supervised (labels) vs Unsupervised (no labels) vs Reinforcement (rewards)
- ML Pipeline: Data → Clean → Features → Split → Train → Evaluate → Deploy
- Linear regression predicts a continuous value: \(y = w x + b\)
- Train/test split ensures we measure generalization, not memorization
Next: Feature engineering, cross-validation, and evaluation metrics.
Try it yourself: Change the train/test split to 70/30 and see how MSE changes. Try predicting a house with 2500 sqft using your fitted model.
Common mistakes: Forgetting to split before fitting, or evaluating on the same data you trained on. Always hold out a test set!