
Ch 3: Linear Algebra & Calculus - Intermediate

Track: Foundation | Try code in Playground | Back to chapter overview

Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-03-linear-algebra/notebooks/02_intermediate.ipynb in Jupyter.


Chapter 3: Linear Algebra & Calculus

Notebook 02 - Intermediate

Matrices are how we represent batches of data, linear transformations, and neural network layers.

What you'll learn:
- Matrices and matrix operations (multiply, transpose)
- Identity matrix, inverse
- Linear transformations as matrix multiplication
- Introduction to NumPy for efficient computation
- Practical: image as matrix, data as matrix

Time estimate: 3.5 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

1. Why Matrices Matter for AI

A matrix is a grid of numbers. Like a spreadsheet: rows and columns. A grayscale image is a matrix—each cell is a pixel intensity (0=black, 255=white). A batch of 32 house feature vectors, each with 5 features, is a 32×5 matrix: 32 rows (samples), 5 columns (features). Matrices let you process many things at once: one matrix multiply does 32 dot products. That's the essence of batch processing in neural networks.

| Structure | AI Use |
| --- | --- |
| Matrix (2D) | Batch of feature vectors, weight matrix, image (grayscale) |
| Tensor (3D+) | Batch of images, sequence of embeddings, video |
| Linear layer | y = XW + b — matrix multiply + bias |
| Attention | scores = QKᵀ — dot products in matrix form |

# Data as matrix: rows = samples, columns = features
# This is the standard layout in ML

X = [
    [1, 2, 3],   # sample 1
    [4, 5, 6],   # sample 2
    [7, 8, 9],   # sample 3
]

print("Data matrix X: 3 samples × 3 features")
for i, row in enumerate(X):
    print(f"  Sample {i+1}: {row}")

2. Matrix Operations (Pure Python First)

Transpose: swap rows and columns. Row 1 becomes column 1, row 2 becomes column 2. (Aᵀ)ᵢⱼ = Aⱼᵢ

Matrix multiply (step-by-step in words): To get the (i,j) entry of AB, take row i of A and column j of B, multiply corresponding elements, and sum. So each output cell is a dot product. For A (2×2) and B (2×2), you compute 4 dot products. The inner dimensions must match: A (m×k) × B (k×n) → (m×n). Matrix multiply is the core of every linear layer: each output neuron = dot product of one weight row with the input.

def matrix_transpose(A):
    """Transpose: rows become columns."""
    if not A:
        return []
    n_rows, n_cols = len(A), len(A[0])
    return [[A[i][j] for i in range(n_rows)] for j in range(n_cols)]


def matrix_multiply(A, B):
    """Matrix multiply: A (m×k) @ B (k×n) -> (m×n)."""
    m, k1 = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    if k1 != k2:
        raise ValueError(f"Inner dims must match: {k1} vs {k2}")

    result = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            result[i][j] = sum(A[i][kk] * B[kk][j] for kk in range(k1))
    return result


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print("A =", A)
print("A^T =", matrix_transpose(A))
print("A @ B =", matrix_multiply(A, B))

What just happened

We transposed A (rows became columns) and computed A×B element by element. Each cell of the result is the dot product of the corresponding row of A and column of B. This is exactly what happens in a linear layer: the weight matrix W has one row per output neuron, and each row is dotted with the input vector.

3. Matrix-Vector Multiplication

A linear layer computes y = Wx (plus a bias). The weight matrix W transforms the input x into the output y.

Treat the vector as a column: W (m×n) @ x (n×1) → y (m×1)

def matrix_vector_multiply(A, x):
    """Matrix (m×n) times vector (n) -> vector (m). Core of linear layer."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]


# Simulate a linear layer: 3 inputs -> 2 outputs
W = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.5]]  # 2×3 weight matrix
x = [1.0, 2.0, 3.0]  # input vector

y = matrix_vector_multiply(W, x)
print("Linear layer: y = Wx")
print(f"  W (2×3) @ x (3) = {y}")

4. Identity and Inverse

Identity matrix I: 1s on diagonal, 0s elsewhere. A·I = I·A = A.

Inverse A⁻¹: A·A⁻¹ = I. Only exists for square, full-rank matrices. Used in least squares, some optimizers.

What "transforming" data means. Multiplying by a matrix changes the data: scale it (stretch/shrink), rotate it, or project it onto a lower dimension. Scaling multiplies each dimension by a factor. Rotation changes direction without changing length. In data augmentation for images, we apply rotation matrices to pixel coordinates. In neural networks, each linear layer applies a learned matrix—the network learns which transformations best map input to output.
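The scaling case is easy to see directly: a diagonal matrix stretches each dimension by its own factor. A tiny sketch (the factors here are illustrative):

```python
# Scaling: a diagonal matrix stretches each dimension independently
S = [[2.0, 0.0],
     [0.0, 0.5]]   # double x, halve y (illustrative factors)
point = [3.0, 4.0]

# Matrix-vector multiply: each output = one row of S dotted with the point
scaled = [sum(S[i][j] * point[j] for j in range(2)) for i in range(2)]
print(scaled)  # [6.0, 2.0]
```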

def identity_matrix(n):
    """Create n×n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


I = identity_matrix(3)
print("Identity I (3×3):")
for row in I:
    print(f"  {row}")

# I @ A = A
print(f"\nI @ X[0] = {matrix_vector_multiply(I, X[0])}")
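To see A·A⁻¹ = I concretely, here is a pure-Python sketch for the 2×2 case, where the inverse has a closed form. The helper names `inverse_2x2` and `matmul2` are illustrative, not part of the notebook's earlier code:

```python
def inverse_2x2(A):
    """Closed-form inverse of a 2×2 matrix (only defined when det != 0)."""
    a, b = A[0]
    c, d = A[1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("Matrix is singular: no inverse exists")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def matmul2(P, Q):
    """2×2 matrix multiply: row of P dotted with column of Q."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
A_inv = inverse_2x2(A)
print("A^-1     =", A_inv)
print("A @ A^-1 =", matmul2(A, A_inv))  # identity: [[1.0, 0.0], [0.0, 1.0]]
```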

5. Linear Transformations

Matrix multiplication = linear transformation: rotation, scaling, shearing, projection.

Example: 2D rotation by θ: R = [[cos θ, -sin θ], [sin θ, cos θ]]

import math


def rotation_matrix_2d(angle_degrees):
    """2D rotation matrix. Used in data augmentation (image rotation)."""
    theta = math.radians(angle_degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Rotate point (1, 0) by 90 degrees
R = rotation_matrix_2d(90)
point = [1, 0]
rotated = matrix_vector_multiply(R, point)
print(f"Rotate (1,0) by 90°: {rotated} (expect ~[0, 1])")
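One consequence of linearity worth checking: composing two rotations is the same as multiplying their matrices, so R(30°)·R(60°) should equal R(90°). A small self-contained sketch (the helper names `rot2d` and `matmul2` are illustrative):

```python
import math


def rot2d(deg):
    """2×2 rotation matrix for an angle in degrees."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s], [s, c]]


def matmul2(A, B):
    """2×2 matrix multiply, same row-times-column rule as matrix_multiply."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


composed = matmul2(rot2d(30), rot2d(60))  # rotate by 60°, then by 30°
direct = rot2d(90)

print("R(30) @ R(60) =", composed)
print("R(90)         =", direct)
# The two agree up to floating-point error: composing linear maps
# is the same as multiplying their matrices.
```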

6. NumPy: Efficient Linear Algebra

Why NumPy exists. Pure Python is too slow for real ML. Python loops over arrays are interpreted, one element at a time. NumPy runs tight C loops and uses SIMD (single instruction, multiple data) to process many numbers in parallel. A dot product of 1M elements can be 50–100× faster in NumPy than in pure Python. Every major ML framework (PyTorch, TensorFlow) builds on this: arrays live in contiguous memory, operations are vectorized. For learning, we used pure Python; for real work, NumPy (and beyond) is essential.

What just happened

NumPy's syntax mirrors math: u + v, u @ v, np.linalg.norm(u). The @ operator is matrix multiplication. Under the hood, these dispatch to optimized C/Fortran code. Same operations we implemented in Python—now fast enough for millions of parameters.

import numpy as np

# Same operations, NumPy style
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print("NumPy vector ops:")
print(f"  u + v   = {u + v}")
print(f"  2 * u   = {2 * u}")
print(f"  u @ v   = {np.dot(u, v)}")
print(f"  ||u||₂  = {np.linalg.norm(u)}")
# NumPy matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("NumPy matrix ops:")
print(f"A @ B =\n{A @ B}")
print(f"A.T =\n{A.T}")
print(f"I =\n{np.eye(3)}")
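You can get a feel for the speed gap yourself with a rough timing sketch. Exact speedups vary by machine and NumPy build, so treat the numbers as indicative, not definitive:

```python
import time

import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)
a_list, b_list = a.tolist(), b.tolist()

# Pure Python: interpreted loop, one element at a time
t0 = time.perf_counter()
slow = sum(x * y for x, y in zip(a_list, b_list))
t_py = time.perf_counter() - t0

# NumPy: vectorized C loop over contiguous memory
t0 = time.perf_counter()
fast = a @ b
t_np = time.perf_counter() - t0

print(f"pure Python: {t_py * 1e3:.1f} ms, NumPy: {t_np * 1e3:.1f} ms")
print(f"same answer: {bool(np.isclose(slow, fast))}")
```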

7. Practical: Image as Matrix

How computers see images. A grayscale image is a 2D matrix: each cell is one pixel intensity (0=black, 255=white). Rows = height, columns = width; convolutions and pooling operate directly on this grid. A color image adds a third dimension, 3 channels (R, G, B), so a 100×100 RGB image is a 100×100×3 tensor: (height × width × channels), or (channels × height × width) in PyTorch.

# Create a tiny 5×5 "image" (matrix)
image = np.array([
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
])

print("5×5 'image' (simple pattern):")
print(image)
print(f"Shape: {image.shape}")
print(f"Flattened (as vector): {image.flatten()}")

What just happened

We created a 5×5 "image"—really just a matrix of 0s and 1s forming a simple diamond pattern. image.shape gives (5, 5). Flattening turns it into a 25-dimensional vector (all rows concatenated end to end). A real image (e.g., 224×224) would be 50,176 numbers as a vector—that's what goes into the first layer of a CNN.

# Visualize with matplotlib
import matplotlib.pyplot as plt

plt.imshow(image, cmap="gray")  # grayscale colormap
plt.title("Image as Matrix (5×5)")
plt.colorbar()
plt.show()

8. Practical: Data Matrix and Batch Processing

In ML, we process batches: (batch_size × features). One matrix multiply does all samples at once.

# Batch: 4 samples, 3 features each
X_batch = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [1, 1, 1],
])

# Weight matrix: 3 -> 2 (e.g., hidden layer)
W = np.array([[0.5, -0.3, 0.2], [0.1, 0.4, -0.5]])

# One matrix multiply: all 4 samples at once
# (4×3) @ (3×2) = (4×2) — W.T makes (2×3) into (3×2)
Y = X_batch @ W.T

print("Batch linear transform: Y = X @ W^T")
print(f"X: {X_batch.shape}, W: {W.shape}")
print(f"Y: {Y.shape}")
print(Y)

What just happened

We processed 4 samples in one matrix multiply: X_batch (4×3) @ W.T (3×2) = Y (4×2). Each row of Y is the linear layer output for that sample. This is batch processing—one operation instead of 4 separate dot products. Real training uses batches of 32, 64, or 256.
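The table at the top listed a linear layer as y = XW + b. The bias is added to the whole batch in one line via NumPy broadcasting: a bias vector of shape (2,) is automatically added to every row of the (4×2) result. A sketch with the same shapes as above (the bias values are illustrative):

```python
import numpy as np

X_batch = np.array([[1., 2., 3.],
                    [4., 5., 6.],
                    [7., 8., 9.],
                    [1., 1., 1.]])      # 4 samples, 3 features
W = np.array([[0.5, -0.3, 0.2],
              [0.1, 0.4, -0.5]])        # 2 output neurons, 3 weights each
b = np.array([0.1, -0.2])               # one bias per output neuron (illustrative)

# Broadcasting: b (shape (2,)) is added to every row of X @ W.T (shape (4, 2))
Y = X_batch @ W.T + b
print(Y.shape)   # (4, 2)
print(Y[0])      # first sample's output, approximately [0.6, -0.8]
```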

9. Tensor Basics

Tensor = a matrix generalized to arbitrary dimensions:
- 0D: scalar
- 1D: vector
- 2D: matrix
- 3D: e.g., batch of grayscale images (B × H × W)
- 4D: e.g., batch of color images (B × C × H × W)

# Batch of 2 grayscale images, 5×5 each
batch_images = np.random.rand(2, 5, 5)
print(f"Tensor shape: {batch_images.shape} (batch=2, height=5, width=5)")
print(f"Single image shape: {batch_images[0].shape}")
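A common tensor manipulation is flattening each image in a batch before feeding a fully connected layer. reshape does this in one call (the -1 tells NumPy to infer that dimension):

```python
import numpy as np

batch = np.random.rand(2, 5, 5)   # 2 grayscale 5×5 images
flat = batch.reshape(2, -1)       # -1: infer 25 = 5*5 from the remaining dims
print(flat.shape)                 # (2, 25)

# Each flattened row matches flattening the corresponding image
print(bool(np.allclose(flat[0], batch[0].flatten())))  # True
```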

What's Next?

In Notebook 03 (Advanced), we'll cover:
- Derivatives and partial derivatives
- Gradients and gradient descent
- Chain rule and backpropagation intuition
- Capstone: implement gradient descent for linear regression from scratch



