
Ch 3: Linear Algebra & Calculus - Introduction


Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-03-linear-algebra/notebooks/01_introduction.ipynb in Jupyter.


Chapter 3: Linear Algebra & Calculus for Machine Learning

Notebook 01 - Introduction

Vectors are the fundamental building blocks of machine learning. Every feature vector, embedding, and neural network weight lives in a vector space.

What you'll learn:

- Vectors and vector operations (addition, scalar multiplication, dot product)
- Vector norms (L1, L2)
- Practical examples: feature vectors, word embedding similarity
- Pure Python implementations to build intuition (no NumPy yet)

Time estimate: 3 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

1. Why Vectors Matter for AI

What vectors represent in ML. A vector is a list of numbers that captures one "thing"—a house, a word, a user. A feature vector describes a data point: a house might be [1200 sqft, 3 bedrooms, 2 baths, 10 years old, 5.2 km to city]. That's one point in 5D space. Word embeddings map words to vectors so that "king" and "queen" are close (similar vectors) while "king" and "car" are far. In Word2Vec that's 300 numbers per word; in BERT it's 768. Every layer of a neural network takes vectors in and puts vectors out. When you pass an image through a CNN, you're really passing a big vector (flattened pixels) through a sequence of transformations. Understanding vectors is understanding how ML "sees" data.

In machine learning, almost everything is a vector:

- Feature vectors: each data point (e.g., a house with 5 features) → 5D vector
- Word embeddings: words mapped to dense vectors (e.g., 300 dimensions in Word2Vec)
- Model weights: neural network parameters are vectors and matrices
- Predictions: logits, probabilities, and outputs are vectors

Understanding vector operations is essential for implementing and debugging ML algorithms.

# Example: a feature vector for a house (price prediction)
# [sqft, bedrooms, bathrooms, age, distance_to_city]
house_a = [1200, 3, 2, 10, 5.2]  # 5-dimensional feature vector
house_b = [1800, 4, 3, 5, 2.1]

print("Feature vectors in ML:")
print(f"  House A: {house_a}")
print(f"  House B: {house_b}")
print(f"  Dimension: {len(house_a)} features")

What just happened

We defined two houses as 5-dimensional feature vectors. Each number is a feature (sqft, bedrooms, etc.). In a price prediction model, the input would be a vector like this; the model learns weights to combine them into a predicted price. Try it yourself: Add a sixth feature (e.g., "has garage") to both vectors and adjust the dimension print.
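To make "the model learns weights to combine them" concrete, here is a minimal sketch of a linear price model. The weight and bias values are made up for illustration, not taken from any trained model:

```python
# Hypothetical weights for [sqft, bedrooms, bathrooms, age, distance_to_city]
# (illustrative values only, not from a trained model)
weights = [150.0, 10000.0, 7500.0, -1000.0, -2000.0]
bias = 20000.0

house_a = [1200, 3, 2, 10, 5.2]

# A linear model's prediction is a weighted sum of the features plus a bias
predicted_price = sum(w * x for w, x in zip(weights, house_a)) + bias
print(f"Predicted price for house A: ${predicted_price:,.0f}")
```

Note the signs: more square footage pushes the price up, while age and distance to the city pull it down.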

2. Vector Operations

Three core operations we need:

| Operation | Formula | AI Use Case |
| --- | --- | --- |
| Addition | u + v = (u₁+v₁, u₂+v₂, ...) | Combining embeddings, batch updates |
| Scalar multiply | c·u = (c·u₁, c·u₂, ...) | Learning rate scaling, normalization |
| Dot product | u·v = Σ uᵢ·vᵢ | Similarity, attention scores, linear layer output |

def vector_add(u, v):
    """Add two vectors element-wise. Used when combining representations."""
    if len(u) != len(v):
        raise ValueError("Vectors must have same dimension")
    return [a + b for a, b in zip(u, v)]


def vector_scale(c, u):
    """Multiply vector by scalar. Used for learning rate, normalization."""
    return [c * x for x in u]


def vector_dot(u, v):
    """Dot product: sum of element-wise products. Core of linear layers & similarity."""
    if len(u) != len(v):
        raise ValueError("Vectors must have same dimension")
    return sum(a * b for a, b in zip(u, v))


# Demo
u = [1, 2, 3]
v = [4, 5, 6]

print("Vector operations (pure Python):")
print(f"  u + v      = {vector_add(u, v)}")
print(f"  2 * u      = {vector_scale(2, u)}")
print(f"  u · v      = {vector_dot(u, v)}")
print(f"  (1*4 + 2*5 + 3*6 = {1*4+2*5+3*6})")

What just happened

We implemented vector_add, vector_scale, and vector_dot in pure Python. The dot product 1×4 + 2×5 + 3×6 = 32 measures how much the vectors "align." In a linear layer, each neuron computes a dot product between its weight vector and the input. Try it yourself: Compute vector_dot(house_a, house_b)—what does a high or low value mean for house similarity?

3. Dot Product: The Heart of ML

The dot product measures alignment between vectors:

- Positive: vectors point in a similar direction (similar)
- Zero: vectors are perpendicular (uncorrelated)
- Negative: vectors point in opposite directions (dissimilar)

In neural networks: output = weights · input — a single neuron computes a dot product!
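As a minimal sketch of that claim (the ReLU activation and the toy weights are assumptions added for illustration):

```python
def vector_dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))


def neuron(weights, bias, x):
    """One neuron: dot product of its weights with the input, plus a bias,
    passed through a ReLU activation (a common choice, assumed here)."""
    z = vector_dot(weights, x) + bias
    return max(0.0, z)  # ReLU: negative pre-activations become 0


# Toy weights and input (illustrative values)
w = [0.5, -0.2, 0.1]
x = [1.0, 2.0, 3.0]
print(neuron(w, 0.1, x))  # ≈ 0.5  (0.5 - 0.4 + 0.3 + 0.1)
```

A full linear layer is just many such neurons sharing the same input, one dot product per output dimension.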

# Simulated word embeddings (3D for visualization)
# In real NLP: 300D (Word2Vec) or 768D (BERT)

word_embeddings = {
    "king":    [0.8, 0.3, 0.1],
    "queen":   [0.7, 0.4, 0.2],
    "man":     [0.6, 0.2, 0.1],
    "woman":   [0.5, 0.3, 0.2],
    "car":     [0.1, 0.1, 0.9],
}

def similarity(u, v):
    """Dot product as similarity (higher = more similar)."""
    return vector_dot(u, v)

print("Word embedding similarity (dot product):")
pairs = [("king", "queen"), ("king", "man"), ("king", "car")]
for w1, w2 in pairs:
    sim = similarity(word_embeddings[w1], word_embeddings[w2])
    print(f"  {w1} · {w2} = {sim:.3f}")

What just happened

We compared word embeddings using the dot product as similarity. "king" · "queen" is higher than "king" · "car" because the vectors for related words point in similar directions. This is how semantic search works: convert a query to a vector, then find documents whose embeddings have high dot product with it. Try it yourself: Add a new word (e.g., "prince") to the embeddings dict and compare its similarity to "king" and "queen."
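The semantic-search idea above can be sketched in a few lines. The "document" embeddings and the query here are invented 3D toy vectors in the same spirit as the embeddings dict:

```python
def vector_dot(u, v):
    """Dot product: sum of element-wise products."""
    return sum(a * b for a, b in zip(u, v))


# Toy document embeddings (made-up 3D vectors for illustration)
docs = {
    "royal history": [0.7, 0.35, 0.15],
    "auto repair":   [0.1, 0.15, 0.85],
}
query = [0.8, 0.3, 0.1]  # a query embedding, here the toy "king" vector

# Rank documents by dot product with the query, highest first
ranked = sorted(docs, key=lambda d: vector_dot(query, docs[d]), reverse=True)
print(ranked)  # "royal history" scores higher than "auto repair"
```

Real systems do exactly this, just in hundreds of dimensions and with approximate nearest-neighbor indexes instead of a full sort.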

4. Vector Norms

A norm measures the "length" or "magnitude" of a vector.

| Norm | Formula | Use Case |
| --- | --- | --- |
| L1 (Manhattan) | ‖u‖₁ = Σ \|uᵢ\| | Sparsity, L1 regularization |
| L2 (Euclidean) | ‖u‖₂ = √(Σ uᵢ²) | Distance, normalization, L2 regularization |

L2 norm is the standard geometric length. Used in: cosine similarity, gradient clipping, weight decay.

Why cosine similarity matters. The dot product grows with vector length—a long document will have a larger dot product with a query even if a short one is more relevant. Cosine similarity divides by both lengths, so you compare direction only. Search engines rank documents this way; recommendation systems find similar users. In NLP, it's the standard for comparing word embeddings.
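A quick sanity check of that length effect, using toy vectors chosen for illustration: doubling a vector's length doubles its dot product with a query, even though its direction is unchanged, while dividing by both lengths removes the effect.

```python
import math


def vector_dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm_l2(u):
    return math.sqrt(sum(x * x for x in u))


query = [1.0, 2.0]
doc = [2.0, 1.0]
long_doc = [4.0, 2.0]  # same direction as doc, twice the length

print(vector_dot(query, doc))       # 4.0
print(vector_dot(query, long_doc))  # 8.0 -- dot product doubled

# Dividing by both lengths compares direction only: same value for both
cos_doc = vector_dot(query, doc) / (norm_l2(query) * norm_l2(doc))
cos_long = vector_dot(query, long_doc) / (norm_l2(query) * norm_l2(long_doc))
print(round(cos_doc, 6) == round(cos_long, 6))  # True
```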

import math


def norm_l1(u):
    """L1 norm: sum of absolute values."""
    return sum(abs(x) for x in u)


def norm_l2(u):
    """L2 norm: Euclidean length = sqrt of sum of squares."""
    return math.sqrt(sum(x * x for x in u))


# Demo
v = [3, 4]
print(f"Vector v = {v}")
print(f"  L1 norm: {norm_l1(v)}")
print(f"  L2 norm: {norm_l2(v)} (3-4-5 triangle: sqrt(9+16)=5)")
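Gradient clipping is a direct application of the L2 norm: if a gradient's norm exceeds a threshold, rescale it so the norm equals the threshold while the direction is preserved. A minimal sketch (the threshold value is arbitrary):

```python
import math


def norm_l2(u):
    """L2 norm: Euclidean length = sqrt of sum of squares."""
    return math.sqrt(sum(x * x for x in u))


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction unchanged."""
    n = norm_l2(grad)
    if n <= max_norm:
        return grad
    scale = max_norm / n
    return [scale * g for g in grad]


grad = [3.0, 4.0]                 # norm 5 (the 3-4-5 triangle again)
clipped = clip_by_norm(grad, 1.0)
print(clipped, norm_l2(clipped))  # same direction, norm ≈ 1.0
```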

5. Cosine Similarity

Dot product alone depends on magnitude. Cosine similarity normalizes by vector lengths:

\[\text{cosine}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}\]

Range: -1 (opposite) to +1 (identical direction). Most common similarity metric in NLP for embeddings.

def cosine_similarity(u, v):
    """Cosine similarity: dot product normalized by L2 norms."""
    dot = vector_dot(u, v)
    norm_u = norm_l2(u)
    norm_v = norm_l2(v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Compare embeddings with cosine similarity
print("Cosine similarity (direction only, ignores magnitude):")
for w1, w2 in [("king", "queen"), ("king", "car"), ("queen", "woman")]:
    u, v = word_embeddings[w1], word_embeddings[w2]
    cos = cosine_similarity(u, v)
    print(f"  {w1} vs {w2}: {cos:.3f}")

6. Practical: Feature Vector Distance

In k-NN, clustering, and retrieval: we find "nearest" points using Euclidean distance:

\[d(u, v) = \|u - v\|_2 = \sqrt{\sum_i (u_i - v_i)^2}\]
def vector_subtract(u, v):
    """Subtract vectors: u - v."""
    return [a - b for a, b in zip(u, v)]


def euclidean_distance(u, v):
    """L2 distance between two points. Used in k-NN, k-means."""
    diff = vector_subtract(u, v)
    return norm_l2(diff)


# Find closest house to a query
query = [1500, 3, 2, 8, 3.0]
houses = {
    "A": house_a,
    "B": house_b,
}

print("Euclidean distance from query (closest = most similar features):")
for name, features in houses.items():
    dist = euclidean_distance(query, features)
    print(f"  House {name}: {dist:.2f}")

What just happened

We computed Euclidean distance between the query house and each option. The closer the features (sqft, bedrooms, etc.), the smaller the distance. k-NN finds the k nearest neighbors this way; k-means assigns points to the cluster whose center is closest. Try it yourself: Create a third house and find which of A or B is closer to it.
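The full k-NN step is just this distance plus a sort and a vote. A minimal sketch with k=3; the labeled houses and their price bands are invented for illustration:

```python
import math


def euclidean_distance(u, v):
    """L2 distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(query, labeled_points, k=3):
    """Majority label among the k nearest neighbors of query."""
    nearest = sorted(labeled_points,
                     key=lambda p: euclidean_distance(query, p[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)


# Toy labeled houses: (features, price band) -- values invented
data = [
    ([1200, 3, 2, 10, 5.2], "mid"),
    ([1800, 4, 3, 5, 2.1], "high"),
    ([1550, 3, 2, 7, 2.8], "high"),
    ([1100, 2, 1, 20, 8.0], "low"),
]

print(knn_predict([1500, 3, 2, 8, 3.0], data))
```

One caveat worth noticing: sqft dominates these distances because its scale is thousands of times larger than the other features, which is why real pipelines normalize features first.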

7. Summary

You've implemented from scratch:

- vector_add, vector_scale, vector_dot — the core operations
- norm_l1, norm_l2 — vector magnitude
- cosine_similarity — the standard for embedding similarity in NLP
- euclidean_distance — used in k-NN and clustering

All in pure Python! Next notebook: matrices and NumPy for efficient computation.



