
Ch 2: Data Structures & Algorithms - Advanced


Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-02-data-structures/notebooks/03_advanced.ipynb in Jupyter.


Chapter 2: Data Structures & Algorithms

Notebook 03 - Advanced

Trees, graphs, and dynamic programming: the structures behind neural networks, knowledge graphs, and optimization.

What you'll learn:

  • Binary trees and traversals
  • Graphs: BFS and DFS
  • Dynamic programming fundamentals
  • Capstone: building a simple text search index

Time estimate: 1.5 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

1. Trees

Think of a family tree or an organization chart. One root at the top, branches below, each node (except leaves) has children. There are no cycles—you can't go down and loop back to where you started. Trees are hierarchical structures: parent-child relationships, one "level" at a time.

Why decision trees use this structure. A decision tree asks yes/no questions: "Is feature X > 5?" If yes, go left; if no, go right. Each internal node is a question; each leaf is a prediction. The tree we build below mimics this: "accuracy > 0.9?" → deploy or consider retraining. Trees also appear in parse trees (NLP: how words group into phrases), file systems, and JSON—any nested, one-parent-per-node structure.

class TreeNode:
    """A binary tree node."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def inorder(node):
    """Left -> Root -> Right (gives sorted order for BST)."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)


def preorder(node):
    """Root -> Left -> Right (used to serialize trees)."""
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)


def tree_height(node):
    """Height: number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(tree_height(node.left), tree_height(node.right))


# Build a simple decision tree structure
#        accuracy > 0.9?
#       /              \
#    deploy          retrain?
#                   /       \
#              more_data   change_model

tree = TreeNode("accuracy > 0.9?",
    left=TreeNode("deploy"),
    right=TreeNode("retrain?",
        left=TreeNode("more_data"),
        right=TreeNode("change_model")
    )
)

print(f"In-order:  {inorder(tree)}")
print(f"Pre-order: {preorder(tree)}")
print(f"Height:    {tree_height(tree)}")

What just happened

We defined a TreeNode with value, left, and right children. In-order traversal (left → root → right) visits nodes in sorted order for a binary search tree. Pre-order (root → left → right) is used to serialize trees or copy structure. Height counts the longest path from root to leaf. Our toy decision tree shows a deploy/retrain workflow: the tree answers "accuracy > 0.9?" and branches accordingly.
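To make the "sorted order for a BST" claim concrete, here is a minimal standalone sketch (re-declaring a small node class and the in-order traversal so it runs on its own):

```python
class Node:
    """Minimal binary tree node for the demo."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def inorder(node):
    # Left -> Root -> Right
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)


# A small binary search tree: left descendants < node < right descendants
bst = Node(8,
    left=Node(3, left=Node(1), right=Node(6)),
    right=Node(10, right=Node(14)))

print(inorder(bst))  # [1, 3, 6, 8, 10, 14]
```

The values come out sorted even though we never called `sort()`: the BST invariant plus the left-root-right visiting order does the work.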

2. Graphs

Think of a social network. People are nodes; friendships are edges connecting them. Unlike trees, graphs can have cycles (Alice knows Bob who knows Carol who knows Alice) and multiple connections. Graphs model any system of relationships: knowledge graphs (entities and relations), neural network architectures (layers and connections), pipeline dependencies (step A feeds into step B).

BFS vs DFS—when to use each. BFS (breadth-first) explores level by level: all neighbors at distance 1, then distance 2, and so on. Use BFS when you need the shortest path or want to process nodes in order of distance. DFS (depth-first) goes as deep as possible before backtracking. Use DFS for exhaustive exploration, cycle detection, or when the structure is naturally recursive (e.g., dependency resolution).
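The "shortest path" claim can be sketched with a standalone BFS that tracks the path to each node. The graph below is a hypothetical mini-network (not the pipeline example that follows), chosen so that two routes exist and BFS finds the shorter one:

```python
from collections import deque


def shortest_path(edges, src, dst):
    """BFS finds a path with the fewest edges in an unweighted graph."""
    queue = deque([[src]])  # queue of paths, not single nodes
    visited = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path  # first time we reach dst = fewest edges
        for neighbor in edges.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path exists


# Two routes A -> D: via B (2 edges) or via C, E (3 edges)
edges = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["D"]}
print(shortest_path(edges, "A", "D"))  # ['A', 'B', 'D']
```

Because BFS explores all paths of length k before any of length k+1, the first path that reaches the destination is guaranteed to be among the shortest.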

from collections import deque


class Graph:
    """A directed graph using adjacency list."""

    def __init__(self):
        self.edges = {}  # node -> [neighbors]

    def add_edge(self, src, dst):
        self.edges.setdefault(src, []).append(dst)
        self.edges.setdefault(dst, [])  # ensure dst exists

    def bfs(self, start):
        """Breadth-first search: explore level by level."""
        visited = set()
        queue = deque([start])
        order = []

        while queue:
            node = queue.popleft()
            if node not in visited:
                visited.add(node)
                order.append(node)
                for neighbor in self.edges.get(node, []):
                    if neighbor not in visited:
                        queue.append(neighbor)
        return order

    def dfs(self, start, visited=None):
        """Depth-first search: explore as deep as possible first."""
        if visited is None:
            visited = set()

        visited.add(start)
        order = [start]

        for neighbor in self.edges.get(start, []):
            if neighbor not in visited:
                order.extend(self.dfs(neighbor, visited))
        return order

    def has_path(self, src, dst):
        """Check if there's a path from src to dst."""
        return dst in self.bfs(src)


# Model a simple ML pipeline dependency graph
pipeline = Graph()
pipeline.add_edge("raw_data", "clean_data")
pipeline.add_edge("clean_data", "features")
pipeline.add_edge("clean_data", "labels")
pipeline.add_edge("features", "train_model")
pipeline.add_edge("labels", "train_model")
pipeline.add_edge("train_model", "evaluate")
pipeline.add_edge("evaluate", "deploy")

print("Pipeline BFS (execution order):")
for step in pipeline.bfs("raw_data"):
    print(f"  -> {step}")

print(f"\nPath raw_data -> deploy? {pipeline.has_path('raw_data', 'deploy')}")
print(f"Path deploy -> raw_data? {pipeline.has_path('deploy', 'raw_data')}")

What just happened

We built a directed graph with an adjacency list (each node maps to its neighbors). BFS uses a queue to process nodes level by level—notice "raw_data" → "clean_data" → "features"/"labels" → "train_model" → "evaluate" → "deploy" follows the pipeline order. DFS would go deep down one path first. has_path checks if you can reach one node from another; "deploy → raw_data" is false because the graph is directed.
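One caveat worth sketching: BFS happened to emit the pipeline in a valid execution order here, but that is not guaranteed in general. Dependency ordering is usually done with a topological sort; below is a minimal sketch of Kahn's algorithm, using a plain dict instead of the `Graph` class so it runs standalone:

```python
from collections import deque


def topological_sort(edges):
    """Kahn's algorithm: repeatedly emit nodes whose prerequisites are all done."""
    indegree = {node: 0 for node in edges}
    for node in edges:
        for neighbor in edges[node]:
            indegree[neighbor] = indegree.get(neighbor, 0) + 1

    queue = deque(n for n, d in indegree.items() if d == 0)  # no prerequisites
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in edges.get(node, []):
            indegree[neighbor] -= 1  # one prerequisite satisfied
            if indegree[neighbor] == 0:
                queue.append(neighbor)
    return order  # shorter than len(edges) means a cycle was found


edges = {
    "raw_data": ["clean_data"],
    "clean_data": ["features", "labels"],
    "features": ["train_model"],
    "labels": ["train_model"],
    "train_model": ["evaluate"],
    "evaluate": ["deploy"],
    "deploy": [],
}
order = topological_sort(edges)
print(order)  # every edge goes from an earlier step to a later one
```

Unlike BFS, this never emits a node before all of its prerequisites, even when the graph has multiple roots or uneven branch depths.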

3. Dynamic Programming

DP breaks big problems into smaller ones and remembers the answers. Instead of recomputing the same subproblem over and over, you store results in a table. When you need "what's the answer for size 5?" you look it up instead of recomputing. Classic example: Fibonacci. F(5) = F(4) + F(3), and F(4) needs F(3) and F(2)—F(3) gets computed many times without DP. With a table, each F(k) is computed once.

In AI: sequence alignment (NLP), Viterbi algorithm (HMMs), optimal policy computation (RL).
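The Fibonacci example above can be sketched three ways: the naive recursion, memoization (top-down DP), and a bottom-up table:

```python
from functools import lru_cache


def fib_naive(n):
    # Exponential time: F(3) alone is recomputed many times inside F(5), F(6), ...
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n):
    # Top-down DP: each F(k) is computed once, then looked up -> O(n) time
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)


def fib_table(n):
    # Bottom-up DP: fill the table from small subproblems to large
    table = [0, 1] + [0] * max(n - 1, 0)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table[n]


print(fib_memo(30), fib_table(30))  # 832040 832040
```

The two DP versions compute the same answers; memoization follows the natural recursion and caches, while the table version makes the evaluation order explicit, which is the style the edit-distance code below uses.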

Edit distance (Levenshtein): what it means. How many single-character edits (insert, delete, substitute) does it take to turn one word into another? "machin" → "machine" needs one insert (e). "king" → "queen" needs five edits (four substitutions plus one insertion). Spell checkers use this to find the closest word in the dictionary. The algorithm fills a table where each cell is the minimum number of edits to transform a prefix of s1 into a prefix of s2.

def edit_distance(s1, s2):
    """Levenshtein distance: min edits to transform s1 into s2.

    Used in NLP for: spell checking, fuzzy matching, sequence alignment.
    """
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # deletion
                    dp[i][j-1],    # insertion
                    dp[i-1][j-1],  # substitution
                )

    return dp[m][n]


# Spell checking: find closest word
vocabulary = ["machine", "learning", "deep", "neural", "network",
              "training", "dataset", "model", "algorithm", "tensor"]

misspelled = "machin"
distances = [(word, edit_distance(misspelled, word)) for word in vocabulary]
distances.sort(key=lambda x: x[1])

print(f"Spell check for '{misspelled}':")
for word, dist in distances[:5]:
    print(f"  {word:>12}: {dist} edit(s)")

# Similarity between model names
pairs = [("gpt-4", "gpt-3"), ("bert", "bart"), ("llama", "alpaca")]
for a, b in pairs:
    print(f"\n  '{a}' vs '{b}': {edit_distance(a, b)} edits")

What just happened

The edit distance table is built bottom-up: base cases (empty string to any string = length of that string), then each cell uses the minimum of (delete from s1, insert into s1, or substitute). We found the closest vocabulary word to "machin" (machine, 1 edit) and compared model names. This same algorithm underlies fuzzy matching, DNA sequence alignment, and diff tools.
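One common refinement, not shown in the notebook: the full table uses O(m×n) memory, but each row depends only on the row above it, so keeping two rows is enough. A sketch with the same recurrence:

```python
def edit_distance_2rows(s1, s2):
    """Levenshtein distance with O(len(s2)) memory instead of a full table."""
    prev = list(range(len(s2) + 1))  # row 0: transform "" into each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # column 0: delete the first i characters of s1
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1])          # no edit needed
            else:
                curr.append(1 + min(prev[j],      # deletion
                                    curr[j - 1],  # insertion
                                    prev[j - 1])) # substitution
        prev = curr
    return prev[-1]


print(edit_distance_2rows("machin", "machine"))  # 1
print(edit_distance_2rows("king", "queen"))      # 5
```

The trade-off: the two-row version returns the distance but cannot reconstruct the actual edit sequence, which requires the full table (or backtracking pointers).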

4. Capstone: Simple Text Search Index

What we're building. A tiny search engine: you give it documents, then query with a few words, and it returns the most relevant documents ranked by a score. This is the same idea that powers Google, Elasticsearch, and—crucially for AI—RAG (Retrieval-Augmented Generation). RAG systems first retrieve relevant documents from a large corpus, then feed them to an LLM as context. The retrieval step is exactly this: an index that maps words to documents, with a scoring function (here, TF-IDF) to rank by relevance.

Why it matters. Without good retrieval, RAG would either miss relevant context or drown the model in irrelevant text. The index combines hash tables (fast lookup: which documents contain "neural"?), term frequency (how often does the word appear in the doc?), and inverse document frequency (how rare is the word across all docs?). Rare, frequent-in-doc words score highest.
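Before the full index below, here is a back-of-the-envelope sketch of the scoring. The word counts are made up for illustration; the smoothed IDF matches the `log(n_docs / (1 + doc_freq))` formula used in the index:

```python
import math

n_docs = 8  # hypothetical corpus size


def tfidf(term_count, doc_length, docs_containing_term):
    tf = term_count / doc_length                          # frequent in this doc?
    idf = math.log(n_docs / (1 + docs_containing_term))   # rare across the corpus?
    return tf * idf


# "the": appears twice in a 10-word doc, but in all 8 docs -> idf = log(8/9) < 0
common = tfidf(2, 10, 8)
# "backpropagation": appears once in the same doc, but in only 1 doc -> idf = log(8/2)
rare = tfidf(1, 10, 1)

print(f"common word: {common:.3f}, rare word: {rare:.3f}")
```

The rare term wins despite appearing half as often in the document. Note that with this smoothed IDF, a word appearing in every document gets a slightly negative score, effectively down-weighting stopwords without an explicit stopword list.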

from collections import defaultdict
import math


class SearchIndex:
    """A simple inverted index for text search (TF-IDF based)."""

    def __init__(self):
        self.index = defaultdict(list)  # word -> [(doc_id, frequency)]
        self.documents = {}             # doc_id -> text
        self.doc_lengths = {}           # doc_id -> word count

    def add_document(self, doc_id, text):
        """Index a document."""
        self.documents[doc_id] = text
        words = text.lower().split()
        self.doc_lengths[doc_id] = len(words)

        word_freq = defaultdict(int)
        for word in words:
            clean = word.strip(".,!?;:'\"")
            if clean:
                word_freq[clean] += 1

        for word, freq in word_freq.items():
            self.index[word].append((doc_id, freq))

    def search(self, query, top_k=5):
        """Search documents using TF-IDF scoring."""
        query_words = query.lower().split()
        scores = defaultdict(float)
        n_docs = len(self.documents)

        for word in query_words:
            if word not in self.index:
                continue

            doc_freq = len(self.index[word])
            idf = math.log(n_docs / (1 + doc_freq))

            for doc_id, freq in self.index[word]:
                tf = freq / max(self.doc_lengths[doc_id], 1)
                scores[doc_id] += tf * idf

        ranked = sorted(scores.items(), key=lambda x: -x[1])
        return ranked[:top_k]

    def stats(self):
        return {
            "documents": len(self.documents),
            "unique_terms": len(self.index),
            "avg_doc_length": sum(self.doc_lengths.values()) / max(len(self.doc_lengths), 1),
        }


# Build a mini search engine for AI topics
index = SearchIndex()

docs = {
    "d1": "Neural networks learn patterns from training data using backpropagation",
    "d2": "Transformers use attention mechanisms for natural language processing",
    "d3": "Reinforcement learning trains agents through rewards and exploration",
    "d4": "RAG systems retrieve relevant documents to augment language model generation",
    "d5": "Fine-tuning adapts pre-trained models to specific domains and tasks",
    "d6": "Convolutional neural networks excel at image recognition and processing",
    "d7": "Large language models generate text using transformer architectures",
    "d8": "Data preprocessing is essential for training accurate machine learning models",
}

for doc_id, text in docs.items():
    index.add_document(doc_id, text)

print(f"Index stats: {index.stats()}")

queries = ["neural network training", "language models", "reinforcement rewards"]
for query in queries:
    results = index.search(query, top_k=3)
    print(f"\nQuery: '{query}'")
    for doc_id, score in results:
        print(f"  [{score:.4f}] {doc_id}: {docs[doc_id][:60]}...")

What just happened

We built a SearchIndex with an inverted index: each word maps to (doc_id, frequency) pairs. search() scores documents using TF-IDF (term frequency × inverse document frequency) and returns the top-k. Queries like "neural network training" match docs with those terms; rare terms contribute more. This is the retrieval backbone of RAG—same ideas, scaled to millions of documents.

Chapter 2 Complete!

You've mastered the fundamental data structures and algorithms used in AI:

  • Notebook 01: Arrays, Big O, linear/binary search, two-pointer, sliding window
  • Notebook 02: Stacks, queues, hash tables, merge sort, quicksort
  • Notebook 03: Trees, graphs, BFS/DFS, dynamic programming, search index

Next up: Chapter 3: Linear Algebra & Calculus - the math that powers neural networks.
