
Ch 2: Data Structures & Algorithms - Intermediate


Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-02-data-structures/notebooks/02_intermediate.ipynb in Jupyter.


Chapter 2: Data Structures & Algorithms

Notebook 02 - Intermediate

Stacks, queues, hash tables, and sorting - the workhorses of efficient code.

What you'll learn:

- Stacks and queues (LIFO/FIFO)
- Hash tables and collision handling
- Sorting algorithms: merge sort, quicksort
- Choosing the right data structure

Time estimate: 2 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

1. Stacks (LIFO)

Think of a stack of plates. You add plates on top; when you need one, you take from the top. The last plate you put on is the first one you take off—that's Last-In-First-Out (LIFO). You never grab from the middle. Stacks appear everywhere in computing: your browser's back button (the last page you visited is the first you return to), undo in a text editor (last action undone first), and the call stack when a function calls another function—the most recent call is the first to complete and return.

Used for: undo operations, backtracking, parsing expressions, DFS, call stacks in recursion.

Balanced parentheses: why it matters. Every time you write if x > 0: or layers[0].forward(), the computer must check that brackets match. Parsers for code, JSON, and config files do exactly this. A stack is perfect: push each opening (, [, {; when you see a closing one, pop and check it matches. If the stack is empty when you need to pop, or has leftovers at the end, the expression is broken.

class Stack:
    """A stack implementation using a Python list."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if self.is_empty():
            raise IndexError("Pop from empty stack")
        return self._items.pop()

    def peek(self):
        if self.is_empty():
            raise IndexError("Peek at empty stack")
        return self._items[-1]

    def is_empty(self):
        return len(self._items) == 0

    def __len__(self):
        return len(self._items)

    def __repr__(self):
        return f"Stack({self._items})"


# Practical: balanced parentheses checker
# (used in expression parsing, code validation)
def is_balanced(expression):
    """Check if brackets in an expression are balanced."""
    stack = Stack()
    pairs = {')': '(', ']': '[', '}': '{'}

    for char in expression:
        if char in '([{':
            stack.push(char)
        elif char in ')]}':
            if stack.is_empty() or stack.pop() != pairs[char]:
                return False

    return stack.is_empty()


tests = [
    "model(layers[0].forward({input: x}))",
    "loss = (pred - target) ** 2",
    "broken([)",
    "{config: {lr: 0.01, epochs: [10, 20]}}",
]

for expr in tests:
    result = 'balanced' if is_balanced(expr) else 'UNBALANCED'
    print(f"  {result:>12}: {expr}")

What just happened

We built a Stack class using a Python list (append = push, pop = pop). Then we used it to validate brackets: opening brackets go on the stack, closing brackets must match the top of the stack. The expressions model(layers[0].forward(...)) and {config: {...}} are balanced; broken([) fails because [ was never closed before ( was closed. This same logic runs inside every parser that processes nested structures.
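The undo example from the intro follows the same push/pop pattern. Here is a minimal sketch using a plain list as the stack (append = push, pop = pop); the action names are invented for illustration and not part of any real editor API.

```python
# Sketch: undo history as a stack (illustrative action names).
history = []  # a plain Python list works as a stack

def do(action):
    history.append(action)  # push: record the action

def undo():
    if not history:
        return None  # nothing left to undo
    return history.pop()  # LIFO: most recent action comes off first

do("type 'hello'")
do("bold selection")
do("type 'world'")

print(undo())  # type 'world'  (last action undone first)
print(undo())  # bold selection
```

The same shape powers the browser back button: each visited page is pushed, and "back" pops the most recent one.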

2. Queues (FIFO)

Think of a line at a store. The first person in line is the first served. New people join at the back. That's First-In-First-Out (FIFO). Queues ensure fairness: tasks submitted first get processed first. In ML, you queue training jobs, inference requests, and data batches. BFS (breadth-first search) uses a queue: explore neighbors level by level, so you find the shortest path. Message queues for distributed systems work the same way—messages are processed in order.

Watch out: stack vs queue. Both hold items and let you add/remove. The difference is which item comes out: stack = most recent (LIFO), queue = oldest (FIFO). Use a stack for undo, backtracking, or "go back." Use a queue for "process in order" or "first come, first served."
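The BFS mentioned above is covered properly in Notebook 03, but a minimal sketch shows why a queue gives shortest paths: processing oldest-first means you finish all paths of length k before trying any of length k+1. The graph here is a toy example invented for illustration.

```python
from collections import deque

# Toy graph as an adjacency dict (invented for illustration)
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

def shortest_hops(start, goal):
    """BFS: explore level by level, so the first time we reach
    the goal we have used the fewest possible edges."""
    queue = deque([(start, 0)])  # (node, distance from start)
    visited = {start}
    while queue:
        node, dist = queue.popleft()  # FIFO: oldest entry first
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # goal unreachable

print(shortest_hops("A", "F"))  # 3 hops, e.g. A -> B -> D -> F
```

Swap the `popleft()` for a `pop()` and you get DFS instead: same code shape, stack instead of queue, and no shortest-path guarantee.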

from collections import deque


class TaskQueue:
    """A priority-aware task queue for ML job scheduling."""

    def __init__(self):
        self.high = deque()
        self.normal = deque()
        self.low = deque()

    def enqueue(self, task, priority="normal"):
        q = {"high": self.high, "normal": self.normal, "low": self.low}[priority]
        q.append(task)

    def dequeue(self):
        for q in [self.high, self.normal, self.low]:
            if q:
                return q.popleft()
        return None

    @property
    def size(self):
        return len(self.high) + len(self.normal) + len(self.low)


# Simulate an ML job scheduler
scheduler = TaskQueue()
scheduler.enqueue("train_model_v2", "normal")
scheduler.enqueue("fix_data_pipeline", "high")
scheduler.enqueue("generate_report", "low")
scheduler.enqueue("retrain_urgent", "high")
scheduler.enqueue("evaluate_model", "normal")

print("Processing ML jobs:")
while scheduler.size > 0:
    job = scheduler.dequeue()
    print(f"  Running: {job}")

What just happened

We used Python's deque (double-ended queue) from collections for O(1) append and popleft. The TaskQueue has three priority levels; dequeue always takes from high first, then normal, then low. So "fix_data_pipeline" and "retrain_urgent" run before "train_model_v2" and "evaluate_model," which run before "generate_report." Real ML schedulers (e.g., Kubernetes, Ray) use similar priority queues.
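The three-deque approach works for a fixed set of priority levels; for arbitrary priorities, Python's heapq gives the same behavior with one structure. This is a sketch of the idea, not how Kubernetes or Ray implement scheduling; the tie-break counter keeps jobs with equal priority in FIFO order.

```python
import heapq
from itertools import count

# Sketch: a single heap instead of one deque per priority level.
# Lower number = higher priority; the counter breaks ties so jobs
# with equal priority come out in insertion (FIFO) order.
heap = []
order = count()

def enqueue(task, priority):
    heapq.heappush(heap, (priority, next(order), task))

def dequeue():
    if not heap:
        return None
    return heapq.heappop(heap)[2]  # drop priority and counter, return task

enqueue("train_model_v2", 1)     # normal
enqueue("fix_data_pipeline", 0)  # high
enqueue("generate_report", 2)    # low
enqueue("retrain_urgent", 0)     # high

while heap:
    print("Running:", dequeue())
```

Both high-priority jobs come out first, in the order they arrived, just like the TaskQueue above, and adding a fourth priority level needs no new code.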

3. Sorting Algorithms

Sorting is fundamental to data processing. Understanding how sorting works helps you choose the right approach for your data.

Merge sort (plain English): Imagine sorting a deck of cards by dividing it in half, sorting each half separately (by recursively dividing again until you have single cards), then merging the two sorted halves. To merge, you look at the top card of each pile and always take the smaller one. Merge sort always does O(n log n) work—predictable and stable (equal elements keep their relative order). Used when you need guaranteed performance.

Quicksort (plain English): Pick a "pivot" (e.g., the middle element). Put everything smaller left, everything larger right, pivot in the middle. Now recursively sort the left and right halves. On average, O(n log n); in the worst case (bad pivot choices), O(n²). Often faster in practice due to better cache behavior and simpler inner loops. Python's sorted() uses Timsort, a hybrid of merge sort and insertion sort optimized for real-world data.

def merge_sort(arr):
    """O(n log n) sorting - divide and conquer. Stable sort."""
    if len(arr) <= 1:
        return arr

    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])

    return merge(left, right)


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result = []
    i = j = 0

    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1

    result.extend(left[i:])
    result.extend(right[j:])
    return result


def quicksort(arr):
    """O(n log n) average, O(n^2) worst case - partition-based.
    Classic quicksort works in place; this version builds new lists for clarity."""
    if len(arr) <= 1:
        return arr

    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]

    return quicksort(left) + middle + quicksort(right)


# Demonstrate
import random
random.seed(42)
data = [random.randint(0, 100) for _ in range(15)]

print(f"Original:   {data}")
print(f"Merge sort: {merge_sort(data)}")
print(f"Quicksort:  {quicksort(data)}")
print(f"Built-in:   {sorted(data)}")

# Performance comparison
import time
large_data = [random.randint(0, 1_000_000) for _ in range(10_000)]

for name, func in [("merge_sort", merge_sort), ("quicksort", quicksort), ("sorted()", sorted)]:
    start = time.perf_counter()
    func(large_data)
    elapsed = time.perf_counter() - start
    print(f"  {name:>12}: {elapsed*1000:.2f} ms on {len(large_data):,} items")

What just happened

All three methods produce the same sorted output. The performance comparison shows milliseconds for 10,000 items. Why is Python's sorted() so much faster? It's implemented in C (not Python), uses Timsort (optimized for partially sorted data), and avoids the overhead of function calls and list operations that our pure-Python merge sort and quicksort incur. In practice, always use sorted() or .sort()—understanding the algorithms helps you reason about complexity, but the built-in is the right tool.
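A quick way to see what "stable" means in practice: sort records on one field and check that ties keep their original order. Python's sorted() is stable, like merge sort; the run names and scores below are invented for illustration.

```python
# Records: (name, score). run_a and run_c share score 0.90.
runs = [("run_a", 0.90), ("run_b", 0.85), ("run_c", 0.90), ("run_d", 0.95)]

# Sort by score only. A stable sort keeps run_a before run_c,
# because that was their relative order in the input.
by_score = sorted(runs, key=lambda r: r[1])
print(by_score)
# [('run_b', 0.85), ('run_a', 0.9), ('run_c', 0.9), ('run_d', 0.95)]
```

Stability matters when you sort in stages (e.g., by score, then by date): each later sort preserves the order established by the earlier ones for equal keys.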

4. Choosing the Right Data Structure

| Problem | Best Structure | Why |
|---|---|---|
| Store training batches in order | List | Index access, append |
| Fast lookup of labels | Dict/Set | O(1) lookup |
| Processing jobs in order | Queue (deque) | FIFO |
| Undo/redo history | Stack | LIFO |
| Unique vocabulary | Set | Auto-dedup |
| Model config | Dict | Key-value pairs |
| Fixed-shape data | Tuple | Immutable, hashable |
| Priority scheduling | Heap/PriorityQueue | Efficient min/max |

# Real-world pattern: choosing structures for an ML pipeline

from collections import Counter, defaultdict
import heapq

# Counter for word frequencies (NLP)
text = "the model predicts the label and the model learns from the data"
vocab = Counter(text.split())
print(f"Vocabulary: {dict(vocab.most_common(5))}")

# defaultdict for grouping (clustering results)
predictions = [("cat", 0.9), ("dog", 0.8), ("cat", 0.7), ("bird", 0.6), ("dog", 0.95)]
by_class = defaultdict(list)
for label, confidence in predictions:
    by_class[label].append(confidence)

print("\nGrouped predictions:")
for label, scores in by_class.items():
    print(f"  {label}: {scores} (avg: {sum(scores)/len(scores):.2f})")

# Heap for top-K selection (common in recommendation systems)
all_scores = [(0.95, "model_A"), (0.87, "model_B"), (0.92, "model_C"),
              (0.89, "model_D"), (0.91, "model_E"), (0.88, "model_F")]

top_3 = heapq.nlargest(3, all_scores)
print(f"\nTop 3 models: {[(name, f'{score:.0%}') for score, name in top_3]}")

What just happened

We used three Python structures for common ML tasks: Counter for word frequencies (NLP vocab), defaultdict to group predictions by class (e.g., for per-class metrics), and heapq.nlargest for top-K selection (recommendation systems, retrieval). Each choice matches the operation you need—counting, grouping, or "give me the K best." Try it yourself: add more words to text and see how the vocabulary changes, or change top_3 to top_5 in the heap example.

What's Next?

In Notebook 03 (Advanced), we'll cover:

- Trees and tree traversals
- Graphs and graph algorithms (BFS, DFS)
- Dynamic programming basics
- Capstone: building a simple search index



