Ch 1: Python Fundamentals for AI - Intermediate

Track: Foundation | Try code in Playground | Back to chapter overview

Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-01-python-fundamentals/notebooks/02_intermediate.ipynb in Jupyter.


Chapter 1: Python Fundamentals for AI

Notebook 02 — Intermediate

Building on the basics, this notebook dives into the data structures and functions that form the backbone of every AI codebase.

What you’ll learn:

  • Collections in depth: lists, dicts, sets, tuples
  • Functions: parameters, returns, scope, decorators intro
  • Error handling (try/except)
  • Modules and imports
  • Practical patterns used in ML codebases

Time estimate: 2.5 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini


1. Lists in Depth

Lists are Python’s most versatile collection type. Think of a list as a numbered shelf where each slot holds one item. The items can be anything — numbers, strings, even other lists — and the shelf can grow or shrink as needed.

A crucial property of lists is that they are mutable: you can change their contents after creation. This is different from strings, which are immutable (you can’t change a character inside a string; you have to create a new string). Mutability makes lists perfect for situations where data accumulates over time, like recording metrics during training or building up a batch of predictions.
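
Here is a minimal sketch of that difference: mutating a list in place works, while "changing" a string raises an error and you build a new one instead.

```python
# Lists are mutable: the same object changes in place
metrics = [0.9, 0.8]
metrics.append(0.75)      # grows the existing list
print(metrics)            # [0.9, 0.8, 0.75]

# Strings are immutable: assigning to a character raises TypeError
name = "losss"
try:
    name[3] = ""          # not allowed
except TypeError as e:
    print(f"TypeError: {e}")

fixed = name[:4]          # create a new string instead
print(fixed)              # loss
```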

In AI, you’ll use lists for: datasets, feature vectors, batches of predictions, model outputs, training histories, and much more. Let’s explore the key operations.

# Creating and manipulating lists
predictions = [0.9, 0.3, 0.7, 0.85, 0.2, 0.95, 0.6]

# Threshold to convert probabilities to binary predictions
threshold = 0.5
binary_preds = [1 if p >= threshold else 0 for p in predictions]

print(f"Probabilities: {predictions}")
print(f"Binary (t={threshold}): {binary_preds}")
print(f"Positive predictions: {sum(binary_preds)} / {len(binary_preds)}")

# Sorting
sorted_desc = sorted(predictions, reverse=True)
print(f"\nTop 3 confidence: {sorted_desc[:3]}")

# List operations
train_data = [1, 2, 3, 4, 5]
val_data = [6, 7]
all_data = train_data + val_data  # Concatenation
print(f"\nAll data: {all_data}")

# Unpacking
first, *middle, last = all_data
print(f"First: {first}, Middle: {middle}, Last: {last}")

What just happened?

This cell demonstrates several list patterns you’ll use daily:

  • Thresholding: We converted a list of probability scores into binary predictions (0 or 1). This is exactly what happens after a classification model produces outputs — you choose a threshold (commonly 0.5) and label everything above it as “positive.”
  • Sorting: sorted() returns a new sorted list without modifying the original. The reverse=True argument puts the highest values first, which is useful for finding top-K predictions.
  • Concatenation: The + operator joins two lists end-to-end. This is how you might combine training and validation data.
  • Unpacking: The * in first, *middle, last = all_data grabs everything between the first and last elements. This is a very Pythonic way to separate head, body, and tail.

Common Mistake — sort() vs sorted():
list.sort() sorts the list in place and returns None. sorted(list) returns a new sorted list and leaves the original unchanged. If you write result = my_list.sort(), result will be None — a very common bug!
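
You can see the bug directly in a toy example:

```python
scores = [0.7, 0.2, 0.9]

result = scores.sort()    # sorts in place and returns None
print(result)             # None -> the classic bug
print(scores)             # [0.2, 0.7, 0.9]  (original was modified)

original = [0.7, 0.2, 0.9]
ranked = sorted(original) # new sorted list; original untouched
print(ranked)             # [0.2, 0.7, 0.9]
print(original)           # [0.7, 0.2, 0.9]
```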

Common List Patterns in ML Code

Two functions you’ll see everywhere in ML code are zip() and enumerate(). They solve two fundamental problems:

  • zip() pairs up elements from multiple lists, like matching features with their importance scores or inputs with their labels.
  • enumerate() gives you both the index and the value as you loop, which is essential for batch processing where you need to know which batch you’re on.
# Common list patterns in ML code

# zip: pair up parallel lists (features with labels)
features = ["size", "color", "weight", "shape"]
importances = [0.35, 0.15, 0.30, 0.20]

print("Feature Importances:")
for feature, importance in sorted(zip(features, importances), key=lambda x: -x[1]):
    bar = '#' * int(importance * 40)
    print(f"  {feature:>8}: {importance:.0%} {bar}")

# enumerate: index + value (for batch processing)
batches = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
print("\nProcessing batches:")
for batch_idx, batch in enumerate(batches):
    batch_mean = sum(batch) / len(batch)
    print(f"  Batch {batch_idx}: {batch} -> mean={batch_mean:.0f}")

What just happened?

zip(features, importances) created pairs like ("size", 0.35), ("color", 0.15), etc. We then sorted those pairs by importance (descending) and printed a visual bar chart. This is a common way to display feature importances after training a model — it tells you which features the model relies on most.

The enumerate example simulates processing data in batches. Each batch is a list of values, and enumerate tells us which batch number we’re on. In real training loops, you’d compute the loss for each batch and update the model weights.

✍️ Try it yourself

Change the threshold in the first code cell of this section to 0.7. How does the number of positive predictions change? In real AI work, choosing the right threshold is an important decision that trades off precision (avoiding false positives) against recall (catching all true positives).


2. Dictionaries

A dictionary maps keys to values, just like a real dictionary maps words to their definitions. You look up a key, and Python instantly returns the associated value. Under the hood, dictionaries use a data structure called a hash table, which makes lookups extremely fast — no matter how many items are in the dictionary.

In AI, dictionaries are everywhere:

  • Model configurations: mapping parameter names to their values (learning rate, batch size, etc.).
  • Training histories: mapping metric names to lists of values across epochs.
  • JSON data: almost every API response and config file is a dictionary.
  • Vocabulary maps: mapping words to integer IDs for NLP models.

The syntax is {key: value, key: value, ...}. Keys must be immutable (strings and numbers are common), but values can be anything.

# Model configuration (very common pattern)
config = {
    "model_name": "transformer-base",
    "hidden_size": 768,
    "num_layers": 12,
    "num_heads": 12,
    "learning_rate": 3e-4,
    "batch_size": 32,
    "max_epochs": 100,
    "dropout": 0.1,
    "optimizer": "adam",
}

print("Model Configuration:")
for key, value in config.items():
    print(f"  {key:>20}: {value}")

# Safe access with .get() (avoids KeyError)
weight_decay = config.get("weight_decay", 0.01)  # Default if missing
print(f"\nWeight decay: {weight_decay} (default used)")

# Update configuration
config.update({"learning_rate": 1e-4, "warmup_steps": 1000})
print(f"Updated LR: {config['learning_rate']}")
print(f"Warmup steps: {config['warmup_steps']}")

What just happened?

We created a dictionary that stores a model’s configuration. Let’s understand what each parameter means in the context of a real transformer model:

  • hidden_size (768): the size of the internal representation vectors. Larger = more capacity but slower.
  • num_layers (12): how many transformer blocks are stacked. More layers = deeper understanding but more compute.
  • num_heads (12): how many attention mechanisms run in parallel. Lets the model focus on different aspects of the input simultaneously.
  • learning_rate (3e-4): how big each weight update step is. Too large = unstable training; too small = very slow training.
  • batch_size (32): how many samples are processed together. Larger = more stable gradients but needs more memory.
  • dropout (0.1): randomly turns off 10% of neurons during training to prevent overfitting.

Two important dictionary methods:

  • .get(key, default) returns the value for key if it exists, or default if it doesn’t. This is safer than config[key], which raises a KeyError if the key is missing.
  • .update(other_dict) merges another dictionary into the current one, overwriting values for existing keys and adding new keys.

Common Mistake — config[key] vs config.get(key):
Using config["weight_decay"] when that key doesn’t exist will crash your program. Always use .get() with a sensible default for optional parameters.
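
A quick sketch of both access styles on a small config:

```python
config = {"learning_rate": 3e-4, "batch_size": 32}

# Bracket access on a missing key raises KeyError
try:
    wd = config["weight_decay"]
except KeyError as e:
    print(f"KeyError: {e}")

# .get() with a default never raises
wd = config.get("weight_decay", 0.01)
print(wd)                 # 0.01
```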

Tracking Training Metrics

A very common pattern in ML is to use a dictionary where each key maps to a list that grows over time. During training, you append the loss and accuracy from each epoch to the corresponding list. At the end, you have a complete history you can plot or analyze.

This is exactly how libraries like Keras and PyTorch Lightning store training histories internally.

# Tracking training metrics (dictionary of lists)
history = {
    "train_loss": [],
    "val_loss": [],
    "train_acc": [],
    "val_acc": [],
}

import random
random.seed(42)

for epoch in range(1, 11):
    t_loss = 1.0 / (epoch * 0.5 + 0.3) + random.uniform(-0.02, 0.02)
    v_loss = t_loss + random.uniform(0.01, 0.08)
    t_acc = 1 - t_loss * 0.4 + random.uniform(-0.01, 0.01)
    v_acc = t_acc - random.uniform(0.01, 0.05)

    history["train_loss"].append(round(t_loss, 4))
    history["val_loss"].append(round(v_loss, 4))
    history["train_acc"].append(round(t_acc, 4))
    history["val_acc"].append(round(v_acc, 4))

print(f"{'Epoch':>6} {'T.Loss':>8} {'V.Loss':>8} {'T.Acc':>8} {'V.Acc':>8}")
print("-" * 42)
for i in range(len(history["train_loss"])):
    print(f"{i+1:>6} {history['train_loss'][i]:>8.4f} {history['val_loss'][i]:>8.4f} "
          f"{history['train_acc'][i]:>8.4f} {history['val_acc'][i]:>8.4f}")

best_epoch = history["val_loss"].index(min(history["val_loss"])) + 1
print(f"\nBest epoch by val_loss: {best_epoch}")

What just happened?

We simulated 10 epochs of training. In each epoch, we computed fake training and validation metrics and appended them to the corresponding lists in our history dictionary. The table shows how training loss goes down and accuracy goes up over time — exactly what you want to see in a real training run.

Notice that the validation loss (V.Loss) is always slightly higher than the training loss (T.Loss). This is normal and expected: the model is always a little better on data it has seen (training) than data it hasn’t (validation). If the gap grows too large, it signals overfitting.

The final line finds the epoch with the lowest validation loss using .index(min(...)). This tells us which checkpoint to keep.

✍️ Try it yourself

Can you find the epoch with the highest validation accuracy? Hint: use history["val_acc"].index(max(history["val_acc"])) + 1. Is it the same epoch as the one with the best validation loss?


3. Sets and Tuples

Python has two more collection types that round out your toolkit:

Sets hold unique elements with no duplicates and no ordering. Think of a set like a bag of marbles where each marble is a different color — if you try to add a second red marble, the bag just ignores it. Sets are great for vocabulary building, deduplication, and fast membership testing (“Is this word in my vocabulary?”).

Tuples are like lists, but immutable — once created, they cannot be changed. Think of a tuple as a sealed envelope: you can read what’s inside, but you can’t add, remove, or modify anything. This makes tuples perfect for data that should never change, like image dimensions (224, 224, 3) or coordinates. Immutability also means tuples can be used as dictionary keys, which lists cannot.
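
Because tuples are hashable, they can serve as dictionary keys, which lists cannot. A small sketch (the shape-to-description mapping here is made up for illustration):

```python
# Tuples can key a dictionary because they are immutable (hashable)
shape_notes = {
    (224, 224, 3): "ImageNet-style RGB input",
    (28, 28, 1): "MNIST-style grayscale input",
}
print(shape_notes[(28, 28, 1)])

# Lists are mutable, so Python refuses them as keys
try:
    shape_notes[[224, 224, 3]] = "oops"
except TypeError as e:
    print(f"TypeError: {e}")
```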

# Sets: unique elements, fast membership testing
train_words = {"the", "cat", "sat", "on", "the", "mat"}
test_words = {"the", "dog", "ran", "on", "the", "mat"}

print(f"Train vocabulary: {train_words}")
print(f"Test vocabulary:  {test_words}")
print(f"Shared (intersection): {train_words & test_words}")
print(f"All (union):           {train_words | test_words}")
print(f"Only in test (new):    {test_words - train_words}")

# Practical: find out-of-vocabulary words
known_vocab = {"hello", "world", "machine", "learning", "deep", "neural"}
input_text = "deep machine learning is neural magic"
input_words = set(input_text.split())
oov = input_words - known_vocab
print(f"\nOut-of-vocabulary words: {oov}")

What just happened?

Notice that even though we wrote "the" twice in train_words, the set only contains it once — sets automatically deduplicate. The set operations are powerful:

  • & (intersection) finds words that appear in both sets.
  • | (union) combines all words from both sets.
  • - (difference) finds words in one set but not the other.

The out-of-vocabulary (OOV) example is directly relevant to NLP. When your model encounters a word it wasn’t trained on, it needs a strategy to handle it (like using a special [UNK] token). Finding OOV words before training helps you decide how to handle them.
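
A minimal sketch of the [UNK] strategy, using a made-up vocab dictionary and a hypothetical encode helper:

```python
# Hypothetical word-to-ID map; ID 0 is reserved for the unknown token
vocab = {"[UNK]": 0, "deep": 1, "machine": 2, "learning": 3, "neural": 4}
UNK_ID = vocab["[UNK]"]

def encode(text):
    """Map each word to its ID, falling back to [UNK] for OOV words."""
    return [vocab.get(word, UNK_ID) for word in text.split()]

print(encode("deep machine learning is neural magic"))
# "is" and "magic" are OOV, so they map to 0
```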

# Tuples: immutable, used for fixed data
image_shape = (224, 224, 3)   # height, width, channels
batch_shape = (32, *image_shape)  # batch_size + image_shape

print(f"Image shape: {image_shape}")
print(f"Batch shape: {batch_shape}")

# Tuple unpacking (very Pythonic)
height, width, channels = image_shape
print(f"Image: {width}x{height} with {channels} channels")

# Named tuples for clarity
from collections import namedtuple

ModelResult = namedtuple("ModelResult", ["accuracy", "loss", "f1_score"])
result = ModelResult(accuracy=0.94, loss=0.18, f1_score=0.91)

print(f"\nModel Result:")
print(f"  Accuracy: {result.accuracy:.0%}")
print(f"  Loss:     {result.loss}")
print(f"  F1:       {result.f1_score}")

What just happened?

image_shape = (224, 224, 3) is a tuple representing the dimensions of an image: 224 pixels tall, 224 pixels wide, and 3 color channels (red, green, blue). The * in (32, *image_shape) “unpacks” the tuple, so batch_shape becomes (32, 224, 224, 3) — a batch of 32 images.

Tuple unpacking (height, width, channels = image_shape) assigns each element to a separate variable in one line. This is cleaner than writing height = image_shape[0], width = image_shape[1], etc.

Named tuples are a step up from regular tuples. Instead of accessing elements by index (result[0]), you access them by name (result.accuracy). This makes code much more readable and self-documenting. They’re perfect for returning multiple values from a function.

When to use what?

  • List: ordered, mutable, for data that changes (training data, predictions, histories)
  • Tuple: ordered, immutable, for data that shouldn’t change (shapes, coordinates, configs)
  • Set: unordered, unique, for membership testing and deduplication (vocabularies, labels)
  • Dictionary: key-value mapping, for named access (configs, results, JSON data)

✍️ Try it yourself

Try modifying the tuple: image_shape[0] = 256. What happens? You’ll get a TypeError because tuples are immutable. Now try the same with a list — shape_list = [224, 224, 3]; shape_list[0] = 256 — and notice that it works fine.


4. Functions

Functions are reusable recipes. Just like a cooking recipe takes ingredients (inputs), follows a set of steps, and produces a dish (output), a Python function takes arguments, executes code, and returns a result. The power of functions is that you write the recipe once and use it as many times as you want with different ingredients.

In AI projects, you’ll write functions for data loading, preprocessing, model building, training steps, evaluation, and visualization. Good functions are:

  1. Focused: each function does one thing well.
  2. Documented: a docstring explains what the function does, what it takes, and what it returns.
  3. Reusable: you can call it from different parts of your code without rewriting logic.

Here’s the basic anatomy:

def function_name(required_arg, optional_arg=default_value):
    """Docstring: explain what this function does."""
    # ... do work ...
    return result

Why do docstrings matter? In large AI codebases with hundreds of functions, docstrings are how you (and your teammates) remember what each function does without reading its implementation. Tools like Jupyter can display docstrings when you type function_name?.

def train_step(model_weights, data_batch, learning_rate=0.01):
    """Simulate a single training step.

    Args:
        model_weights: Current model weights (list of floats)
        data_batch: Training data batch (list of values)
        learning_rate: Step size for weight update

    Returns:
        tuple: (updated_weights, loss)
    """
    predictions = [w * d for w, d in zip(model_weights, data_batch)]
    loss = sum((p - 1.0) ** 2 for p in predictions) / len(predictions)

    gradients = [2 * (p - 1.0) * d for p, d in zip(predictions, data_batch)]
    updated_weights = [w - learning_rate * g for w, g in zip(model_weights, gradients)]

    return updated_weights, loss


# Run a mini training loop
weights = [0.5, 0.3, 0.8]
data = [1.0, 2.0, 0.5]

print(f"Initial weights: {weights}")
for step in range(5):
    weights, loss = train_step(weights, data)
    print(f"Step {step + 1}: loss={loss:.4f}, weights={[f'{w:.4f}' for w in weights]}")

What just happened?

We defined a function called train_step that simulates one step of gradient descent. Let’s break down the key concepts:

  1. Parameters: model_weights and data_batch are required; learning_rate=0.01 has a default value, making it optional.
  2. Return value: The function returns a tuple of (updated_weights, loss). We unpack it with weights, loss = train_step(...). This pattern of returning multiple values in a tuple is extremely common in ML code.
  3. Scope: Variables created inside the function (like predictions, gradients) exist only within that function. They don’t leak into the rest of your code, which prevents accidental conflicts. This is called local scope.

Watch the loss decrease from step to step — the weights are being adjusted to produce better predictions. This is the fundamental loop of all AI training: predict, measure error, adjust, repeat.
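
Local scope is easy to verify: a variable created inside a function simply doesn’t exist outside it. A tiny sketch with a made-up loss helper:

```python
def compute_loss(preds, targets):
    # 'errors' exists only inside this function (local scope)
    errors = [(p - t) ** 2 for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)

loss = compute_loss([0.9, 0.2], [1.0, 0.0])
print(f"loss={loss:.3f}")     # the returned value escapes; the local doesn't

try:
    print(errors)             # NameError: not defined out here
except NameError as e:
    print(f"NameError: {e}")
```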

*args and **kwargs: Flexible Function Signatures

Sometimes you don’t know in advance exactly which parameters a function will receive. Python provides two special mechanisms for this:

  • *args collects any number of positional arguments into a tuple. Think of it as “give me everything else that was passed in.”
  • **kwargs collects any number of keyword arguments into a dictionary. Think of it as “give me all the named options.”

This flexibility is essential in AI frameworks. For example, when creating a neural network layer, there are many optional parameters (activation function, dropout rate, whether to use bias). Using **kwargs lets you handle any combination without writing dozens of separate parameters.

# *args and **kwargs: flexible function signatures

def create_layer(input_size, output_size, **kwargs):
    """Create a layer config with optional parameters."""
    layer = {
        "input_size": input_size,
        "output_size": output_size,
        "activation": kwargs.get("activation", "relu"),
        "dropout": kwargs.get("dropout", 0.0),
        "bias": kwargs.get("bias", True),
    }
    return layer

# Flexible usage patterns
layer1 = create_layer(768, 256)
layer2 = create_layer(256, 128, activation="gelu", dropout=0.1)
layer3 = create_layer(128, 10, activation="softmax", bias=False)

for i, layer in enumerate([layer1, layer2, layer3], 1):
    print(f"Layer {i}: {layer['input_size']} -> {layer['output_size']} "
          f"({layer['activation']}, dropout={layer['dropout']})")

What just happened?

The function create_layer requires two arguments (input_size and output_size) but accepts any number of optional keyword arguments via **kwargs. Inside the function, kwargs is just a regular dictionary, so we use .get() to extract values with sensible defaults.

Look at the three different calls:

  • layer1 uses only the required arguments; all optional parameters get their defaults.
  • layer2 specifies activation and dropout; other options stay at defaults.
  • layer3 changes activation and bias: a completely different combination.

This is exactly how real frameworks like PyTorch’s nn.Linear work: you specify the essentials and override defaults only when needed.
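
The cell above covers **kwargs; here is the *args half of the story, sketched with a hypothetical averaging helper:

```python
# *args collects any number of positional arguments into a tuple
def average_metrics(*values):
    """Average however many metric values were passed in (toy helper)."""
    return sum(values) / len(values)

print(average_metrics(0.91, 0.93))               # called with 2 args
print(average_metrics(0.88, 0.90, 0.92, 0.94))   # called with 4 args

# The * also works in reverse: unpack an existing sequence into the call
fold_scores = (0.85, 0.87, 0.89)
print(average_metrics(*fold_scores))
```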

Lambda Functions and Higher-Order Functions

A lambda is a tiny, anonymous function written in a single line. It’s useful when you need a quick throwaway function, especially for sorting or filtering. A higher-order function is a function that takes another function as an argument — sorted(), filter(), and map() are the most common examples.

Think of it this way: sorted() knows how to sort, but it needs you to tell it what to sort by. That’s what the key argument is for — you pass in a small function that extracts the comparison value.

# Lambda functions and higher-order functions
# Compact functions for transformations and sorting

results = [
    {"model": "A", "accuracy": 0.92, "latency_ms": 45},
    {"model": "B", "accuracy": 0.95, "latency_ms": 120},
    {"model": "C", "accuracy": 0.89, "latency_ms": 15},
    {"model": "D", "accuracy": 0.93, "latency_ms": 80},
]

# Sort by accuracy (descending)
by_accuracy = sorted(results, key=lambda r: r["accuracy"], reverse=True)
print("Ranked by accuracy:")
for r in by_accuracy:
    print(f"  {r['model']}: {r['accuracy']:.0%} ({r['latency_ms']}ms)")

# Filter: only fast models
fast_models = list(filter(lambda r: r["latency_ms"] < 50, results))
print(f"\nFast models (<50ms): {[r['model'] for r in fast_models]}")

# Map: extract just scores
accuracies = list(map(lambda r: r["accuracy"], results))
print(f"All accuracies: {accuracies}")
print(f"Mean accuracy: {sum(accuracies) / len(accuracies):.2%}")

What just happened?

  • sorted(..., key=lambda r: r["accuracy"]): “Sort this list of dictionaries by the accuracy value inside each one.” The lambda is a tiny function that takes a dictionary r and returns r["accuracy"].
  • filter(lambda r: r["latency_ms"] < 50, results): “Keep only the results where latency is under 50ms.” This returns an iterator, so we wrap it in list().
  • map(lambda r: r["accuracy"], results): “Extract the accuracy from each result.” Again, returns an iterator.

In practice, many Python developers prefer list comprehensions over filter() and map() for readability: [r["accuracy"] for r in results] does the same thing as the map call above. Use whichever feels clearer to you.
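
The equivalence is easy to check side by side:

```python
results = [
    {"model": "A", "accuracy": 0.92, "latency_ms": 45},
    {"model": "B", "accuracy": 0.95, "latency_ms": 120},
    {"model": "C", "accuracy": 0.89, "latency_ms": 15},
]

# map/filter style: two nested higher-order calls
fast_map = list(map(lambda r: r["model"],
                    filter(lambda r: r["latency_ms"] < 50, results)))

# comprehension style: condition and transform in one expression
fast_comp = [r["model"] for r in results if r["latency_ms"] < 50]

print(fast_map, fast_comp)    # both produce ['A', 'C']
```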

✍️ Try it yourself

Add a fifth model to the results list with your own accuracy and latency values. Then sort the results by latency (ascending) instead of accuracy. Which model is the fastest?


5. Error Handling

Errors will happen. Files go missing. APIs time out. Data arrives in unexpected formats. A user passes in the wrong type of argument. A good program doesn’t crash when things go wrong — it handles errors gracefully, logs a useful message, and either recovers or fails with dignity.

Python uses the try/except pattern for error handling:

try:
    ...  # code that might fail
except SomeError:
    ...  # what to do if it fails

Think of it like a safety net under a tightrope walker. The try block is the tightrope. If the walker falls (an error occurs), the except block catches them instead of letting them hit the ground (your program crashing).

Here are the most common exception types you’ll encounter in AI code:

  • ZeroDivisionError: dividing by zero (e.g., computing accuracy on an empty dataset)
  • TypeError: using the wrong type (e.g., adding a string to a number)
  • ValueError: right type but wrong value (e.g., int("hello"))
  • KeyError: accessing a missing dictionary key
  • FileNotFoundError: opening a file that doesn’t exist
  • IndexError: accessing a list index that’s out of range

def safe_divide(a, b):
    """Divide a by b, handling edge cases gracefully."""
    try:
        result = a / b
    except ZeroDivisionError:
        print(f"  Warning: division by zero ({a}/{b}), returning 0")
        return 0.0
    except TypeError as e:
        print(f"  Error: invalid types - {e}")
        return None
    return result

print("Safe division tests:")
print(f"  10 / 3 = {safe_divide(10, 3):.4f}")
print(f"  10 / 0 = {safe_divide(10, 0)}")
print(f"  'a' / 2 = {safe_divide('a', 2)}")

What just happened?

We tested three scenarios:

  1. Normal case (10 / 3): The try block succeeds, no exceptions are raised, and the result is returned normally.
  2. Division by zero (10 / 0): A ZeroDivisionError is caught, we print a warning, and return 0.0 as a safe fallback.
  3. Wrong type ('a' / 2): A TypeError is caught because you can’t divide a string by a number.

Notice that each except block handles a specific exception type. This is important because different errors may need different responses. Catching a generic Exception should be a last resort — it’s better to handle specific cases explicitly.

Real-World Example: Loading Configuration Files

In AI projects, you frequently load configuration from JSON files. Many things can go wrong: the file might not exist, it might contain invalid JSON, or you might not have permission to read it. Good error handling makes your program robust against all of these.

def load_config(filepath):
    """Load a config file with proper error handling."""
    import json

    try:
        with open(filepath, 'r') as f:
            config = json.load(f)
        print(f"Config loaded from {filepath}")
        return config
    except FileNotFoundError:
        print(f"Config file not found: {filepath}")
        print("Using default configuration.")
        return {"model": "default", "epochs": 10, "lr": 0.001}
    except json.JSONDecodeError as e:
        print(f"Invalid JSON in {filepath}: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {type(e).__name__}: {e}")
        return None


config = load_config("nonexistent_config.json")
print(f"Config: {config}")

What just happened?

Since the file nonexistent_config.json doesn’t exist, the FileNotFoundError handler kicked in and returned a default configuration instead of crashing. This is a very practical pattern: try to load the user’s config, but fall back to sensible defaults if it’s missing.

The exception handlers are ordered from most specific to most general. Python checks them top to bottom and uses the first one that matches. The generic Exception at the bottom is a catch-all for anything unexpected.
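
A toy parse function makes the ordering concrete; if the generic handler came first, it would shadow the specific one and the targeted branch would never run:

```python
import json

def parse(text):
    """Parse JSON text, distinguishing bad input from surprises (toy example)."""
    try:
        return json.loads(text)
    # Specific handler first: malformed JSON gets a targeted response
    except json.JSONDecodeError:
        return "bad json"
    # Generic catch-all last; anything unexpected lands here
    except Exception:
        return "unexpected"

print(parse('{"lr": 0.001}'))   # valid JSON -> the parsed dict
print(parse("not json"))        # bad json
```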

Common Mistake — Bare except:
Never write except: without specifying an exception type. A bare except catches everything, including keyboard interrupts and system exits, making it impossible to stop your program. Always catch specific exception types.

✍️ Try it yourself

Write a function safe_parse_int(value) that tries to convert value to an integer using int(). If it fails (e.g., value is "hello"), catch the ValueError and return None instead. Test it with "42", "hello", and 3.14.


6. Modules and Imports

Python’s power comes from its ecosystem. The language itself is relatively simple, but the thousands of libraries (called modules or packages) built on top of it are what make Python the language of choice for AI. NumPy for numerical computing, Pandas for data manipulation, PyTorch for deep learning, scikit-learn for classical ML — all of these are modules you import into your code.

Python also comes with a rich standard library — modules that are included with every Python installation, no extra installation needed. These cover file handling, math, date/time, random number generation, data serialization, and much more.

The import statement brings a module into your code:

import math                    # Import the whole module
from collections import Counter # Import one specific thing
import numpy as np              # Import with an alias (shortcut)

# Standard library modules you'll use constantly
import os
import sys
import json
import math
import random
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime

# os and pathlib: file system operations
current_dir = Path.cwd()
print(f"Current directory: {current_dir}")
print(f"Python version: {sys.version.split()[0]}")

# math: mathematical operations
print(f"\nlog2(1024) = {math.log2(1024)}")
print(f"sqrt(144) = {math.sqrt(144)}")
print(f"e = {math.e:.6f}")
print(f"pi = {math.pi:.6f}")

# Counter: frequency counting (NLP essential)
words = "the cat sat on the mat the cat sat".split()
word_counts = Counter(words)
print(f"\nWord frequencies: {dict(word_counts.most_common(3))}")

# defaultdict: auto-initialize dictionary values
category_items = defaultdict(list)
data = [("fruit", "apple"), ("veggie", "carrot"), ("fruit", "banana"), ("veggie", "pea")]
for category, item in data:
    category_items[category].append(item)
print(f"Categories: {dict(category_items)}")

What just happened?

We imported several standard library modules and demonstrated their most common uses:

  • os and pathlib: File system operations. Path.cwd() gives the current working directory. You’ll use these to find dataset files, create output directories, and manage model checkpoints.
  • math: Mathematical functions like log2, sqrt, and constants like e and pi. These come up in loss function calculations and information theory.
  • Counter: A dictionary subclass that counts things automatically. Pass it a list and it returns a dictionary of {item: count} pairs. This is invaluable for NLP (word frequency), data exploration (label distribution), and debugging (checking for class imbalance).
  • defaultdict: A dictionary that automatically creates a default value for any key you access. defaultdict(list) creates an empty list whenever you access a new key, so you can .append() without checking if the key exists first.

These standard library tools replace dozens of lines of manual code. Learning them is a high-leverage investment.

✍️ Try it yourself

Use Counter to find the most common characters (not words) in the string "machine learning is amazing". Hint: a string is already an iterable of characters, so you can pass it directly to Counter.


7. Practical Pattern: Data Pipeline

Let’s combine everything we’ve learned into a practical data processing pipeline — the kind you’ll build in every ML project. The typical pipeline has three stages:

  1. Generate or load data: Get raw data from files, databases, or APIs.
  2. Split into train/validation/test: Divide the data so you can train on one portion and evaluate on another.
  3. Compute statistics: Understand the distribution of your data before training.

We’ll use synthetic data here (randomly generated), but the structure is identical to what you’d use with real data. Each function is self-contained and does one thing well — this modular design makes the pipeline easy to test, debug, and extend.

import random
from collections import Counter


def generate_dataset(n_samples=100, seed=42):
    """Generate a synthetic classification dataset."""
    random.seed(seed)
    dataset = []
    for _ in range(n_samples):
        features = {
            "age": random.randint(18, 80),
            "income": random.randint(20000, 150000),
            "education_years": random.randint(8, 22),
        }
        score = (features["income"] / 150000 * 0.5 +
                 features["education_years"] / 22 * 0.3 +
                 (1 - abs(features["age"] - 45) / 35) * 0.2)
        label = 1 if score + random.uniform(-0.15, 0.15) > 0.55 else 0
        dataset.append({**features, "label": label})
    return dataset


def split_dataset(data, train_ratio=0.7, val_ratio=0.15):
    """Split data into train/val/test sets."""
    shuffled = data.copy()
    random.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


def compute_stats(data):
    """Compute summary statistics for a dataset."""
    labels = [d["label"] for d in data]
    label_dist = Counter(labels)

    stats = {}
    for key in ["age", "income", "education_years"]:
        values = [d[key] for d in data]
        stats[key] = {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
    return {"label_distribution": dict(label_dist), "feature_stats": stats}


# Build the pipeline
dataset = generate_dataset(200)
train, val, test = split_dataset(dataset)

print(f"Dataset: {len(dataset)} samples")
print(f"Train: {len(train)} | Val: {len(val)} | Test: {len(test)}")
print()

for name, split in [("Train", train), ("Val", val), ("Test", test)]:
    stats = compute_stats(split)
    dist = stats["label_distribution"]
    print(f"{name:>5} set: {dist.get(0, 0)} negative, {dist.get(1, 0)} positive "
          f"({dist.get(1, 0) / len(split):.0%} positive rate)")

What just happened?

We built a complete data pipeline in three functions:

  1. generate_dataset() creates synthetic data with three features (age, income, education_years) and a binary label. The label is determined by a scoring formula with some random noise added. random.seed(42) ensures reproducibility — running this again will produce the exact same dataset.

  2. split_dataset() shuffles the data and divides it into three parts: 70% for training, 15% for validation, and 15% for testing. The validation set is used during training to monitor for overfitting. The test set is held out until the very end to get an unbiased estimate of model performance. We use .copy() before shuffling to avoid modifying the original data.

  3. compute_stats() calculates summary statistics for each split. The positive rate (percentage of positive labels) is especially important — if your splits have very different positive rates, your evaluation could be misleading. Ideally, all three splits should have similar distributions.

This pipeline structure — load, split, analyze — is the starting point of virtually every ML project.

✍️ Try it yourself

Change the train_ratio to 0.8 and val_ratio to 0.1. How do the split sizes change? In practice, the right split ratio depends on your dataset size: smaller datasets need a larger training portion to learn effectively.


What’s Next?

In Notebook 03 (Advanced), we’ll cover:

  • Object-Oriented Programming (classes, inheritance)
  • File I/O (CSV, JSON, text processing)
  • Generators and iterators
  • Decorators
  • A capstone mini-project bringing it all together


