
Ch 1: Python Fundamentals for AI - Introduction

Track: Foundation

Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-01-python-fundamentals/notebooks/01_introduction.ipynb in Jupyter.


Chapter 1: Python Fundamentals for AI

Notebook 01 — Introduction

Welcome to the first chapter of the Berta Chapters curriculum! This notebook will take you from zero to confident Python basics.

What you’ll learn:

  • Variables and data types
  • Operators and expressions
  • Strings and string manipulation
  • Basic input/output
  • Control flow (if/elif/else)
  • Loops (for and while)
  • List comprehensions

Time estimate: 2.5 hours


Generated by Berta AI | Created by Luigi Pascal Rondanini

0. Welcome & How to Use This Notebook

You are reading a Jupyter Notebook — an interactive document that mixes text (like the paragraph you’re reading now) with runnable code. Think of it as a lab notebook for programmers: you can read an explanation, run a small experiment, see the result, and then move on. Notebooks are the standard working environment for data scientists and AI researchers because they let you explore ideas step by step instead of running an entire program at once.

Each grey box below is called a cell. Text cells (like this one) explain concepts. Code cells contain Python you can execute. To run a code cell, click on it and press Shift + Enter. The output will appear directly below the cell. You can also re-run cells, change the code, and experiment freely — nothing you do here is permanent, and breaking things is a great way to learn!

Tip: Work through this notebook from top to bottom. Later cells sometimes depend on variables created in earlier cells, so running them out of order may cause errors.


1. Variables and Data Types

Think of a variable as a labeled container — like a box with a name written on the outside. When you write learning_rate = 0.001, you’re telling Python: “Create a container called learning_rate and put the number 0.001 inside it.” Later, whenever you refer to learning_rate, Python looks inside that box and uses whatever value it finds. You can also swap out the contents at any time, which is why we call it a variable — its value can vary.

Python is dynamically typed, which means you never have to declare “this variable is an integer” or “this variable is a string.” Python figures it out automatically from the value you assign. This keeps your code short and readable, but it also means you need to be mindful of what’s actually inside each container.

In AI work, you will use five core data types almost every day:

| Type  | What it holds | AI example                                         |
|-------|---------------|----------------------------------------------------|
| int   | Whole numbers | Number of training epochs, batch size, layer count |
| float | Decimal numbers | Learning rate, model weights, loss values        |
| str   | Text          | Model names, file paths, prompts, labels           |
| bool  | True or False | Feature flags like use_gpu, condition results      |
| None  | “Nothing yet” | Placeholder before a model is loaded               |

Let’s see all five in action:

# Integers - whole numbers, used for counts, indices, epochs
num_epochs = 10
batch_size = 32

# Floats - decimal numbers, used for weights, losses, learning rates
learning_rate = 0.001
accuracy = 0.95

# Strings - text, used for labels, file paths, model names
model_name = "gpt-4"
dataset_path = "data/training.csv"

# Booleans - True/False, used for flags, conditions
is_training = True
use_gpu = False

# None - represents absence of value
best_model = None

print(f"Training {model_name} for {num_epochs} epochs")
print(f"Learning rate: {learning_rate}")
print(f"GPU enabled: {use_gpu}")
print(f"Type of learning_rate: {type(learning_rate).__name__}")

What just happened?

We created seven variables, each holding a different type of data. Notice how Python didn’t ask us to say int num_epochs = 10 — it simply looked at the value 10 and figured out it was an integer on its own. The print statements at the end demonstrate that we can mix text and variables together using f-strings (we’ll cover those in depth in Section 3).

Also notice best_model = None. This is a very common pattern in AI code: you create a placeholder variable before you have a real value for it. Later, after training finishes, you might write best_model = trained_model to fill that container.
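The standard way to test such a placeholder is with `is None` rather than `== None`. A minimal sketch of the pattern (the checkpoint name here is made up for illustration):

```python
# Placeholder pattern: start with None, fill in later
best_model = None

if best_model is None:                 # use `is`, not `==`, when testing for None
    print("No model trained yet")

best_model = "checkpoint_epoch_05"     # pretend training just finished
if best_model is not None:
    print(f"Best model so far: {best_model}")
```

`is None` checks identity with the single `None` object, which is both faster and less error-prone than an equality comparison.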

Common Mistake — = vs ==:
A single equals sign = means “assign this value.” A double equals sign == means “check if these are equal.” Mixing them up is one of the most common beginner errors:

x = 5      # Assignment: put 5 into x
x == 5     # Comparison: is x equal to 5? (returns True)
If you accidentally write if x = 5: instead of if x == 5:, Python will raise a SyntaxError.

✍️ Try it yourself

In the cell above, try changing model_name to your favorite AI model and num_epochs to a different number, then re-run the cell. What changes in the output?


Type Checking and Conversion

Data from files and APIs almost always arrives as strings — even when it represents a number. Imagine you read the text "100" from a CSV file. To Python, that’s just five characters, not the number one hundred. Before you can do math with it, you need to convert (or cast) it to an int or a float. This is a daily task in AI data pipelines: loading raw data and converting it into the types your model needs.

Python gives you built-in functions for this: int(), float(), str(), and bool(). But watch out: conversion has quirks. The most common surprise is that int() truncates (chops off the decimal) rather than rounding, so int(3.7) gives you 3, not 4. If you want rounding, use round() instead (with one nuance: Python rounds exact halves to the nearest even number, so round(2.5) is 2).

# Type checking
value = 42.0
print(f"Is float? {isinstance(value, float)}")
print(f"Is number? {isinstance(value, (int, float))}")

# Type conversion (casting)
raw_input = "100"          # Data often comes as strings
num_samples = int(raw_input)  # Convert to int for math
ratio = float("0.75")        # Convert to float

print(f"\nSamples: {num_samples} (type: {type(num_samples).__name__})")
print(f"Ratio: {ratio} (type: {type(ratio).__name__})")

# Be careful with conversions
print(f"\nint(3.7) = {int(3.7)}")     # Truncates, doesn't round
print(f"round(3.7) = {round(3.7)}")   # Rounds properly
print(f"bool(0) = {bool(0)}")         # 0 is falsy
print(f"bool(1) = {bool(1)}")         # Non-zero is truthy
print(f"bool('') = {bool('')}")       # Empty string is falsy

What just happened?

We used isinstance() to ask Python “Is this value a float?” without changing the value itself. This is handy for defensive coding — checking that your data is the right type before you operate on it.

Then we converted the string "100" to the integer 100 and the string "0.75" to the float 0.75. Notice how the type changed (shown in the output), but the logical meaning stayed the same.

The last few lines highlight Python’s concept of truthiness: 0, 0.0, empty strings "", None, and empty collections are all considered False in a boolean context. Everything else is True. This matters a lot in AI code because you’ll often write conditions like if data: to check whether a list has any elements.
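The `if data:` idiom mentioned above looks like this in practice (a small sketch):

```python
# Truthiness in practice: `if data:` is shorthand for "is data non-empty?"
data = []
if data:
    print(f"Processing {len(data)} items")
else:
    print("No data to process")        # runs: an empty list is falsy

data = [0.5, 0.7]
if data:
    print(f"Processing {len(data)} items")  # runs: a non-empty list is truthy
```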

✍️ Try it yourself

What happens if you try int("hello")? Add a line to the code cell above and run it. You’ll get a ValueError — Python can’t turn arbitrary text into a number, and it tells you so loudly.
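If you want your program to survive such bad input instead of crashing, you can catch the error. This is a peek ahead (try/except is covered properly in Notebook 02), but the sketch below shows the idea:

```python
# Sketch: converting possibly-invalid text to a number without crashing
raw = "hello"
try:
    value = int(raw)
except ValueError:
    value = None       # fall back to the "nothing yet" placeholder

print(value)           # None - the conversion failed gracefully
```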


2. Operators and Expressions

Operators are the verbs of programming — they do things to values. Just as you use + in everyday arithmetic, Python provides operators for addition, subtraction, comparison, and logical reasoning. In AI, operators are everywhere: computing losses, updating weights, checking whether a model improved, and deciding what to do next.

There are three families of operators you’ll use constantly:

  1. Arithmetic operators (+, -, *, /, //, %, **) — do math.
  2. Comparison operators (<, >, <=, >=, ==, !=) — compare values and return True or False.
  3. Logical operators (and, or, not) — combine boolean conditions.

Let’s start with arithmetic:

# Arithmetic operators
a, b = 17, 5
print(f"{a} + {b} = {a + b}")    # Addition
print(f"{a} - {b} = {a - b}")    # Subtraction
print(f"{a} * {b} = {a * b}")    # Multiplication
print(f"{a} / {b} = {a / b}")    # True division (always float)
print(f"{a} // {b} = {a // b}")  # Floor division
print(f"{a} % {b} = {a % b}")    # Modulo (remainder)
print(f"{a} ** {b} = {a ** b}")  # Power

# AI-relevant example: manual gradient descent step
weight = 0.5
gradient = 0.1
lr = 0.01
weight = weight - lr * gradient
print(f"\nUpdated weight: {weight}")

What just happened?

The first block shows every arithmetic operator Python offers. Pay special attention to two of them:

  • / (true division) always returns a float, even if the result is a whole number. 10 / 5 gives 2.0, not 2.
  • // (floor division) divides and then rounds down to the nearest integer. 17 // 5 is 3, not 3.4.

The last few lines show a simplified gradient descent step. Gradient descent is the core algorithm behind training almost every AI model. Here’s the idea in plain English: your model has a weight (a number it uses to make predictions). After each prediction, you measure how wrong it was and compute a gradient (the direction and size of the error). Then you nudge the weight a tiny bit in the opposite direction. The learning rate (lr) controls how big that nudge is. The line weight = weight - lr * gradient is the single most important equation in modern AI.
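The single update shown in the cell can be repeated in a loop. Here is a toy sketch; the gradient formula below is invented purely for illustration (in real training it comes from measuring the model's error):

```python
# Toy gradient descent: repeatedly nudge `weight` toward 0.0
weight = 0.5
lr = 0.1

for step in range(5):
    gradient = 2 * weight              # made-up gradient that points away from 0
    weight = weight - lr * gradient    # the core update rule, applied repeatedly
    print(f"Step {step + 1}: weight = {weight:.4f}")
```

Each pass shrinks the weight a little; run it and watch the value decay toward zero.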

Common Mistake — / vs //:
If you use / when you meant //, you’ll get a float where you expected an integer. This can cause subtle bugs, especially when using a result as a list index (indices must be integers). For example, my_list[len(my_list) / 2] will crash — you need my_list[len(my_list) // 2].

Comparison and Logical Operators

Comparison operators ask a yes-or-no question and return True or False. In AI, you compare metrics constantly: “Did the loss go down? Is the accuracy above our threshold? Has the model improved?” Logical operators let you combine those questions: “Do we have a GPU and enough memory?”

# Comparison operators (return True/False)
loss_current = 0.35
loss_previous = 0.42
threshold = 0.01

print(f"Loss improved? {loss_current < loss_previous}")
improvement = loss_previous - loss_current
print(f"Improvement: {improvement:.4f}")
print(f"Significant improvement? {improvement > threshold}")

# Logical operators
has_gpu = True
large_model = True
enough_memory = False

can_train = has_gpu and large_model and enough_memory
should_try = has_gpu or not large_model

print(f"\nCan train? {can_train}")
print(f"Should try? {should_try}")

What just happened?

We checked whether the current loss is lower than the previous loss (it is — 0.35 < 0.42 is True). Then we calculated the actual improvement and asked whether it exceeded our threshold. These are exactly the kinds of checks that happen inside a training loop.

For the logical operators, notice that can_train is False because all three conditions must be True for and to return True, and enough_memory is False. Meanwhile, should_try is True because or only needs at least one condition to be True.

✍️ Try it yourself

Change enough_memory to True in the cell above and re-run. What happens to can_train?


3. Strings

In AI, most of the world’s data is text. Every prompt you send to an LLM, every label in a dataset, every file path, every log message — all strings. Natural Language Processing (NLP) is an entire subfield of AI dedicated to understanding and generating text. Master string manipulation now and you’ll save yourself hours of debugging later.

A string in Python is simply a sequence of characters wrapped in quotes. You can use single quotes 'hello', double quotes "hello", or triple quotes """hello""" for multi-line text. They all create the same type of object.

# String creation
prompt = "Explain quantum computing in simple terms"
system_msg = 'You are a helpful AI assistant'
multiline = """This is a multi-line string.
Useful for prompts, docstrings, and templates.
Very common in LLM applications."""

print(prompt)
print(f"Length: {len(prompt)} characters")
print(f"Words: {len(prompt.split())}")

What just happened?

We created three strings using three different quote styles. len() tells us how many characters are in the string, and .split() breaks the string into a list of words (splitting on spaces by default), so len(prompt.split()) counts the words. These two operations — character count and word count — are among the first things you’ll do in any NLP pipeline.

f-strings: The Modern Way to Format Text

An f-string (formatted string literal) lets you embed variables and expressions directly inside a string by putting an f before the opening quote and wrapping expressions in curly braces {}. This is Python’s recommended way to build strings that include dynamic data. Let’s start with a simple example before moving to a more realistic one:

name = "Alice"
age = 30
print(f"My name is {name} and I am {age} years old.")
# Output: My name is Alice and I am 30 years old.

Inside the curly braces, you can also add format specifiers to control how numbers appear. For example, :.4f means “show 4 decimal places” and :.2% means “format as a percentage with 2 decimal places.” These are invaluable for clean training logs.

# f-strings: the modern way to format strings in Python
model = "llama-3"
epoch = 5
loss = 0.0234
accuracy = 0.9567

# Clean formatting for logs
log_line = f"[{model}] Epoch {epoch:03d} | Loss: {loss:.4f} | Acc: {accuracy:.2%}"
print(log_line)

# String methods you'll use constantly
text = "  Hello, World! This is Berta.  "
print(f"Strip:    '{text.strip()}'")
print(f"Lower:    '{text.strip().lower()}'")
print(f"Replace:  '{text.strip().replace('World', 'AI')}'")
print(f"Split:    {text.strip().split()}") 
print(f"Starts:   {text.strip().startswith('Hello')}")
print(f"Find 'Berta': position {text.find('Berta')}")

What just happened?

The log line demonstrates several format specifiers at once: {epoch:03d} pads the epoch number with leading zeros to three digits (so epoch 5 shows as 005), {loss:.4f} shows the loss to four decimal places, and {accuracy:.2%} converts 0.9567 to the string 95.67%. You’ll write lines like this dozens of times in any training script.

The string methods below are your everyday text-processing toolkit:

  • .strip() removes leading and trailing whitespace (dirty data from files often has extra spaces).
  • .lower() converts to lowercase (essential for consistent text matching).
  • .replace() swaps one substring for another.
  • .split() breaks a string into a list of pieces.
  • .startswith() checks how a string begins.

String Slicing

Slicing lets you extract a piece of a string (or a list, or an array — the syntax is the same everywhere in Python). The key insight is to think of indices as positions between characters, not on them:

 M  a  c  h  i  n  e     L  e  a  r  n  i  n  g
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

The syntax is text[start:stop:step]. start is inclusive, stop is exclusive, and step controls direction and skip. Negative indices count from the end: -1 is the last character, -2 is the second to last, and so on.

# Slicing: extracting parts of strings (and later, lists and arrays)
text = "Machine Learning"

print(f"First 7:     '{text[:7]}'")
print(f"Last 8:      '{text[-8:]}'")
print(f"Middle:      '{text[3:10]}'")
print(f"Every 2nd:   '{text[::2]}'")
print(f"Reversed:    '{text[::-1]}'")

# Practical: parsing file paths
filepath = "models/checkpoints/epoch_005_loss_0.023.pt"
filename = filepath.split("/")[-1]
extension = filename.split(".")[-1]
parts = filename.replace("."+extension, "").split("_")

print(f"\nFilename: {filename}")
print(f"Extension: {extension}")
print(f"Parts: {parts}")

What just happened?

The first half shows the main slicing patterns:

  • text[:7] takes from the beginning up to (but not including) index 7.
  • text[-8:] takes from 8 characters before the end through the end.
  • text[::2] takes every second character.
  • text[::-1] gives the entire string reversed (step of -1).

The filepath parsing example is something you’ll do regularly in AI projects. Model checkpoints are typically saved with structured filenames like epoch_005_loss_0.023.pt. By splitting on / and _, we can extract the epoch number and the loss value programmatically — useful for finding the best checkpoint across hundreds of saved files.
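Once the filename is split apart, you usually convert the numeric pieces back into real numbers so you can compare checkpoints. A sketch continuing from the parts produced above:

```python
# Turn the split checkpoint name back into numbers
parts = ["epoch", "005", "loss", "0.023"]   # as produced by the split above

epoch = int(parts[1])      # "005" -> 5 (int() handles leading zeros)
loss = float(parts[3])     # "0.023" -> 0.023

print(f"Epoch {epoch}, loss {loss}")
```

With the values as numbers, finding the checkpoint with the lowest loss becomes a simple comparison.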

✍️ Try it yourself

Given the string "deep_learning_model_v2.pth", can you extract just "v2" using slicing or splitting? Try it in the cell above.


4. Control Flow

Programs need to make decisions, just like you decide which route to take when driving. Should the model save a checkpoint? Was the accuracy high enough? Is the data valid? Control flow statements — if, elif, and else — are how Python makes choices.

A distinctive feature of Python is that it uses indentation to show which code belongs together. Most languages use curly braces {} for this, but Python uses whitespace (typically 4 spaces). Everything indented under an if statement is the code that runs when the condition is True. This forces your code to be visually clean, which is a feature, not a bug!

The basic pattern looks like this:

if condition_a:
    # runs if condition_a is True
elif condition_b:
    # runs if condition_a was False AND condition_b is True
else:
    # runs if ALL conditions above were False

Let’s see a practical example. What does this code decide? It classifies a model’s accuracy into a human-readable verdict:

# if / elif / else: making decisions
accuracy = 0.87

if accuracy >= 0.95:
    verdict = "Excellent — production ready"
elif accuracy >= 0.85:
    verdict = "Good — consider fine-tuning"
elif accuracy >= 0.70:
    verdict = "Acceptable — needs improvement"
else:
    verdict = "Poor — revisit approach"

print(f"Accuracy: {accuracy:.0%} -> {verdict}")

What just happened?

Python checked each condition from top to bottom and stopped at the first one that was True. Since 0.87 >= 0.95 is False but 0.87 >= 0.85 is True, it assigned the “Good” verdict and skipped the rest. The elif conditions are evaluated in order, so only one block ever runs.

This pattern is very common in AI: after evaluating a model, you might take different actions depending on how well it performed. Below is a more realistic example.

Model Selection Logic

What does this code decide? Given a task, a dataset size, and whether labels are available, it recommends which AI approach to use. This kind of decision tree mirrors how experienced ML engineers think about problems.

# Practical: model selection logic
task = "text-classification"
dataset_size = 5000
has_labels = True

if task == "text-classification" and has_labels:
    if dataset_size > 10000:
        approach = "Fine-tune a pre-trained transformer"
    elif dataset_size > 1000:
        approach = "Few-shot learning with LLM + fine-tune small model"
    else:
        approach = "Zero-shot or few-shot with LLM API"
elif task == "text-classification" and not has_labels:
    approach = "Unsupervised clustering or LLM-based labeling"
else:
    approach = "Evaluate task requirements first"

print(f"Task: {task}")
print(f"Dataset: {dataset_size} samples, labeled: {has_labels}")
print(f"Recommended: {approach}")

What just happened?

This code contains nested if statements — an if inside another if. The outer level asks “what kind of task is this and do we have labels?” The inner level refines the recommendation based on how much data we have. These decisions matter because in real ML work, the right approach depends heavily on your specific situation:

  • Lots of labeled data (>10k)? Fine-tuning a pretrained model is usually best.
  • Moderate labeled data (1k–10k)? Combine LLM capabilities with a smaller fine-tuned model.
  • Very little data? Use a large language model with zero-shot or few-shot prompting.
  • No labels at all? You need unsupervised methods or to generate labels first.

✍️ Try it yourself

Change dataset_size to 50000 and re-run. Then try has_labels = False. How does the recommendation change?


5. Loops

A loop repeats code, which is essential because AI involves processing thousands (or millions) of data points. Imagine you have a dataset with 50,000 images — you’re not going to write 50,000 lines of code to process each one. Instead, you write the processing logic once and loop over every image.

Python has two kinds of loops:

  • for loop — use when you know what you’re iterating over (a list, a range of numbers, etc.). This is the loop you’ll use 95% of the time.
  • while loop — use when you don’t know how many times you’ll repeat; you keep going until a condition becomes False.

A good rule of thumb: use for when you know how many times, use while when you don’t.

# for loop: iterate over a sequence
models = ["gpt-4", "claude-3", "llama-3", "gemini-pro"]

print("Available models:")
for i, model in enumerate(models, 1):
    print(f"  {i}. {model}")

# range: generate number sequences
print("\nTraining epochs:")
for epoch in range(1, 6):
    # Simulated loss that decreases
    loss = 1.0 / (epoch * 0.8 + 0.5)
    bar = '#' * int(loss * 30)
    print(f"  Epoch {epoch:02d} | Loss: {loss:.4f} |{bar}")

What just happened?

The first loop uses enumerate(), which gives you both the index and the value as you iterate. The 1 argument tells it to start counting at 1 instead of 0 (handy for human-readable output). The second loop uses range(1, 6) to generate the numbers 1 through 5 and simulates how a model’s loss typically decreases over training epochs. The visual bar chart made of # characters is a quick way to visualize progress in the terminal.

While Loops and Early Stopping

Early stopping is one of the most important techniques in AI training. Here’s the problem it solves: when you train a model, it gets better and better at the training data. But at some point, it starts memorizing the training data instead of learning general patterns — this is called overfitting. When overfitting begins, the model’s performance on new, unseen data (the validation loss) starts getting worse even though the training loss keeps going down.

Early stopping monitors the validation loss and says: “If the loss hasn’t improved for N epochs in a row (the patience), stop training and use the best weights we’ve seen so far.” We use a while loop here because we don’t know in advance which epoch training will stop at — it depends on the data.

# while loop: repeat until a condition is met
# Simulating early stopping in training
patience = 3
no_improve_count = 0
best_loss = float('inf')
epoch = 0

# Simulated losses
losses = [0.9, 0.7, 0.5, 0.48, 0.47, 0.475, 0.478, 0.479]

print("Training with early stopping:")
while epoch < len(losses) and no_improve_count < patience:
    current_loss = losses[epoch]

    if current_loss < best_loss:
        best_loss = current_loss
        no_improve_count = 0
        status = "improved"
    else:
        no_improve_count += 1
        status = f"no improvement ({no_improve_count}/{patience})"

    print(f"  Epoch {epoch + 1}: loss={current_loss:.3f} | best={best_loss:.3f} | {status}")
    epoch += 1

if no_improve_count >= patience:
    print(f"\nEarly stopping triggered at epoch {epoch}!")
    print(f"Best loss: {best_loss:.3f}")

What just happened?

Look at the simulated losses: [0.9, 0.7, 0.5, 0.48, 0.47, 0.475, 0.478, 0.479]. The loss improved steadily through epoch 5 (reaching 0.47), but then started getting slightly worse: 0.475, 0.478, 0.479. After 3 consecutive epochs with no improvement (our patience limit), training stopped automatically. Without early stopping, the model would have continued training on data where it was already overfitting.

Notice best_loss = float('inf') at the start. Setting the initial “best” to infinity guarantees that the very first loss will always be an improvement, regardless of its value. This is a common initialization trick.

Loop Control: break and continue

Sometimes you need to alter the normal flow of a loop:

  • continue skips the rest of the current iteration and jumps to the next one. Think of it like fast-forwarding past a bad song in a playlist — you don’t stop listening, you just skip that track.
  • break exits the loop entirely. It’s like turning off the music player — no more songs at all.

In AI data pipelines, continue is especially useful for skipping corrupted or invalid data points without crashing the entire process.

# Loop control: break, continue
data_points = [0.5, 0.7, None, 0.3, "invalid", 0.9, 0.1]

clean_data = []
skipped = 0

for item in data_points:
    if item is None or not isinstance(item, (int, float)):
        skipped += 1
        continue  # Skip invalid data
    clean_data.append(item)

print(f"Original:  {data_points}")
print(f"Cleaned:   {clean_data}")
print(f"Skipped:   {skipped} invalid entries")

What just happened?

Our raw data contains a mix of valid floats, a None value, and the string "invalid". The loop checks each item: if it’s None or not a number, continue skips it and moves to the next item. Only valid numbers make it through to clean_data. This pattern is the essence of data cleaning — real-world datasets are messy, and your code needs to handle that gracefully.

✍️ Try it yourself

Add a few more messy data points to the data_points list — try adding an empty string "", a boolean True, or a negative number -0.5. Which ones get filtered out and which ones make it through? Does the behavior match your expectations?


6. List Comprehensions

List comprehensions are a Python shortcut for creating new lists by transforming or filtering existing ones. They’re extremely common in data science and ML code because they’re concise and readable (once you’re used to them).

Let’s see the long way first, then the short way. This is a pattern called “transform every element in a list”:

# Long way (regular for loop)
raw_scores = [85, 92, 78, 95, 88, 73, 91]
result = []
for score in raw_scores:
    result.append(score / 100)

# Short way (list comprehension)
result = [score / 100 for score in raw_scores]

Both produce the same result. The comprehension just compresses three lines into one. The general pattern is:

[expression for item in iterable if condition]

Read it aloud: “Give me expression for each item in iterable, but only if condition is true.”

# Basic comprehension: transform each element
raw_scores = [85, 92, 78, 95, 88, 73, 91]
normalized = [score / 100 for score in raw_scores]
print(f"Raw:        {raw_scores}")
print(f"Normalized: {[f'{n:.2f}' for n in normalized]}")

# Filtering: only keep elements meeting a condition
high_scores = [s for s in raw_scores if s >= 85]
print(f"High (>=85): {high_scores}")

# Transform + filter in one line
labels = ["cat", "dog", "CAT", "Dog", "bird", "BIRD", "cat"]
unique_labels = list(set(label.lower() for label in labels))
print(f"\nOriginal labels: {labels}")
print(f"Unique (lower):  {sorted(unique_labels)}")

What just happened?

Three list comprehension patterns in action:

  1. Transform: [score / 100 for score in raw_scores] divides every score by 100. The equivalent for-loop would be three lines.
  2. Filter: [s for s in raw_scores if s >= 85] keeps only scores of 85 or above.
  3. Transform + deduplicate: We lowercased all labels and passed them into a set to remove duplicates, then sorted the result.

The label deduplication example is very relevant to AI: datasets often contain inconsistent labels ("cat", "Cat", "CAT") that all mean the same thing. Normalizing to lowercase before deduplicating is a standard preprocessing step.

Nested Comprehensions and Dictionary Comprehensions

You can nest comprehensions to create 2D data structures (matrices) and you can also build dictionaries with a similar syntax. Dictionary comprehensions use {key: value for ...} instead of [value for ...].

Here’s the equivalent for-loop for a dictionary comprehension, so you can see what it replaces:

# Long way
model_scores = {}
for model, score in zip(models, scores):
    model_scores[model] = score

# Short way
model_scores = {model: score for model, score in zip(models, scores)}

# Nested comprehension: creating a matrix (2D data)
# This pattern appears in data preprocessing all the time
rows, cols = 3, 4
matrix = [[row * cols + col for col in range(cols)] for row in range(rows)]

print("Matrix:")
for row in matrix:
    print(f"  {row}")

# Flattening a matrix (common operation)
flat = [val for row in matrix for val in row]
print(f"\nFlattened: {flat}")

# Dictionary comprehension: mapping data
models = ["gpt-4", "claude-3", "llama-3"]
scores = [0.92, 0.95, 0.88]
model_scores = {model: score for model, score in zip(models, scores)}
print(f"\nModel scores: {model_scores}")

best = max(model_scores, key=model_scores.get)
print(f"Best model: {best} ({model_scores[best]:.0%})")

What just happened?

The nested comprehension creates a 3×4 matrix (a list of lists). In AI, you work with matrices constantly — they represent images, weight matrices, attention scores, and more. The flattening operation [val for row in matrix for val in row] reads left to right: “for each row, for each value in that row, give me the value.”
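If the left-to-right reading of the flattening comprehension is hard to internalize, it helps to see the equivalent nested loop; the for clauses appear in exactly the same order:

```python
matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Comprehension...
flat = [val for row in matrix for val in row]

# ...is exactly this nested loop, with the clauses in the same order
flat_loop = []
for row in matrix:
    for val in row:
        flat_loop.append(val)

print(flat == flat_loop)   # True
```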

The dictionary comprehension {model: score for model, score in zip(models, scores)} pairs up model names with their scores using zip(). Then max(model_scores, key=model_scores.get) finds the model with the highest score — a one-liner you’ll use frequently to identify the best result.

Common Mistake — Readability:
Comprehensions are powerful, but if a comprehension gets longer than one line, consider using a regular for-loop instead. Readability counts! The goal is to write code that your future self (or a teammate) can understand in six months.
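For example, the comprehension below is legal but a lot to parse in one line; the loop version says the same thing more clearly (the model names and thresholds here are made up for illustration):

```python
scores = {"gpt-4": 0.92, "claude-3": 0.95, "llama-3": 0.78}

# Dense: legal, but hard to scan
good = [name.upper() for name, s in scores.items()
        if s >= 0.9 and not name.startswith("llama")]

# Clearer: the same logic as a plain loop
good_loop = []
for name, s in scores.items():
    if s >= 0.9 and not name.startswith("llama"):
        good_loop.append(name.upper())

print(good == good_loop)   # True
```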

✍️ Try it yourself

Write a list comprehension that takes raw_scores = [85, 92, 78, 95, 88, 73, 91] and produces a list of strings: "pass" for each score >= 80 and "fail" otherwise. Hint: you can use a conditional expression inside a comprehension: ["pass" if s >= 80 else "fail" for s in raw_scores].


7. Putting It Together: Mini Project

Let’s build something real. We’re going to create a simple text analyzer — a function that takes a piece of text and returns useful statistics about it. This is exactly the kind of preprocessing you’ll do constantly in Natural Language Processing (NLP): before feeding text to an AI model, you need to understand what you’re working with. How long is the text? How many unique words does it contain? What are the most common words?

This mini project combines almost everything we’ve learned so far: strings, loops, dictionaries, list comprehensions, f-strings, and functions. We’ll build it step by step.

def analyze_text(text):
    """Analyze a piece of text and return statistics."""
    words = text.lower().split()
    sentences = [s.strip() for s in text.split('.') if s.strip()]

    # Word frequency count
    word_freq = {}
    for word in words:
        clean = word.strip('.,!?;:"\'')  
        if clean:
            word_freq[clean] = word_freq.get(clean, 0) + 1

    # Sort by frequency
    sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)

    return {
        'char_count': len(text),
        'word_count': len(words),
        'sentence_count': len(sentences),
        'unique_words': len(word_freq),
        'avg_word_length': sum(len(w) for w in words) / max(len(words), 1),
        'top_words': sorted_words[:5],
    }


# Test with a sample text about AI
sample_text = """Artificial intelligence is transforming the world. 
Machine learning is a subset of artificial intelligence. 
Deep learning is a subset of machine learning. 
The field of AI is growing rapidly. 
AI systems can now understand language, recognize images, and make decisions."""

stats = analyze_text(sample_text)

print("Text Analysis Results")
print("=" * 40)
print(f"Characters:    {stats['char_count']}")
print(f"Words:         {stats['word_count']}")
print(f"Sentences:     {stats['sentence_count']}")
print(f"Unique words:  {stats['unique_words']}")
print(f"Avg word len:  {stats['avg_word_length']:.1f}")
print(f"\nTop 5 words:")
for word, count in stats['top_words']:
    bar = '|' * count
    print(f"  {word:>15} [{count}] {bar}")

What just happened?

Let’s walk through the analyze_text function piece by piece:

  1. Tokenization: text.lower().split() converts to lowercase and splits on whitespace. This gives us a list of words (tokens).
  2. Sentence splitting: We split on periods and filter out empty strings. This is a simplified sentence detector.
  3. Word frequency: We loop through all words, strip punctuation, and count how many times each word appears using a dictionary. The .get(key, 0) method returns 0 if the word isn’t in the dictionary yet — this avoids a KeyError.
  4. Sorting: sorted(..., key=lambda x: x[1], reverse=True) sorts word-count pairs by count, highest first.
  5. Return: We package everything into a dictionary of results.

This is a simplified version of what real NLP libraries like NLTK or spaCy do. In production, you’d handle punctuation more carefully, deal with contractions ("don’t" → "do not"), and use proper tokenizers. But the core logic is the same.
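As a small taste of that more careful handling, the standard-library re module can pull out runs of letters, digits, and apostrophes in one pass, keeping contractions intact. This is only a sketch, not what NLTK or spaCy actually do internally:

```python
import re

text = "Don't panic! AI isn't magic - it's math."

# Naive split keeps punctuation attached to words
print(text.lower().split())

# Regex: grab runs of letters/digits/apostrophes only
tokens = re.findall(r"[a-z0-9']+", text.lower())
print(tokens)   # ["don't", 'panic', 'ai', "isn't", 'magic', "it's", 'math']
```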

✍️ Try it yourself

Paste a paragraph of your own text into sample_text and re-run the analysis. Try text from different sources (a news article, a poem, a technical document) and compare the statistics. What do you notice about word frequency patterns?


What’s Next?

You’ve learned the absolute basics of Python! In Notebook 02 (Intermediate), we’ll cover:

  • Lists, dictionaries, sets, and tuples in depth
  • Functions: parameters, returns, scope, *args/**kwargs
  • Error handling with try/except
  • Working with modules and imports

These are the tools you’ll use in every single AI project.



